UBC Theses and Dissertations: "Communication Optimization in Parsec: A Parallel Programming Environment for Multicomputers" (Sameer S. Mulye, 1994)


COMMUNICATION OPTIMIZATION IN PARSEC: A PARALLEL PROGRAMMING ENVIRONMENT FOR MULTICOMPUTERS

by

SAMEER S. MULYE

B.E. (Computer Engineering), University of Bombay, India, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

November, 1994

© Sameer S. Mulye, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

Multicomputers (distributed-memory MIMD machines) have emerged as inexpensive yet powerful parallel computers that promise high computational performance. A parallel application running on these machines consists of a set of user processes running on different processors. The only way processes share data in this environment is by message passing, which can become the bottleneck to obtaining high performance.

This thesis examines the issues influencing communication in multicomputers. Two static analysis tools are designed: a mapping tool and a runtime system for communication and synchronization. An efficient mapping of a parallel application onto a parallel computer is necessary to minimize the total communication cost. Delegating this task to the user makes application development a complex process, as the user must know the underlying hardware. An automated, hardware-independent mapping tool is designed to map arbitrary process graphs onto a transputer-based multicomputer.

A native runtime system on these multicomputers consists of message-passing primitives which are efficient but only support simple send and receive operations. A high-level runtime system provides functionality such as flow control, buffering, and group communication using a wide range of primitives. However, these primitives are difficult to use, and it is up to the user to select the optimal primitives for efficiency. Hence a message-passing library is designed for a transputer system. It is built on top of Trollius, an operating system for multicomputers, which offers a variety of functions, each with a different combination of performance and functionality. The library includes a few communication primitives which provide a common interface to all the Trollius functions while maintaining the efficiency of the low-level functions. The mapping tool statically compiles a program structure into the configuration information and, based on this information, the message-passing library selects the optimal Trollius functions.

Parsec (the Parallel System for Efficient Computation), a parallel programming environment for multicomputers, is being developed at UBC. Our tools are integrated with Parsec and become the backbone in achieving high performance.
Heterogeneity is one of the important design issues in Parsec. Currently these tools are available on two different hardware platforms: a transputer-based multicomputer and a cluster of Sun Sparc workstations. Thus the user can take advantage of either environment to write portable and efficient parallel applications.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgment
1 Introduction
  1.1 Problems
  1.2 Motivations
  1.3 Approach
  1.4 Thesis Context
  1.5 Thesis Organization
2 Related Work and Background
  2.1 Parallel Programming Environment
  2.2 Parallel Languages and Compilers
  2.3 Mapping tools
  2.4 Message-passing Libraries
  2.5 Background
3 System Design
  3.1 System Overview
  3.2 The Mapping Tool
  3.3 Runtime System
    3.3.1 Parsec Process Model
    3.3.2 Message Passing Primitives
4 Mapping Strategy on Multicomputers
  4.1 Hardware System
  4.2 Structural Overview
  4.3 Mapping Strategy
    4.3.1 Mathematical Formulation
    4.3.2 Mapping Constraints
    4.3.3 Algorithm
    4.3.4 Experimentation
5 Runtime System for Transputers
  5.1 Transputer Programming Model
  5.2 Low Level Software Tools
    5.2.1 Logical C
    5.2.2 Trollius
  5.3 Runtime System
    5.3.1 Implementation Details
  5.4 Performance Analysis
6 Programming A Network of Workstations
  6.1 Workstation Environment
  6.2 Mapping Tool
  6.3 Runtime System
    6.3.1 Internal Features
  6.4 Performance Analysis
  6.5 Program Development in Parsec
7 Conclusions
  7.1 Future Work
Bibliography
Appendices
A Transputer System Layout

List of Tables

1.1 Message Passing Timings
4.1 Wire Structures
4.2 Mapping Statistics
5.1 Trollius Function Invocation
5.2 Parsec: Message Passing Timings
6.1 Parsec: TCP Communication
6.2 PVM3: Daemon-mediated UDP Communication
6.3 PVM3: TCP Communication
6.4 Spring: Performance

List of Figures

1.1 Process Graph
1.2 Mapping
2.1 Parsec Objects
3.1 System Overview
4.1 Transputer Network (part of Component 2)
4.2 Structural Overview
4.3 Resources (part of Component 0)
4.4 System Graph
4.5 Conflict Graph for Figure 4.3
4.6 User Process Graph
4.7 Guest Graph
4.8 (a) Guest graph, (b) Host graph for case 1, (c) Host graph for case 2
5.1 Decision Tree
6.1 Process Graph for Spring in Parsec
6.2 Graphical Interface for Mapping Tool in Parsec
6.3 Graphical Interface for Loading Tool in Parsec
A.1 Component 1 (Box 1)
A.2 Component 0 (Box 2)
A.3 Component 3 (Box 3)
A.4 Component 2 (Box 4)
A.5 Component 2 (Box 5)
A.6 Transputer System as an 8x9 Mesh

Acknowledgment

I would like to thank both my supervisors, Dr. Samuel Chanson and Dr. Alan Wagner, for their support and guidance. I thank David Feldcamp, Mandeep Dhami, and H. V. Sreekantaswamy for their valuable assistance during my research work. My thanks go to Bill Gates for proofreading this thesis, Vishwa Ranjan for his continuous encouragement, and Raza Khan and Chris Romanzin for being there on those countless nights. I am grateful to Dr. Chanson and to the Department of Computer Science at the University of British Columbia for providing the financial support. Finally, I would like to express my acknowledgments to my parents, my constant source of inspiration.
Chapter 1

Introduction

Parallelism is one way to achieve computational speedup for a large class of applications. A wide range of parallel computer architectures [59] are available to exploit parallelism. The most basic distinction between these architectures is whether the processors synchronously execute the same instruction on multiple data (SIMD) or asynchronously execute multiple instructions on multiple data (MIMD). MIMD machines are grouped into two categories according to their address-space organization [2]: message-passing architectures and shared-address-space architectures. In a message-passing architecture, each processing element (PE) consists of a processor and its local memory, and the PEs are interconnected by a high-speed network. MIMD message-passing computers are commonly referred to as multicomputers. In a shared-address-space architecture, all processors have read and write access to a shared address space. Processors interact by modifying data objects stored in the shared space. These computers are referred to as multiprocessors.

There are several factors that motivate the use of multicomputers for parallel computing. A multicomputer can be built out of inexpensive and powerful microprocessors and can scale easily to accommodate a large number of PEs. The performance of these machines has improved over the years with increasing CPU speed and decreasing message-passing time. For these reasons, multicomputers are viewed as low-cost, scalable, and high-performance parallel computers [35]. Examples of multicomputers include the Intel iPSC, Transputers, the CM-5, and the Intel Paragon. In recent years, networked workstations (workstations in a local area network) have emerged as an effective alternative to dedicated parallel machines [37].

1.1 Problems

Unfortunately, it is difficult for a naive user to program a multicomputer. Many multicomputers provide low-level and hardware-dependent programming environments (for example, NX/2 for the iPSC and Logical C for the transputer). Programmers in these environments must know the details of the machine architecture, such as the number of processors, the network connection, the memory organization, the I/O processing, the operating system, and the communication and synchronization facilities. These programs guarantee high performance but at the cost of a complex design and development process. The hardware dependencies inherent in these programs also make them difficult to port to a different architecture. On the other hand, high-level programming environments [40, 42, 41] support common interfaces across different hardware platforms. These environments provide high-level abstractions to simplify program development, but their performance is poor. Thus there arises a trade-off between high-level and low-level programming environments. Developing a high-performance, high-level programming environment is a challenging task.

The primary focus of this thesis is on two performance-related issues associated with high-level programming environments for multicomputers: the mapping problem and a runtime system for communication and synchronization.

Mapping problem

The first step in executing a parallel application on a multicomputer is to assign each of its processes to an individual processor. The problem of assigning these processes in order to minimize the communication cost is known as the mapping problem [17, 29]. The process assignment can be static or dynamic.
In static mapping, processes are assigned to processors prior to any program execution; in dynamic mapping, processes are dynamically reassigned to processors in response to changing load on the processors and the network. The way in which a mapping algorithm allocates the processes directly affects the program execution time. For example, if two communicating processes are assigned to two non-adjacent processors, then messages are delayed because of extra copying at the intermediate nodes. Similarly, assigning the processes to the same processor serializes the communication and hence slows down the execution. The problem of finding an optimal task assignment is known to be NP-complete [3]. This problem is even more difficult for reconfigurable multicomputers. In these multicomputers, the interconnection network can be changed using control switches that are configured by software. A good mapping requires not only an efficient algorithm but also an elegant way to formulate the problem. The difficulty lies in formulating the dynamic characteristics of the interconnection network.

Runtime system

Multicomputers support task-level parallelism based on a message-passing paradigm. A native runtime system provides message-passing primitives for communication and synchronization which ensure high performance. Usually these primitives support only point-to-point send and receive operations. Higher-level message-passing systems, built on top of the native runtime system, provide flow control, buffer management, routing, and group communication. However, with these added functionalities, the high-level primitives are difficult to use. Given a wide range of primitives, it is up to the user to select the optimal message-passing primitives.

Ideally, one would like to tailor the message passing to the type of communication, so that neighbour-to-neighbour, non-multiplexed processes can use the lowest-level, fastest mechanism, whereas higher-level multiplexing and routing is reserved only for the messages that need it. However, this information is not available until after the mapping is done. Such an optimization cannot be done in the application because the user code would need to be changed whenever the mapping changes. A runtime system should provide this type of optimization.

Apart from these performance-related problems, we also address portability. The tools which are hardware-dependent should be designed for all the target architectures and should present a generic user interface for application development across multiple platforms. Once an application is developed for a particular MIMD architecture, it cannot be run on other MIMD architectures because the tools and techniques used to parallelize the code are specific to a particular hardware platform. As a result, considerable effort is necessary to port code to other platforms. This will certainly discourage the user from developing applications for different parallel architectures.

1.2 Motivations

We first identify the relationships between the aforementioned problems and the target machine architecture and its native runtime system.

Target Hardware and Software Systems

Both the mapping problem and the runtime system are discussed in the context of a reconfigurable transputer-based multicomputer. The multicomputer consists of 75 INMOS T800 transputers and 10 C004 programmable crossbar switches. Each crossbar switch has 32 data links and a control link.
Any pair of data links can be connected by sending an appropriate software command to the control link. The system can thus be reconfigured into a variety of interconnection networks. There are two software tools available on the transputer system at UBC: Trollius [9] and Logical C [7, 8]. Logical C provides hardware-level message-passing primitives. The Trollius message-passing primitives resemble the OSI model of protocol architecture: different primitives are available at the physical level, data-link level, network level, and transport level, in order of increasing functionality and decreasing efficiency.

The mapping problem is characterized by various factors such as the parallel program structure, the computational cost of a task, the communication cost among the tasks, the number of processors, the processors' connectivity, the speed of an individual processor, and the communication cost between the processors. It is independent of the machine architecture and the native runtime system. The objective of the message-passing library, built on top of the native runtime system, is to provide high-level, easy-to-use primitives without compromising performance. The optimization techniques employed by such a system depend on the native runtime system as well as on the machine architecture.

There are several reasons for having a mapping tool allocate tasks in order to minimize the communication overhead.

1. Assignment of the processes to the processors should be simplified. Many programming environments like PVM [42], TOPSYS [44], and Trollius, and languages like Occam and Logical C, leave this up to the programmer. Although this is not difficult when there are only a few processes, it becomes more tedious as the number of processors and processes grows.

2. Different programming paradigms require different process topologies. In order to map these different topologies by hand on a particular parallel architecture, the user must know the underlying hardware interconnection. But such topology independence is easier when the mapping is performed automatically by the software system.

3. A reconfigurable network makes the mapping problem more difficult. One can enhance performance by using a mapping tool that can do better than a naive user.

4. A variety of runtime systems are available to support different functionalities for different types of communication. We sought to take advantage of these functionalities while at the same time providing an easy-to-use programming interface. Different communication primitives affect performance differently. The mapping should be done in such a way that the total communication cost is minimized. This allows one to take advantage of the lowest-level (and most efficient) communication primitives. However, such optimization by hand is not desirable for the programmer because every time the program structure changes, the mapping information changes, and so does the optimization code. Hence both the mapping and the optimization should be done automatically.

Example

In order to motivate the problems associated with optimizing communication, we present an example which shows how mapping affects the optimization and what overheads are associated with the Trollius communication primitives and the Logical C channel primitives. A process graph is mapped onto the transputer system, and the message-passing time between every pair of processes is measured. Figure 1.1 shows a five-node complete graph.
This graph is mapped onto component 0 of our transputer system (Figure A.2), which has 25 transputers, where transputer T0 is directly connected to the host machine (a Sun Sparc machine). Component 0 has two rings of 8 transputers each (T0-T7 and T16-T23) and one chain of 9 transputers (T8-T15, Ta). The transputers within a ring or a chain are connected by hard links, whereas two transputers across a ring and a chain can be connected over the crossbars. The mapping of the process graph onto component 0 is shown in Figure 1.2.

[Figure 1.1: Process Graph. Legend: S: host process; Pi: transputer processes.]

The dilation of an edge in the process graph is defined as the number of physical links the edge is mapped onto. Edges eSP1, eSP2, eSP3, eSP4, and eP0P4 have dilation greater than one. Logical routes include all the physical links between every pair of communicating processes. Depending on the mapping done (which determines the dilation of the edges in the process graph and the physical links that are multiplexed), a specific Trollius function can be used. For example, the two processes P0 and P1 are mapped onto the adjacent transputers Ta and T15 respectively. However, the physical link between Ta and T15 is multiplexed over four different edges in the process graph. Hence the physical-level or data-link-level Trollius primitives or the Logical C channel primitives cannot be used for communication between P0 and P1. One is then limited to the network-level or transport-level Trollius primitives.

Table 1.1 shows the message-passing time between every pair of processes for the given mapping. In this experiment, 1000 messages, each of 4K bytes, were sent from the source node to the destination node. The round-trip time is measured for message passing between every process pair. Two communication levels are used: network level and hardware level. At the hardware level, Logical C primitives are used instead of physical-level Trollius primitives to minimize the delay. (To use Logical C primitives from within Trollius, it is necessary to set up a virtual circuit using data-link-level or network-level Trollius primitives. This is described in detail in Chapter 5.)

[Figure 1.2: Mapping. Legend: physical links; edges in the process graph with dilation > 1; logical routes for the dilated edges.]

Table 1.1: Message Passing Timings

  Source node   Dest node   Communication level   Time (s)   Transmission rate (MBytes/s)
  P0            P1          Network                6.060     0.675
  P1            P2          Hardware               4.864     0.842
  P2            P4          Hardware               7.025     0.582
  P1            P4          Network                7.633     0.536
  P0            P4          Network               13.318     0.307

Process pairs P0-P1 and P1-P2 communicate over hard links (two transputers directly connected to each other), whereas P2-P4 and P1-P4 communicate over the crossbars (in this case, transputer links are connected to a common crossbar switch). The transmission rate for P1-P2 is higher than that for P0-P1 because the hardware-level primitives add little overhead compared to the network-level primitives (network-level primitives incur the cost of buffering, routing, and packetization). Also, communication over a crossbar is slower than communication over a direct link (note the times for P0-P1 and P1-P4, or for P1-P2 and P2-P4). Edge eP0P4 is dilated, and hence a message travels over multiple links, which results in poor performance (i.e., a higher message-passing time between P0 and P4).
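The rate column in Table 1.1 appears to be the one-way data volume divided by the measured round-trip time; this reading is our inference rather than a statement from the thesis, assuming 4K = 4096-byte messages and 1 MByte = 10^6 bytes:

\[
\frac{1000 \times 4096 \text{ bytes}}{6.060 \text{ s}} \approx 0.676 \text{ MBytes/s},
\qquad
\frac{1000 \times 4096 \text{ bytes}}{4.864 \text{ s}} \approx 0.842 \text{ MBytes/s},
\]

which agrees with the 0.675 and 0.842 entries reported for P0-P1 and P1-P2.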
It is possible to use only the network-level functions in the above example without any need to know the mapping details. However, to achieve better performance it is necessary to select the optimal communication primitives based on the configuration information (which includes the dilation of the edges and the multiplexing over the physical links) available from the mapping process.

Workstation Environment

A network of workstations has evolved into a viable programming environment for parallel applications [37]. A variety of systems are available (§2.1) to program in a networked environment. Some of the motivations for using networked workstations for parallel and distributed computing are:

1. They provide inexpensive computing resources to develop parallel applications and are capable of delivering the performance of a parallel computer [47].

2. They are easy to use, and many software tools are available in a familiar environment for program development, such as optimizing compilers, debuggers, and performance monitors.

3. Network resources can be easily shared. These include peripheral devices, the distributed file system, local memory, and the computing power of individual workstations.

4. If a workstation remains idle, then its otherwise wasted CPU cycles can be harnessed to improve performance.

1.3 Approach

The main objectives of the mapping tool are:

• It should be topology independent, i.e., independent of changes in the underlying hardware and runtime system.

• It should assist the user in generating a good mapping. In an ideal mapping, every pair of communicating processes maps onto adjacent processors such that the communication cost is minimized. Usually this mapping is not feasible, and hence the mapping tool attempts to optimize the mapping with respect to several criteria.

• Routing determines the path (one or more physical links) between every pair of communicating processes. It is optimal if communication takes place along the shortest path and no two messages are multiplexed over the path. In the case where dilation is greater than one, the mapping tool should generate optimal routing between all non-adjacent processes so as to use the most efficient message-passing level, i.e., it should, where possible, use the lowest level of communication.

The mapping problem is formalized using a graph-theoretic model [20, 21, 24, 18]. A parallel program structure is represented as a process graph where each node denotes a user process and an edge indicates that the two processes it joins communicate with each other. Thus parallelism is explicitly specified in the graph. Similarly, the hardware architecture is represented as a processor graph where each node denotes a processing element (PE) and each edge indicates the communication link between two PEs. With this formalization, the mapping problem is reduced to mapping a process graph onto a processor graph. A mapping algorithm is developed that combines heuristic search with backtracking. The mapping takes place in three steps: contraction, assignment, and routing. The mapping information is then passed to the runtime system, where it is used to optimize the communication.

A high-level runtime system has been designed on top of Trollius. It includes a message-passing library to provide a consistent user interface to all of the Trollius message-passing functions while maintaining the efficient use of the low-level functions.
The runtime system supports point-to-point and collective communication (scatter/gather operations). The mapping tool statically compiles the program structure into the configuration information. Based on this information, the runtime system selects the optimal Trollius primitives for communication. The optimization is entirely transparent to the user. However, the semantics of the message passing, such as synchronous vs. buffered communication, are provided by the user.

In the workstation environment, the mapping tool and the runtime system were ported to a network of Sun Sparc workstations. A static load-balancing algorithm was developed to assign processes to the processors, and a runtime system was designed that uses TCP communication services. Both environments (the transputer system and the networked workstations) share the same user interface and program code, and thus location transparency is achieved (i.e., the user can develop an application in either environment and then transparently move the code).

1.4 Thesis Context

Developing a parallel application is a complex process. It includes several steps: program structuring, specifying parallel control and communication, mapping, program loading, runtime support, monitoring, visualization, and debugging. Ideally all of the above should be provided by one integrated environment. Parsec [32] is one such environment for programming multicomputers, which is being developed at UBC. It provides program modules (i.e., skeletons, templates) that are partially implemented virtual machines corresponding to a particular paradigm. Parsec has various constructs to define the structural information required by our optimization tools. It supports a graphical interface to create process graphs and provides three objects, process, port, and channel, to specify explicit parallelism in a program. The mapping tool and the runtime system for message passing are integrated with Parsec to optimize the communication within the environment. Parsec provides an X-window-based user interface to the mapping tool and also supports a hardware database which stores information such as the number of processors, processor names, processor speeds, and connectivity. In the remainder of the thesis, these tools are discussed in the context of Parsec.

1.5 Thesis Organization

The remainder of this thesis is organized as follows. In Chapter 2, the relevant literature is reviewed. We briefly survey different programming environments, message-passing systems, and mapping tools. The Parsec objects which provide structural information to the mapping tool and the runtime system are also described in that chapter.

The mapping tool and the runtime system are integrated with Parsec. Although the tools differ in their implementation on each architecture, they possess the same functionality and user interface. This integration, along with the Parsec communication primitives, is described in Chapter 3. The Parsec mapping tool for a transputer system is presented in Chapter 4. The Parsec runtime system for a transputer system is described in Chapter 5; the native runtime systems (Trollius and Logical C) and the optimization techniques employed by the Parsec runtime system are described in that chapter. Chapter 6 discusses the tools with respect to the workstation environment. In order to demonstrate portability, we also present a parallel application that is developed in the Parsec environment and run on both architectures.
Conclusions are drawn in Chapter 7, followed by a brief discussion of future work.

Chapter 2

Related Work and Background

There are two distinct approaches to developing parallel programming systems. The first approach is to develop a programming environment that provides tools for the different phases of program development, such as defining parallel code and assigning processes to the parallel architecture. Usually in these systems, a standard sequential language is extended by providing message-passing libraries; examples include PVM [42], Topsys [44], Express [46], and Frameworks [47]. A second approach is to develop new parallel languages and compilers based on high-level programming paradigms. Compilers are responsible for detecting and managing parallelism; examples include Linda [48], Jade [51], and DINO [50].

In Parsec, the parallel programming environment developed at UBC, our goal is to obtain better performance and simplify code development. A mapping tool and runtime system were developed and integrated with Parsec to achieve these goals. Like other programming environments, Parsec helps the user to specify parallel program structures, binds processes to processors, and provides primitives for communication and synchronization. However, the Parsec approach differs from many tools in that the mapping tool statically compiles the program structure into information which can be used by the runtime system to select the optimal communication primitives supported by the native runtime system.

In this chapter we briefly survey the related work. In §2.1, we present various parallel programming environments with the main emphasis on communication-related issues. A compiler-based approach is described in §2.2. The scope of this thesis is limited to a mapping tool and runtime system. In §2.3, we discuss transputer mapping tools (namely, Marc [28], Oregami [31], and Prep-P [29]), followed by message-passing systems for multicomputers such as Zipcode [53] and Via [54] in §2.4. Finally, in §2.5, we describe the three Parsec objects, process, port, and channel, which are used to define the structural information required by our tools.

2.1 Parallel Programming Environment

PVM3

Parallel Virtual Machine (PVM) [42] is a software system that allows a heterogeneous network of parallel and serial computers to appear as a single computational resource. It consists of a set of user-interface primitives, a daemon that runs on every participating machine, and a library for process creation, management, communication, and synchronization, along with other housekeeping facilities. PVM was primarily designed for networked workstations; however, it has also been ported to the Intel iPSC/860 multicomputer. In the latter case, the native message-passing routines are used for inter-node communication. In PVM, the programmer is responsible for configuring the virtual machine. As yet, it does not provide load balancing, partitioning, or mapping. PVM supports a dynamically reconfigurable runtime system. It includes a few simple message-passing primitives for send/receive, broadcast, and multicast operations. It is the programmer's job to structure and implement parallelism and communication types such as point-to-point communication and group communication. The user is also responsible for buffer management and message packing/unpacking.
In terms of communication cost, the daemon on each machine adds to the cost, as a message destined for a remote machine is routed through the daemons. In general, no communication optimization is provided by the runtime system. In the most recent version of PVM (PVM3.x), the runtime system supports two types of communication: daemon-mediated UDP communication and direct TCP communication. Unlike PVM, the Parsec runtime system utilizes statically configured structural information to perform optimization. In the workstation environment, Parsec supports only TCP communication. In Chapter 6, the performance of our system is compared to that of PVM3.

TOPSYS

Tools for Parallel Systems (TOPSYS) [44] is an integrated, hierarchical tool environment for distributed multiprocessors. The main design goals of Topsys are ease of use and portability. It was originally implemented on an iPSC/860 and later ported to Sun Sparc workstations. Parsec followed a similar design approach in that it was originally implemented on a transputer system and later ported to networked workstations. Topsys consists of a parallel debugger, a performance analyzer, a graphical animation tool, and an interface for all of the above tools. It supports MMK (multiprocessor multitasking kernel), a parallel library for process creation and communication. Topsys provides three objects for communication and synchronization: processes, mailboxes, and events. Once the programs are written, the user must specify the connectivity of the objects and the mapping of these objects onto the processors. This information is supplied in a special text file called the mapping file. MMK provides only point-to-point communication, and no communication optimization is done in the system, as the communication objects are defined statically by the user. Good monitoring support is provided with back-end tools like a debugger and a performance analyzer. However, there are no front-end tools (tools to implement a parallel programming paradigm, specify parallelism, or perform task allocation and communication optimization) to help during program development.

Enterprise

Enterprise [40] is a graphical programming environment for distributed systems. It provides tools to design, code, debug, monitor, and execute parallel applications. Enterprise programs run on a network of workstations. A user attaches assets to the sequential code to specify parallelism in a program. The assets are similar to the templates or programming paradigms that represent high-level techniques for exploiting parallelism. The template-based approach was proposed in Frameworks [47], a precursor to Enterprise. To specify parallelism, Enterprise uses an analogy between the structure of a parallel program and the structure of a business organization. This analogy provides different types of assets (e.g., contracts, departments, individuals), each defining a different way of partitioning an operation across multiple processes. The system automatically inserts the code to handle communication and synchronization into a module based on its attached asset. It also loads the processes onto machines for execution and provides limited process migration and dynamic load balancing.

PIE

The Programming and Instrumentation Environment (PIE) [45], developed at CMU, is directed towards the generation of performance-efficient parallel programs.
Like Parsec, PIE provides a level of abstraction between the algorithm and the parallel architecture level. It is called an Implementation Machine (IM) and corresponds to a particular parallel programming paradigm such as master-slave or divide-and-conquer. The pre-coded structures (which encapsulate communication, synchronization, and data partitioning) are made available to the user in the form of templates. The user is required to provide only the sequential code. The main emphasis of PIE is on performance monitoring and debugging. PIE does not support tools for mapping and loading, and the programmer is expected to know the details of the hardware architecture and the underlying operating system. Message passing is supported by the IPC mechanism of the operating system. Templates in Parsec are similar to the assets in Enterprise and the Implementation Machines in PIE. However, the main difference is that Parsec also supports the mapping and loading tools and the runtime system for optimized communication.

2.2 Parallel Languages and Compilers

High-level parallel languages provide programming paradigms that hide the complexity associated with explicitly parallelizing the code. Such a language is built around a set of abstractions that express a parallel algorithm. The implementation of such a high-level language hides low-level details such as mapping and presents an operating-system- and hardware-independent interface to the user. Some parallel languages, such as Fortran 90 [49] and Linda [48], are extensions to existing sequential languages, while others, such as Jade [51] and DINO [50], are completely new.

Like parallel languages, parallelizing compilers also free programmers from the difficult task of writing low-level parallel code. They are responsible for extracting parallelism from the code and for effectively scheduling it on a target parallel machine. These compilers perform static optimization to generate efficient machine code. However, the compiler-based approach is restricted to low-level optimization in that only basic send and receive operations are supported. Another problem is that the compilers are not portable across different machines.

2.3 Mapping tools

Many good mapping algorithms have been proposed in the theoretical literature to map processes onto various fixed interconnection networks. They use a variety of search heuristics. For example, the pairwise interchange algorithm by Bokhari [17] used probabilistic jumps to solve a mapping problem on a specific array of processors (i.e., the finite element machine). A greedy algorithm by Chen and Gehringer [21] and a mapping scheme (initial assignment followed by pairwise exchange) by Lee and Aggarwal [20] are targeted at hypercube architectures. Simulated annealing [22] and a recursive divide-and-conquer strategy [23] based on Kernighan-Lin graph bisection heuristics for hypercubes were proposed by Ramanujam et al. In all these approaches, the topology of the target architecture was fixed, and the algorithms took advantage of this fact. There is another approach in which the mapping tools are designed to accommodate one or more mapping algorithms, each suitable for a particular topology type. Three such tools are described below.

MARC

MARC [28] is an automatic configuration tool for mapping parallel programs onto a transputer network.
It includes methods for load balancing and communication-optimized process distribution onto arbitrary topologies, as well as efficient and secure routing strategies. The tool accepts a program written in Occam [13] and transforms it into a configured program for a fixed network of transputers. It constructs a software graph from the Occam program and a hardware graph from the transputer interconnection, and attempts to optimize the mapping of the software graph onto the hardware graph. The actual mapping takes place on the entire transputer network defined by the hardware graph. The mapping algorithm uses a distributed method based on diffusion, where neighbouring processors exchange information about their load and communication cost and processes are moved based on some optimization condition. In the case of a reconfigurable transputer network, the desired topology must already be switched, and thus the user is expected to know the hardware configuration. The input program is required to be in the Occam language, which limits the scope of MARC to transputer networks.

Oregami

Oregami [31] provides automatic and guided mapping of parallel programs onto parallel architectures in order to achieve portability and maximal performance. The system has two key features: (1) an ability to take advantage of the regularity present in both the computation structure and the interconnection network, and (2) a desire to balance the user's knowledge and heuristics with efficient mapping algorithms. Oregami consists of three components:

1. LaRCS, a language that allows the user to express the input graph. The input graph specifies regularity in the communication topology (i.e., which tasks communicate with each other) and regularity in the execution and communication patterns over time (i.e., the phase behaviour of an algorithm).

2. Mapper, a library of mapping algorithms that allows canned mappings and the mapping of regular (tree, mesh, and ring) and irregular topologies.

3. Metrics, an interactive graphics tool that provides a way to perform manual mapping, where the user can specify a particular process-to-processor binding.

In Oregami, the hardware is assumed to be fixed and connected by some regular network topology. In contrast, our mapping strategy is aimed at a reconfigurable network (which itself is not regular (§4.1)) in order to map a variety of topologies.

Prep-P

Preprocessor for Poker (Prep-P) [30] is a mapping preprocessor that takes as input a set of user processes whose communication graph is described in a Graph Description Language (GDL). The mapping takes place in three phases: contraction, placement, and routing. GDL and a compiler (GraphCom) facilitate representation of the input graph in a layout-independent form. The algorithm used for contraction is a local neighbourhood search, and the algorithm for placement is a modified version of the Kernighan and Lin circuit placement algorithm [24]. Once contraction and placement take place, Prep-P provides a mechanism for multiplexing mapped processes over a single processor. In Prep-P, the target hardware is a CHiP machine, a configurable message-passing parallel architecture which can simulate a variety of network topologies. The dependency on a compiler makes Prep-P difficult to port to other hardware architectures. Tmap [36], a programming environment developed at UBC, uses Prep-P as a front-end tool for mapping on our transputer system.
It also includes a runtime system for communication built on top of Trollius [12]. Like Prep-P, mapping in Parsec takes place using the same three phases. But unlike Prep-P and Tmap, a new technique is used to generate the mapping (§4.3), and the mapping tool is more flexibly incorporated into the system.

2.4 Message-passing Libraries

Most multicomputers have hardware support for message passing. However, it often differs from machine to machine and can sometimes be complex to use. Message-passing libraries have been designed to provide performance and convenient functionality. Given the proliferation of message-passing standards, there has been an attempt to standardize the libraries. MPI [52] (Message Passing Interface) is a standard that includes a large number of functions to support various communication types and semantics.

It is necessary to design a platform-specific message-passing library which supports a few easy-to-use primitives for optimized communication. Zipcode [53] is one such system, which also influenced the MPI standard. It is a dynamically configurable runtime system which provides optimized communication on a variety of platforms including the nCUBE/2, iPSC/860, Delta, and CM-5 scalar machine, and a network of Sun Sparc and IBM RS/6000 machines. It introduces an object called a mailer that encapsulates groups, contexts, and virtual topologies and provides the appropriate scope for all communication operations in Zipcode. Parsec differs from Zipcode in that optimization is done statically and encoded in the form of load-time parameters to the message-passing primitives. Thus it avoids the overheads associated with dynamic optimization.

Via [54], a structured message-passing system, has a hierarchy of objects: task class, task set, and task. This hierarchy is useful in many ways. First, it allows the general properties of a task to be defined in libraries. Second, a group of tasks can be renamed to avoid name collisions, and third, it allows programs to contain groups of tasks which have the same properties but are addressed differently. Each task has a number of named ports. The task and its ports together provide the communication context. To optimize the communication, Via makes use of configuration information provided by the user. Parsec has a similar hierarchy of objects but, unlike Via, the configuration information is automatically generated by the mapping tool and used by the runtime system to optimize communication. In the following section, we describe the Parsec objects used to specify a parallel program structure.

2.5 Background

In Parsec, there is a clear distinction between sequential program code and the specification of parallelism. The specification captures two notions of communication: communication structure and communication semantics. Communication structure encapsulates declarative information such as which processes communicate with each other, their connectivity, and the communication intensity (the number of messages transferred). The semantics specify whether communication is synchronous or buffered. The mapping tool converts the communication structure into the detailed configuration information needed by the runtime system. The runtime system, based on this information and the user-specified semantics, optimizes communication (this is explained in Chapter 5).

Parsec provides a graphical user interface to specify the communication structure of a parallel program.
It uses three primitive types of objects: process, port, and channel. Process, a global object, is the smallest decomposable unit of computation in a parallel application. Port, a local object bound to a process, is used for communication on a channel, which is uniquely defined by a pair of process-port bindings. Figure 2.1 shows the relationship between these objects. Processes A and B each have two ports (Next and Prev) associated with them. The two processes communicate on a channel defined by the {A, Next; B, Prev} association. The port Prev of process A and the port Next of process B are unbound and hence cannot be used for communication.

[Figure 2.1: Parsec Objects. Visible labels include Process and Port (Unbound).]

Process

Process is the most widely used type of object that specifies the structure of a parallel program. It is an instance of a process type, another object provided by Parsec for process reuse. A process type has two kinds of information: construction information and process structure information. Construction information includes the source files, libraries, and runtime parameters that are needed to build an executable image of a process. Process structure information specifies the port names and the process weight or priority. A new process is created by specifying its type and by binding its named ports to channels. This logical division makes the high-level program structure independent of the actual process instances.

Port

In a parallel program, processes are often replicated (multiple instances of a process type) to form a program structure. As with non-replicated processes, these replicas must be bound with explicit connections to other processes. Changing the statically configured program structure requires that these connections often be changed. However, if the process is to remain replicable, its internal specification must remain independent of its external connections. The port object is used to facilitate this isolation. A port is a named variable whose value specifies a data structure that includes the configuration information generated by the mapping tool. Thus, the port hides the complexities of the underlying communication medium. Ports also encapsulate communication scope. A single port can be bound to several other ports and thus provides a simple way to express collective (one-to-many or many-to-one) communication.

Channel

Channels define the communication paths in a program. In Parsec, a channel itself has no name, but a channel end is defined by a process-port binding. Thus a channel is an association of one process-port pair connected to another process-port pair. Channels possess weights which denote the traffic density (the number of messages passing through the channel). These weights are used during mapping to determine the best possible way to map the channels onto the hardware architecture. A channel implementation is hardware-dependent. On multicomputers, channels are implemented over the physical links; on workstations, they are built on top of some layered protocol and use the service access points of that protocol (e.g., sockets on the TCP/IP protocol).

The process-port-channel model closely resembles the graph-theoretic approach used in the mapping tool. In a process graph, a node denotes the process-port binding and an edge denotes the channel. Similarly, the port internals provide the information required by the message-passing primitives for communication optimization.
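As an illustration only (the type and field names below are our own and do not appear in the thesis), the kind of information attached to processes, ports, and channels described above could be captured in C structures roughly like the following:

/* Hypothetical sketch of the process-port-channel information; names and
 * fields are illustrative assumptions, not Parsec's actual data structures. */

typedef struct Channel {
    int  event;       /* globally unique synchronization id from the mapping tool */
    int  weight;      /* traffic density: expected number of messages */
    int  route_len;   /* number of physical links on the logical route */
    int *route;       /* physical link ids assigned during the routing phase */
} Channel;

typedef struct Port {
    char     *name;         /* e.g. "Next" or "Prev" */
    int       n_channels;   /* a port may be bound to several channels (collective scope) */
    Channel **channels;     /* configuration information filled in at load time */
    int       send_timeout; /* user-controlled blocking parameter for sends */
    int       recv_timeout; /* user-controlled blocking parameter for receives */
    int       buffered;     /* buffered vs. synchronous semantics */
} Port;

typedef struct Process {
    char *type_name;        /* process type: source files, libraries, parameters */
    int   weight;           /* process weight or priority */
    int   n_ports;
    Port *ports;            /* named ports bound to this process */
} Process;

A channel end then corresponds to a (process, port) pair, matching the {A, Next; B, Prev} association of Figure 2.1.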
These objects are further discussed in that context in Chapters 4 and 5. Apart from these primitive objects, the Parsec interface has process groups, parameterized process graphs, and modules to support reusability, scalability, and ease of use [32, 38].

Chapter 3

System Design

The various programming systems discussed in the last chapter offer different tools and techniques for developing efficient parallel programs. However, the user is responsible for binding a parallel program structure to a parallel architecture. Also, the runtime systems provided by these programming environments do not support automatic communication optimization. In Parsec, the tools are designed to support these features. As described in §2.5, Parsec supports the process-port-channel model for specifying a parallel program. In this chapter, we present a tool that automatically maps this program structure onto a parallel architecture. We also introduce the runtime system, which provides a high-level message-passing library and uses the mapping information to optimize the communication.

In §3.1, we describe the integration of our tools with Parsec. This system presents a conceptual model that defines a generic tool interface and functionality for any type of multicomputer. The mapping tool is described in §3.2, and the different phases of the mapping process are defined in that section. In §3.3, the Parsec process model is defined and its message-passing primitives are described. Some of the features of the runtime system are directly influenced by the underlying operating system, the network management, and the hardware support for communication. However, the primitives described here are generic and hide platform-specific dependencies.

3.1 System Overview

The mapping tool and the runtime system are integrated with Parsec. Figure 3.1 shows their interaction with the Parsec environment. A database is used for data integration between the various tools. In Parsec, a parallel program is specified in the form of a process graph using a graphical interface. In a process graph, each node denotes a process, each edge denotes a channel, and the connection between an edge and a node specifies a port. This process graph, along with the hardware resources such as the number of processors, the processor type, the number of links, and the interconnection of the network, is stored in the database.

[Figure 3.1: System Overview. Visible labels include Parsec Database and Executable Image.]

The mapping tool takes as input the process graph and the hardware resources and performs the mapping in three distinct steps: contraction, assignment, and routing. Its output is the process-processor bindings and platform-dependent information. The platform-dependent information includes the communication paths between every pair of processes. For the transputer system, these paths specify transputer links, crossbar switches, and crossbar links. For networked workstations, the paths may specify bridges or routers. The mapping tool can be viewed as a static compilation tool that converts structural information into detailed platform-specific configuration information. This configuration information is stored in the database for further use.

A loader tool provides the database interface between the mapping tool, the runtime system, and the program execution. It passes the optimized configuration information in the form of load-time parameters to the runtime system.
The loader also generates the script to boot the network that is configured by the mapping tool and to load the application program onto the multicomputer.

The Parsec runtime system has a few simple message-passing primitives that are built on top of the native runtime system. These primitives hide the complexity associated with low-level communication while maintaining the efficiency of the low-level primitives. For the transputer system, the Parsec primitives select one of the many Trollius functions provided at the different levels (physical, data-link, network, and transport). In the case of workstations, they use an IPC mechanism supported by the operating system. The runtime system has two distinct phases: an initialization phase and an optimization phase. During initialization, the configuration information is converted into the data structures corresponding to the ports (§2.5) so that the optimal primitives can communicate on those ports. During the optimization phase, the port data structure is analyzed and the most efficient low-level function is selected. As we will see in Chapter 5, this requires that the physical links be reserved for the lowest level of message passing. As shown in Figure 3.1, the user can also specify communication semantics such as synchronous and buffered communication. A timeout mechanism is also provided that allows the user to control the blocking of every send and receive operation on a per-port basis. Finally, the optimized message-passing primitives are compiled and linked with the application program to form an executable image.

3.2 The Mapping Tool

Methodology

The mapping problem is related to two basic problems: task allocation and routing. Stone [25] proposed a network flow algorithm for task assignment in order to minimize interprocess communication. Various other approaches pursued are: simulated annealing [22], heuristics [26], a recursive bipartitioning method [23], and initial assignment followed by pairwise exchange [20] or boundary refinement [27]. Since obtaining optimal solutions is intractable, these algorithms are not guaranteed to produce the optimal solution. We have proposed a graph-oriented mapping strategy that uses a greedy algorithm combined with heuristics to find a solution. Greedy algorithms were used by systems like Prep-P and in the theoretical literature on mapping [17, 21].

The mapping problem is represented as one that relates a process graph, corresponding to a parallel program structure, to a processor graph, corresponding to a parallel architecture. A node in a process graph indicates a user process and an edge indicates a communication path between the two adjacent processes. The cost of communication between two processes is indicated by the weight associated with every edge of the process graph. In the case of non-adjacent nodes, the communication path is a sequence of such edges. Similarly, in a processor graph, a node represents a PE and an edge is the physical link between two PEs. Each link of a processor graph is bidirectional, so the underlying processor graph is undirected.

Functional Overview

The mapping process has the following phases: resource allocation, contraction, assignment, and routing. Dividing the mapping problem into different phases allows one to provide a different solution for each individual phase. Thus it is possible to implement different algorithms, each best suited to a particular task, for the different phases of the mapping process.
The same approach is taken by the Prep-P [30] mapping tool. However, the difference (with respect to the Parsec mapping tool) lies in the tasks performed in each of these phases. For example, the routing phase in the Parsec mapping tool analyzes the assignment and generates the configuration information which is required by the runtime system for communication optimization. The basic inputs to the mapping tool are the hardware resources and a process multigraph (a multigraph is an undirected graph that can have multiple edges between vertices and self-loops).

1. Resource Allocation

This module determines which ports (hardware ports, e.g., the port that connects a transputer to the host machine) and processors to use, and which processors are connected to the host machine (i.e., the root processors). The following information is required by this module: the different logical components of the system and how they are connected, the number of processors, the machine type of the processors, and the number of processes and their machine types. A hardware graph is built from this information and is used during assignment and routing.

2. Contraction

If the number of processes is more than the number of processors available, then more than one process will need to run on each processor. This task is performed by contraction. The process graph is contracted in such a way that the number of processes is less than or equal to the number of processors in the target machine. This is similar to partitioning a process graph of n nodes into m groups, m < n, such that a given cost function is minimized. An example cost function is the communication cost over the edges. Several algorithms are available to perform the contraction, including simulated annealing, local neighbourhood search, and branch-and-bound. However, to keep the mapping problem simple, we ask the user to specify which processes need to be assigned to the same processor. The input to this module is a process multigraph and the output is an abstract process graph in which each node is a virtual process representing one or more user processes. This is a many-to-one mapping from the user multigraph to the abstract process graph.

3. Assignment

This phase assigns a single virtual process to a single processor of the target machine. This is a one-to-one assignment. The assignment strategy is driven by the type of algorithm used for mapping, and its main objective is to minimize the total cost of communication in the process graph.

4. Routing

Once the node assignment is done, there are two levels at which routing takes place: physical routing and logical routing.

Physical routing: in a dynamically reconfigurable system, the interconnection network is defined by setting the control switches between the processor links. All the physical links between every pair of processors are statically configured. This results in a fixed interconnection graph.

Logical routing: from the fixed interconnection graph, the communication paths between all the non-local processes are determined. The channels of the abstract process graph are then assigned to the physical links in the interconnection graph.

5. Dumping Information

This phase gathers all the information generated in the previous phases and stores it in the Parsec database so that the loader tool and the runtime system can access it. This information includes physical wires, routes, the type of communication, and objects for synchronization.
The contracted user processes within a virtual process are invisible to the assignment and the routing phase. Hence their routes are generated separately and stored into the database. Mapping strategy is platform-dependent. We present a mapping strategy for the transputer system in Chapter 4 and for the workstation environment in Chapter 6. 3.3 Runtime System The Parsec runtime system is built on top of the native runtime system. The implemen-tation details are platform-dependent but the message-passing primitives are generic. We first present the Parsec process model which specifies the communication and synchro-nization characteristics and then describe the Parsec primitives. 3.3.1 Parsec Process Model As described in §2.5, a parallel program in Parsec consists of a set of processes. A process is an entity that performs internal, unobservable actions and also interacts with other processes. A n interaction implies synchronization among multiple processes. Such a synchronization is an observable action and may involve data exchange. The Parsec Chapter 3. System Design 33 process model allows multiple processes to synchronize and thus support one-to-many (scatter) and many-to-one (gather) type communication. The communication can be either synchronous or buffered. In the case of multi-process synchronization where a single process can receive data from more than one processes, the synchronization behaviour is non-deterministic in that the sender's address cannot be known in advance. The runtime system implements this model using processes, ports, and events that correspond to processes, ports, and channels — three static objects that are defined in Parsec. The approach is similar to that of Zipcode [53] and M P I [52] where it is charac-terized in terms of contexts of communication and communicators. A process is defined earlier in this section. A process group is a collection of processes and defines the scope of collective (scatter/gather) communication. It also provides an efficient mechanism to define and alter the characteristics of the processes, such as port bindings and process weights. We now describe the other two objects: events and ports. Events Events are the identifiers used for synchronization. They are globally unique integers, generated by the mapping tool. There is at least one event associated with every channel. The events are introduced in Parsec because they are required by the Trollius functions (described in Chapter 5). The Logical C functions do not use events. They access the communication channels directly. In the workstation environment, events are mapped to the port addresses used for communication over the sockets (described in Chapter 6). The user never manipulates these events but only observes the synchronization behaviour implied by the events. Chapter 3. System Design 34 Ports Ports are similar to mailers in Zipcode and communicators in M P I and define the context of communication. The ports encapsulate the configuration information generated by the mapping tool and make it available to the runtime system for optimization. The infor-mation includes the physical addresses of the processors that embed all the members of a process group and the events for every channel accessible through the port. The physical address and the events together provide scope for all the communication operations in Parsec. 3.3.2 Message Passing P r imi t ives The runtime system supports two message-passing primitives: port_send and port_recv. 
They are used to select the optimal communication primitives provided by the native runtime system. All ports are declared and bound to a process through Parsec's graphical interface. Before any message is sent or received, every user process must initialize all its ports. This is accomplished by the following function:

    init_ports(int argc, char **argv, Port *p1, Port *p2, ..., NULL);

where p1, p2, ... are the names of the ports that are bound to the process. The names can be either defined by the user or defined by Parsec. The function init_ports reads the configuration information passed by the loader tool and stores it into the data structures corresponding to the ports. Other initialization tasks are specific to the native runtime system. For example, in our transputer system init_ports registers a process with the Trollius kernel and allocates soft channels (described in Chapter 5) for local communication. Once the communication is over, a process exits with the following function call, which frees all the port structures (and, in the case of the transputer system, also de-registers the user process):

    exit_ports(Port *p1, Port *p2, ..., NULL);

Processes send or receive data on a specific port by invoking two functions: port_send and port_recv, respectively. Timeout parameters are used to control the blocking of send and receive operations. Two parameters, one for the send operation and one for the receive operation, are defined for each port. The user can also specify whether the communication is buffered or synchronous. The default options are a zero timeout (i.e., blocking until the send or receive returns successfully) and buffered communication. Like the configuration information (generated automatically by the mapping tool), these user-controlled parameters are encapsulated in the port structures. The syntax of the port functions is as follows:

    port_send(Port *port, void *data, int size, int msgtype);
    port_recv(Port *port, void *data, int *size, int msgtype);

where port is a pointer to the port to (or from) which the process is sending (or receiving) the data. The parameter msgtype is either the predefined token RAW or a user-defined token for XDR data structures. The data structures can be primitive data types or user-defined structures. Parsec automatically generates XDR code to pack and unpack the data structures and perform the byte-ordering when needed. If port_send and port_recv are called with a user-defined message type, then the XDR code does the appropriate data conversion. For all XDR messages, the size argument in both functions should be NULL. The data argument is a pointer to an instance of the data structure for the send function and a pointer to a variable of the message type for the receive function. If the data structures contain elements that are accessed through pointers, these are allocated by the XDR code when the size pointer is NULL. In the case of the RAW message type, the XDR code is bypassed. For these messages, the data argument is a pointer to an existing buffer for both the send and the receive function. The size argument is the actual size of the buffer in the case of a send; in the case of a receive, it is set to the number of bytes that are received.
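The following fragment sketches how a user process might drive these calls for a RAW message. It is a minimal illustration only: the port names north and south, the buffer sizes, and the header name are assumptions made for the example (ports are actually declared and bound through the Parsec interface); only the four calls and their arguments follow the descriptions above.

    #include <string.h>
    #include "parsec.h"     /* assumed header providing Port, RAW, and the port functions */

    int main(int argc, char **argv)
    {
        Port north, south;  /* hypothetical port names, bound to this process via the GUI */
        char out[64] = "hello";
        char in[64];
        int  insize;

        /* Read the configuration passed by the loader and set up both ports. */
        init_ports(argc, argv, &north, &south, NULL);

        /* RAW messages bypass the XDR code: size is the byte count to send. */
        port_send(&north, out, (int)strlen(out) + 1, RAW);

        /* For a RAW receive, insize is set to the number of bytes received. */
        port_recv(&south, in, &insize, RAW);

        /* Free the port structures (and de-register from Trollius on transputers). */
        exit_ports(&north, &south, NULL);
        return 0;
    }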
Features of the runtime system are summarized as follows:

1. The communication primitives are easy to use and portable across different hardware platforms.

2. They select the optimal communication primitives depending on the configuration information generated by the mapping tool. This type of optimization is platform-dependent and is described in §5.3.

3. They support point-to-point communication as well as scatter and gather types of communication.

4. They support data conversion for heterogeneous networks. Basic data types as well as user-defined complex data structures can be used.

5. Messages of arbitrary length are allowed, and packetization is done internally.

In Chapter 5, the runtime system for the transputer system is described. This includes the underlying communication primitives and the optimization techniques employed by the runtime system. Finally, the Unix implementation of the runtime system is discussed in Chapter 6.

Chapter 4

Mapping Strategy on Multicomputers

Following the general design of the tool given in §3.2, we describe our mapping strategy on multicomputers. The mapping algorithm has been implemented on our configurable transputer system. A brief overview of the transputer system in the Department of Computer Science at the University of British Columbia is presented in §4.1. In §4.2, we describe the structural design of the mapping tool. The structural design shows the different components of the mapping tool. These components provide a layer of abstraction on top of the hardware architecture and the parallel program structure. This allows the mapping to be done in a hardware-independent way. The mapping strategy is described in §4.3. We discuss various cost metrics, an objective function, a mapping algorithm, and the experimental results of our mapping tool.

4.1 Hardware System

The hardware system consists of 75 INMOS T800 transputers and 10 C004 programmable crossbar switches. Each transputer has a 32-bit, 10 MIPS CPU, a 64-bit FPU, 4 KB of on-chip RAM, and four bit-serial communication links. Each node is a single transputer with external (off-chip) memory: 64 nodes have one MByte, 10 nodes have two MBytes, and there is one node with 16 MBytes. Each crossbar switch has 32 data links and a control link. Any pair of data links can be connected by sending the appropriate software command to the control link. Crossbars can thus provide up to 16 bidirectional links.

The system is reconfigurable and supports a variety of interconnection networks. The transputer system is accessed by a Sun Sparc2 workstation through an S-bus card with four data ports. The nodes are partitioned into four reset components. Each component has one transputer directly connected to the host through a data port. Trollius, a single-user, multitasking operating system, runs on each node. This allows only one user to access any node at a given time. However, the logical partitioning allows up to four users to use the system concurrently. Once a particular component is allocated, the user has complete control over all the transputers in it. The entire transputer network resides in five physical boxes¹. Links 0 and 1 of most transputers are hardwired to one another, whereas links 2 and 3 are connected to the crossbar links. Some transputers in box 4 have all their links connected to the crossbars.

Figure 4.1: Transputer Network (part of Component 2)

There are six 8-node rings and one 8-node chain. The remaining transputers are either connected to the host machine or to the crossbars. Thus, the network is not regular.
Figure 4.1 shows the part of our transputer network that is configured as described above.

¹There are four components partitioned into five physical boxes, as shown in Appendix A.

4.2 Structural Overview

The different components of the mapping tool are shown in Figure 4.2. The components are connected by directed edges that show the dependencies between them. In the rest of the chapter we will use the terminology defined in this section to describe the various graph structures used in the mapping process.

Figure 4.2: Structural Overview

1. Hardware Database (HWDB)

The hardware database includes the machine-dependent information on the nodes and their connectivity. A node is either a processor, a crossbar, or a host machine. The node information includes the logical name of the node, its architecture type (either transputer, crossbar, or host machine), the component to which it belongs, the number of communication links, the memory size, and the processor speed. A transputer link is either hardwired to another transputer link or connected to a crossbar link. In the latter case, connections between the transputers depend on the crossbar settings. Thus, multiple physical links (Figure 4.3) are possible between two transputers, depending on the number of hardwired links and the number of crossbar switches. All such transputer interconnections are stored in the database. Given two transputers and their links, a database query can determine whether there exists a path between them.

2. Resource Allocation Module (Resources)

This module retrieves information from the hardware database. The module allocates the resources from the component requested by the user. The resources include the number of wires (network connections), a data structure for each wire (which includes transputer names, transputer links, current crossbar settings, and the weight on each wire), the names of all the processors (i.e., the transputers and the host machine), and the names of all the transputers that are connected to the host machine. Only these resources are used by the remaining phases of the mapping tool.

Figure 4.3: Resources (part of Component 0)

The four hardware components of our system are shown in Appendix A. As an example, consider component 0, which has 25 transputers (T0-T23 and Ta). In Figure 4.3, we have shown the connections between Ta, T15, and Xanadu (the host machine). The data structures for the wires, retrieved by the resource allocation module, are given in Table 4.1.

    Wire #  PE1     Lk1  PE2     Lk2  Wt  Xlen  Xname  Xlk1  Xlk2
    w1      Ta      1    T15     0    1   0     -      -     -
    w2      Ta      2    T15     3    2   1     xb7    16    15
    w3      Ta      3    T15     3    2   1     xb7    17    15
    w4      Xanadu  0    Ta      0    X   0     -      -     -
    w'1     T15     0    Ta      1    1   0     -      -     -
    w'2     T15     3    Ta      2    2   1     xb7    15    16
    w'3     T15     3    Ta      3    2   1     xb7    15    17
    w'4     Ta      0    Xanadu  0    X   0     -      -     -

    Table 4.1: Wire Structures

Each wire is unidirectional. If two transputers are hardwired, then xlen is 0, indicating that there are no crossbar connections. In the case of a positive xlen (i.e., the wire has intermediate crossbar links), the name of the crossbar switch and its links are stored in the data structure. Wt indicates the communication cost on a wire. For the crossbar links this cost is double the cost on a direct link.
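Read row-wise, each wire record carries the two endpoint processors and links, the weight, and the optional crossbar hop. A plausible C representation is sketched below; the field names simply mirror the table columns and are not taken from the actual Parsec sources.

    /* Illustrative only: one unidirectional wire as described by Table 4.1. */
    typedef struct {
        char pe1[16];     /* source processor (e.g. "Ta")                 */
        int  lk1;         /* link number used on pe1                      */
        char pe2[16];     /* destination processor (e.g. "T15")           */
        int  lk2;         /* link number used on pe2                      */
        int  wt;          /* communication cost (crossbar = 2 * direct)   */
        int  xlen;        /* number of intermediate crossbars (0 or 1)    */
        char xname[16];   /* crossbar name, e.g. "xb7", when xlen > 0     */
        int  xlk1, xlk2;  /* crossbar links used, when xlen > 0           */
    } Wire;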
The processes running on Xanadu and Ta are specified by the user and are assigned to those processors during the initial phase of the mapping algorithm. Hence the communication cost between Xanadu and Ta is not considered for the mapping.

3. System Graph

A system graph is a multigraph representing the interconnections of the transputer network. The system graph for the processors in Figure 4.3, minus Xanadu, is shown in Figure 4.4. Each node represents a transputer, and an edge between two nodes represents a physical link between two transputers. Each transputer has four links. If all four links of the two transputers are connected to the same crossbar, then there are 16 different ways to connect them. Therefore, the system graph has up to 16 edges between two nodes. Each edge is associated with two data structures (as shown in Table 4.1) for two unidirectional wires.

Figure 4.4: System graph (e1 = {w1, w'1}, e2 = {w2, w'2}, e3 = {w3, w'3})

The main advantage of using a system graph is that it hides the hardware-specific information. This graph can be used as an input graph to any mapping algorithm. This is evident from Figure 4.2, in that there is no direct edge from the hardware database to an algorithm module. A system graph is independent of a particular mapping instance (i.e., process graph). It can be constructed once and stored in the database for reuse, unless the hardware configuration or the requested resources change.

4. Conflict-Graph

The system graph stores the communication paths between the nodes. There is, however, the potential for conflicts. A conflict graph is constructed from the system graph to encode the information about which pairs of system edges cannot be mapped simultaneously. In Figure 4.4, there are three edges between Ta and T15: e1, e2, and e3. Since e2 and e3 share link 3 of T15, only one of them can be used for the mapping. In general, two edges e = <n1, lk1, n2, lk2> and e' = <n'1, lk'1, n'2, lk'2> conflict if and only if

    ((n1 = n'1) ∨ (n2 = n'2)) ∧ ((lk1 = lk'1) ∨ (lk2 = lk'2))

A conflict graph for Figure 4.4 and its matrix representation are shown in Figure 4.5. The graph shows a conflict between e2 and e3, and the adjacency matrix has the value one in the corresponding row and column. Like a system graph, a conflict graph depends only on the hardware configuration and is independent of the mapping instance.

Figure 4.5: Conflict-graph for Figure 4.3

5. Abstract Processor Graph

The abstract processor graph or Host graph is a subgraph of the system graph. Different host graphs are constructed for different mapping instances. Before the mapping starts, there are as many nodes in the host graph as there are transputers in the component requested by the user. Since the crossbar switches are reconfigurable, their link information is not available until the mapping is finished. Hence only hardwired links (i.e., wires between two PEs without any intermediate crossbar switches) are added to the host graph initially. Once the mapping is done, physical routing takes place, and the paths between every pair of mapped transputers are statically configured. Configuration information such as transputer links, crossbar switches, and their links is stored during routing. The transputers that are not mapped and are not connected to any mapped transputers are removed from the host graph.
Thus, after the mapping is done, it is guaranteed that the host graph is a connected graph that embeds the process graph.

6. Process Graph

A parallel application is specified in the form of a process multigraph where each node represents a process and an edge between two nodes indicates a communication channel. It is assumed that communication channels are unidirectional, one-to-one, and named on each end; this name is called the "port name" of the channel. Thus a channel endpoint is defined by a tuple <process_name, port_name>, and a channel is an association of two such tuples. The type of communication (e.g., one-to-one, one-to-many, and many-to-one) is also characterized in this graph. An example of a process graph is shown in Figure 4.6.

Figure 4.6: User Process Graph

The above process graph has four processes, and each process has two ports bound to it. These processes communicate on eight different channels: {<A, east>, <B, west>}, {<B, south>, <D, north>}, {<D, west>, <C, east>}, {<C, north>, <A, south>}, {<B, west>, <A, east>}, {<D, north>, <B, south>}, {<C, east>, <D, west>}, and {<A, south>, <C, north>}. There is only one-to-one communication, since no two channels have the same endpoints (i.e., process/port pair).

7. Abstract Process Graph

The abstract process graph or Guest graph is the same as the process graph shown in Figure 4.6 except that, if contraction occurs, each node of the guest graph denotes a virtual process (a virtual process is a process with more than one user process mapped onto it). An edge stores (logical) routing information. Once the mapping is done, logical routing determines the paths of any non-local communication, wherein the channels given by the process graph are assigned to the physical links in the host graph. The contracted process graph for Figure 4.6 is shown in Figure 4.7. X and Y denote the virtual processes. Once the node assignment is done, processes A and C will be mapped to the same processor, and the same is true for processes B and D.

Figure 4.7: Guest Graph

8. Mapping Algorithm

Given a guest graph and a system graph, the mapping problem is reduced to embedding the guest graph onto the system graph. A one-to-one embedding of a guest graph onto a system graph is desirable. However, designing a generic algorithm for various topologies is intractable. Instead, different algorithms and heuristics for different topologies can generate good mappings in many cases [31]. The graph structures described above provide the basic objects for any mapping algorithm. These graphs hide the internal details of the parallel architecture and the parallel program structure. In §4.3.3, we present a greedy algorithm that employs heuristics to map several process graphs onto a configurable transputer system (§4.1).

9. Mapping State

Once the mapping is done, the information about the embedding, the logical routes between all pairs of guest nodes, and the wires connecting mapped processors is stored in the database. The loader tool extracts this information from the database, configures the network of the host graph, and generates a script to boot the network and to load the application programs onto the transputers.
The mapping information is also used to measure congestion² on the hardware channels, to identify the communication type for every edge in the guest graph (i.e., whether the communication occurs on local, neighbouring, or remote transputers), and to generate events for synchronization between the processes. This configuration information is then passed to the runtime system, where it is used to select the optimal Trollius functions (§5.3).

²defined in §4.3.2

4.3 Mapping Strategy

The mapping strategy is guided by constraints such as dilation, congestion, load, and expansion, and is aimed at minimizing the cost of communication. One can formulate these constraints in the form of objective functions. The mapping strategy then attempts to optimize these objective functions. We first present the mathematical formulation of the mapping problem, followed by the mapping constraints and the objective function, and finally the algorithm.

4.3.1 Mathematical Formulation

Let G = (VG, EG) be a guest graph to be mapped onto a target machine, where VG = { vi | vi ∈ G } is a set of nodes representing the processes and EG = { eij | eij = (vi, vj) and vi, vj ∈ VG } is a set of bidirectional edges. The weight of an edge, wij, denotes the traffic intensity. The adjacency matrix for G is denoted by AG.

Let S = (VS, ES) be a system graph representing a target architecture, where VS = { vi | vi ∈ S } is a set of processors and ES = { eij | eij = (vi, vj) and vi, vj ∈ VS } is a set of physical links of the hardware. Let wij be the weight on edge eij ∈ ES; it denotes the communication cost associated with the edge. AS denotes the adjacency matrix for the system graph. A system path pij, defined for any two system nodes, corresponds to a sequence of system edges. It is denoted as pij = (eik1, ek1k2, ..., eknj) where eik1, ek1k2, eknj ∈ ES. As explained in §4.2, there are conflicts, as given by the conflict graph, between many edges in the system graph. T denotes this conflict graph.

We assume that |VG| ≤ |VS|. G is an abstract process graph as defined in §4.2. If there are more processes than processors, then the contraction is specified by the user and G is a contracted graph. The mapping problem has been reduced to an embedding of G into S. An embedding maps each vertex in VG onto a vertex in VS and each edge in EG onto an edge (or edges) of ES. The corresponding mapping function is denoted as f : VG → VS.

4.3.2 Mapping Constraints

The following cost metrics are considered to measure the quality of the mapping: load, congestion, dilation, and expansion. They are discussed in detail in [4]. The objective function for our optimization is derived from these measures.

• Load: Load is the number of nodes in a process graph that are embedded in a single node in the system graph. In our case, contraction is done manually through the Parsec interface and we assume that only one guest node (a virtual process representing one or more user processes) is embedded on a single system node.

• Congestion: Edge congestion is the congestion over a system edge and is defined as the number of guest edges mapped onto it. The congestion of an embedding is the maximum edge congestion over all the system edges. When the congestion is greater than one, the system edge is required by more than one guest edge.
Hence communication on these edges is delayed unless there are multiple system edges between the two nodes. Thus the communication cost increases with the congestion. The congestion over eij ∈ ES is given as

    C(eij) = |E'G|                                                    (4.1)

where E'G = { emn | pmn = (emi, ..., eij, ..., ejn) for emn ∈ EG and emi, eij, ejn ∈ ES } such that f(vm) and f(vn) ∈ VS for all vm, vn ∈ VG.

• Dilation: The dilation of a guest edge is the length of the path (a path consists of one or more system edges) that the guest edge is mapped onto. The dilation of an embedding is the maximum dilation over all the guest edges. When the dilation is greater than one, communication over the guest edge may be delayed because the message has to traverse multiple system edges. In addition, there may be congestion over some of the system edges. Thus, like congestion, dilation also adds to the communication cost. The dilation of a guest edge eij is given as the distance between f(vi) and f(vj). We denote this distance as DG(f(vi), f(vj)).

• Expansion: Expansion is the ratio of the number of nodes in a system graph to the number of nodes in a guest graph. The expansion is related to the dilation of an embedding, and optimizing one factor can adversely affect the other [4]. The expansion of the embedding does not directly affect the cost of communication. Our mapping strategy attempts to minimize only the dilation and the congestion. In Parsec, the expansion is determined by the user. The number of nodes in a guest graph depends on the contraction of the process graph, which is specified by the user, whereas the nodes in the system graph are limited by the size of the component. However, once a component is allocated, the mapping tool will use as many transputers from that component as required.

Given that the entire communication takes place in a single phase (a phase is the time interval for the communication to take place over a single guest edge), if the edges in a system graph and a guest graph have the same weights, then the total communication cost is the same as the maximum dilation or congestion. However, if the weights on the edges are different, then these weights should be taken into account while measuring the communication overhead. The weight on a guest edge indicates the traffic intensity. For example, if one edge is used more frequently than another edge, then it carries more weight. In the system graph, the edges representing crossbar links and direct links have different weights, reflecting the different bandwidths of the two. With these different weights, the congestion over a system edge, eij, is measured as the sum of the weights, wmn, of all the guest edges that are mapped onto it:

    C(eij) = Σ_{emn ∈ E'G} wmn                                        (4.2)

where E'G is defined in equation 4.1. Similarly, the weights on the system edges (instead of the number of system edges) should be taken into account while measuring the communication cost due to dilation.

The dilation of an embedding is minimum if each of the guest edges maps to a single system edge. Let "dil-one" be the number of guest edges mapped onto one and only one system edge. We call it the "embedding amount" of the mapping function f and define it as

    dil-one = (1/2) Σ_{i,j} AG(vi, vj) × AS(f(vi), f(vj))             (4.3)

This equation counts the number of times an adjacent pair of guest nodes, vi and vj, remain adjacent when mapped to S, the system graph.
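As a concrete illustration of equation 4.3, the following C fragment computes dil-one from the two adjacency matrices and a node assignment. It is a sketch under assumed data structures (dense 0/1 adjacency matrices and an array f[] giving the assignment of guest node i to system node f[i]); it is not code from the mapping tool itself.

    #define MAXN 128   /* assumed upper bound on graph size, for illustration only */

    /* dil-one (equation 4.3): count the guest edges whose endpoints remain
       adjacent in the system graph under the assignment f. */
    int dil_one(int n, int AG[MAXN][MAXN], int AS[MAXN][MAXN], int f[MAXN])
    {
        int i, j, count = 0;

        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                count += AG[i][j] * AS[f[i]][f[j]];

        return count / 2;   /* each undirected adjacency was counted twice */
    }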
Depending on the guest graph, one of the following two cases arises with any mapping strategy:

Case 1: dil-one = |EG|

Lemma: If the embedding amount, dil-one, is equal to the number of edges in the guest graph, then the dilation of the embedding is 1.

Proof: We are given that dil-one = |EG|. Hence, from equation 4.3, it is clear that for every edge eij = (vi, vj) ∈ EG (i.e., for every pair of nodes vi and vj adjacent in G) the images f(vi) and f(vj) are adjacent in S. We know that the mapping distance DG(f(vi), f(vj)) = 1 if and only if f(vi) and f(vj) are adjacent in S, and the dilation of an embedding is defined as max { DG(f(vi), f(vj)) }. In this case DG(f(vi), f(vj)) = 1 for all vi, vj ∈ VG. This proves that the dilation of the embedding is 1 when dil-one is the same as the number of edges in the guest graph. □

If there are multiple edges between the same pair of guest nodes, then the congestion can be greater than one while the dilation is equal to one. The mapping algorithm will map multiple guest edges onto the multi-edges of the system graph in order to minimize the congestion. Since the system edges have different weights, these weights must also be considered while mapping in order to minimize the communication overhead.

Case 2: dil-one < |EG|

If dil-one is less than the number of edges in the guest graph, then the dilation and hence the congestion are greater than one. We assume that the entire communication takes place in a single phase, in that all guest edges are required simultaneously. Hence the total cost of communication is the maximum of the communication overhead on any guest edge. The communication overhead on a guest edge depends on the congestion, the dilation, and the weights on the system edges and the guest edges. The mapping strategy should take into account all these factors in order to minimize the communication overhead. The communication overhead, CO, on the guest edge eij is given as

    CO(eij) = max_{emn ∈ pij} { C(emn) × wmn }                        (4.4)

where C is defined in equation 4.2, eij is dilated over the system path pij, and wmn is the weight on the system edge emn. The expression in braces indicates the communication cost associated with an individual edge in the system path. The maximum cost over all the edges in the system path gives the communication overhead on the corresponding guest edge.

As an illustration of the mapping strategy, consider the following example.

Figure 4.8: (a) Guest graph, (b) Host graph for case 1, (c) Host graph for case 2

Figure 4.8 (a) shows the guest graph to be mapped. The number on each edge indicates the weight associated with it. A system graph represents a dynamically reconfigurable machine, so different host graphs may be generated depending on the resources requested or the guest graphs. Figure 4.8 (b) and (c) show two such host graphs. The numbers on the system edges represent the weights. The guest nodes, shown as circles, get mapped onto the adjacent system nodes, shown as squares. The mapping shown in Figure 4.8 (b) is the best mapping. Here dil-one is the same as the number of guest edges, which is equal to six. Hence the dilation of the embedding is one, and the congestion of the embedding is also equal to one since there are no multiple edges. The total communication cost is the communication overhead on the guest edge e02 or e13, which is equal to two.
There is one more embedding possible (s0[g0], s1[g1], s2[g2], and s3[g3]) which also results in minimum dilation and congestion. However, in this case the communication cost is four.

If dil-one is less than the number of edges in the guest graph, then the dilation and the congestion are greater than one. One such possible mapping is shown in Figure 4.8 (c). Only four out of six guest edges are directly mapped onto the system graph, and the guest edges e03 and e12 are dilated. To minimize the total cost, the mapping strategy tries not to dilate the guest edges with higher weights. In this example, there are three edges at guest node g0 and only two edges at system node s0, onto which g0 is mapped. Hence one of the three guest edges must be dilated. However, the guest edge e02 has more weight and hence is mapped onto system edge e01, and guest edge e03, with less weight, is dilated. Having done this, routes for the dilated guest edges e03 and e12 must be determined.

The guest edge e03 can be dilated over system path (e01, e12) or (e03, e32). Given that e03 ∈ EG is dilated over (e01, e12), where e01, e12 ∈ ES, from equation 4.4 the communication overhead on e03 is given as

    CO(e03) = max { C(e01) × w01, C(e12) × w12 }
            = max { (3 × 1), (2 × 2) }
            = 4

Similarly, CO(e03) is 4 if the guest edge e03 is dilated over (e03, e32), where e03, e32 ∈ ES. In the case of equal overhead, the first route is selected. Assume the guest edge e03 is dilated over (e01, e12). While mapping guest edge e12, one should take into consideration the guest edge e03 that is already mapped. It can easily be verified that, for the guest edge e12, the communication cost over the system path (e30, e01) is four and over (e32, e21) is six. Hence the guest edge e12 is dilated over (e30, e01).

In this section we discussed various constraints and how they affect the quality of the mapping. In the next section, the mapping algorithm that implements the strategy described here is presented.

4.3.3 Algorithm

Both the guest graph and the system graph are multigraphs and have different weights on their edges. It is not sufficient to optimize any one cost metric. We try to optimize the objective function, dil-one, defined by equation 4.3. Heuristics are combined with a greedy algorithm to maximize dil-one and to minimize the communication overhead by considering the actual weights on the edges. The algorithm employs recursion and backtracking techniques for an efficient search through the system graph. The input to the algorithm is G (guest graph), S (system graph), and T (conflict graph), and the output is the physical routes, stored in H (host graph), and the logical routes, stored in G. The algorithm carries out two steps: node assignment and routing. During the node assignment, dil-one is maximized in that a one-to-one edge mapping is sought between the guest edges and the system edges. If there are multi-edges in the guest graph, then the algorithm seeks multi-edges in the system graph. To minimize the overhead, the guest edges with more weight are mapped onto the system edges having smaller weights (i.e., the direct links). If the dilation of the embedding is greater than one, then during the routing phase the logical routing is done for the dilated edges such that the communication overhead, CO (equation 4.4), is minimized. The physical and logical routing information is then stored in the host graph and the guest graph, respectively.
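The route selection for a dilated guest edge therefore reduces to evaluating equation 4.4 for each candidate system path and keeping the cheapest one. The following sketch shows one way this evaluation could be written; the array-based representation (congestion and weight indexed by system-edge number) is an assumption made for the example, not the representation used in the tool.

    /* Overhead of dilating one guest edge over a candidate system path
       (equation 4.4): the maximum of C(e) * w(e) over the edges e in the path.
       path[] holds system-edge indices; C[] and w[] hold the congestion
       (equation 4.2, with this guest edge included) and the edge weights. */
    int path_overhead(int nedges, const int path[], const int C[], const int w[])
    {
        int k, cost, max = 0;

        for (k = 0; k < nedges; k++) {
            cost = C[path[k]] * w[path[k]];
            if (cost > max)
                max = cost;
        }
        return max;
    }

For the example above, this evaluates to 4 for both candidate paths of e03, so the first path is kept.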
The algorithm defines the "first guest node" and the "first system node" as follows. The "first guest node" can be specified by the user. It is either the root of the guest graph or the host process (the host process runs on the host machine and must be specified by the user). If neither is specified, then the algorithm selects the guest node having minimum degree. In the case of the system graph, either the host machine or the root transputer, retrieved by the resource allocation module, is selected as the "first system node".

The mapping starts by assigning the "first guest node" to the "first system node". Thereafter, at every level of recursion, the algorithm finds the number of to-be-mapped guest edges at the last mapped guest node and seeks that many non-conflicting, unmapped edges at the corresponding system node. The system graph represents the entire search space, and the algorithm attempts to find a subgraph which is isomorphic to the guest graph (i.e., to maximize dil-one). As we will see in Chapter 5, this allows the runtime system to make use of the lowest-level communication primitives to minimize the communication cost. The pseudo-code for the algorithm is presented below.

    Input  : G, S, T, H
    Output : node embedding, H and G with physical and logical routing

    begin /* main */
    1    LastMappedVg := Vg := { First node in G };
    2    UnMappedVg := { all nodes in G } - Vg;
    3    LastUsedVs := Vs := { First node in S };
    4    UnUsedVs := { all nodes in S } - Vs;
    5    level := 1;
    6    map[Vg] := Vs;
    7    DoEmbedding(level, LastMappedVg, LastUsedVs);
    8    SetPhysicalRoutes(G, H, map);
    9    DoLogicalRouting(G, H, map);
    end

    begin /* DoEmbedding */
    7a   for Vg in LastMappedVg do
    7b       Get EdgesToMap at Vg;
    7c       Get adjEdges at Vs (= map[Vg]);
    7d       Sort adjEdges in ascending order of their weights;
    7e       Check whether adjEdges provides at least one set of |EdgesToMap|
             non-conflicting system edges to map EdgesToMap;
    7f       EdgeList := GetTuples(EdgesToMap, adjEdges);
    7g       if EdgeList = {}
    7h           return(failure);
    7i       for tuple in EdgeList do
    7j           for edge in tuple do
    7k               Insert NewVg in NowMappedVg;
    7l               Insert NewVs in NowUsedVs;
    7m               Delete NewVg from UnMappedVg;
    7n               Delete NewVs from UnUsedVs;
    7o               map[NewVg] := NewVs;
    7p               Update S;
    7q           endfor
    7r           if UnMappedVg = {}
    7s               return(success);
    7t           if (DoEmbedding(level+1, NowMappedVg, NowUsedVs) = success)
    7u               break;
    7v       endfor
    7w       Update S;
    7x   endfor
    end

Several aspects of the algorithm are discussed below. The numbers correspond to the line numbers shown in the pseudo-code.

• Lines 1-6: As previously discussed, the initial node assignment is done.

• Line 7: The DoEmbedding function is called recursively to maximize dil-one. For each mapped guest node, the following procedure (lines 7a-7x) is repeated. At any given time, the edges in the system graph are in one of three states: Avail, CantUse, or Used, meaning that a particular system edge is available (Avail) for mapping, a guest edge is mapped on it (Used), or it cannot be used (CantUse) for mapping because of a conflict with an already Used system edge. Initially the state of all the system edges is set to Avail.

— Line 7b: EdgesToMap, the set of unmapped guest edges at the last mapped guest node, is derived from the guest graph. |EdgesToMap| indicates the desired size of dil-one at the guest node.

— Line 7c: Similarly, adjEdges, the set of Avail system edges at the corresponding system node, is constructed.
— Line 7d: The set adj Edges is sorted in ascending order of weights so that the edges with less weight get mapped first. In particular all the direct links of a transputer are mapped before the crossbar links. — Line 7e: Check is made whether adj Edges has at least one set of non-conflicting system edges. This feasibility check detects the possible failure before any mapping is done and thus backtracks at early stage of recursion. — Lines 7f-7h: GetTuple function finds all possible sets (EdgeList) of non-conflicting system edges from adj Edges, each of size |EdgesToMap| or less. EdgeList is a list of tuples each having |EdgesToMap| or less system edges. A l l the edges in each tuple are Avail and non-conflicting. This list is a search-space at a particular level of recursion. EdgeList is sorted in descending Chapter 4. Mapping Strategy on Multicomputers 57 order of tuple size. Thus the tuples with |EdgesToMap| system edges are tried (to maximize dil-one), before the smaller tuples (which results into greater dilation). ' — Lines 7i-7q: The algorithm then maps EdgesToMap guest edges onto the first tuple from EdgeList. The system graph is then updated. The edges in this tuple are marked as Used and all the conflicting edges are marked as CantUse. - In the case of backtracking, the status on every system edge that is unmapped, is restored, at the previous level the next tuple from EdgeList is mapped and the process continues. • Line 8: Once the node assignment is done, all the system edges with Used status are copied to the host graph. A l l the paths between every pair of transputers in the host graph are statically configured (i.e., the transputer links, crossbar switches, and crossbar links are stored on the edges in the host graph). • Line 9: During logical routing, the routes between every pair of guest nodes are constructed. When the dilation is greater than one, the paths between non-local nodes are constructed so as to minimize the communication overhead (§4.3.2). Mapping Analysis The node assignment and the routing specify the configuration information required by the loading tool and the runtime system (§3.1). This information includes the process-to-processor bindings, physical links between the processors including current crossbar settings, and the routes for all the communication bindings. The mapping tool analyzes the logical routes to identify their communication characteristics. The characteristics include multiplexing over the physical links (measure of congestion) and the scope of Chapter 4. Mapping Strategy on Multicomputers 58 each communication binding (i.e. whether the communication is local, neighbour-to-neighbour, or non-neighbour). On the transputer system, the Parsec runtime system uses Trollius events as synchronization objects (described in Chapter 5). The mapping tool assigns these events to each route characterizing point-to-point and collective com-munication. In the case of user-specified contraction, the mapping tool also generates events for all the intra-processor communication. 4.3.4 Exper imenta t ion As described in §4.1, our transputer network is not regular. A large number of crossbar connections allows the system to be configured for different network topologies. We tested numerous regular topologies (i.e. the guest graphs) including tree (binary, ternary, binomial), mesh, ring, chain, torus, butterfly, and hypercube. The performance of the algorithm is measured in terms of dil-one, the embedding amount. 
In general, finding the best mapping for a particular guest graph is difficult. However, if dil-one is equal to the number of edges in the guest graph, then the dilation of the embedding is one and the solution is definitely the optimal solution. The mapping tool can be run in two modes: normal and quick. In normal mode, it backtracks upon failure and continues to search for an optimal solution. In quick mode, backtracking is disabled to guarantee a solution for any type of guest graph. The GUI for the mapping tool is described in Chapter 6.

Table 4.2 shows the mapping results for various topologies of different sizes. The measurements were taken on a Sun Sparc10 machine. The time taken for the mapping and the value of dil-one are shown for both modes. The column |EG| lists the total number of bidirectional edges for each guest graph, and Size shows the cardinality of the guest graph.

The algorithm neither assumes any hardware interconnection pattern nor takes advantage of the topology type (tree, mesh, hypercube, etc.) of the guest graph. However, the following heuristic (which assumes that the transputer system resides in different components) is employed to expedite the search in the system graph: at any given mapped guest node, there are |EdgesToMap| guest edges to be mapped. EdgeList is a list of tuples, each having |EdgesToMap| or fewer system edges. In any tuple, all the system edges have one common node that is already mapped, and the other node is unmapped. Tuples having their unmapped nodes in different components get higher priority during mapping.

    Topology          Size   |EG|    Normal           Quick
                                     Time (s) dil-one Time (s) dil-one
    Chain             32     31      1.84     31      1.68     31
                      64     63      9.85     63      9.86     63
    Binary Tree       8      7       0.54     7       0.58     7
                      16     15      0.81     15      0.94     14
                      32     31      5.84     31      3.33     27
                      64     63      72.56    63      15.32    44
    Ternary Tree      14     13      12.14    13      0.84     11
                      41     40      -        -       10.58    22
    Mesh              3x3    12      1.00     12      0.63     11
                      4x4    24      16.44    24      0.90     20
                      5x5    40      7563.59  40      6.76     31
                      6x6    60      222230.0 60      9.78     36
                      8x9    127     -        -       19.96    90
    Butterfly         12     16      2.55     16      2.54     16
                      32     48      -        -       5.93     35
    Hypercube         8      12      4.48     12      0.67     11
                      16     32      1.23     22      1.28     22
    Shuffle Exchange  8      10      4.13     10      0.65     9

    Table 4.2: Mapping Statistics

As shown in Table 4.2, the algorithm performed reasonably well for the trees but not for the meshes. As the graph size increases, the time required in normal mode grows exponentially. The mapping tool could not map an 8x9 mesh, a ternary tree of depth four, or a three-dimensional butterfly in normal mode. Our heuristic worked efficiently for those topology types and sizes for which there was more than one solution. In quick mode, a solution is found for all the topologies, but at the expense of a smaller dil-one (i.e., greater dilation of the embedding). Apart from these topologies, the mapping was tested for different variations of regular topologies (e.g., a complete graph, a binary tree with its leaves connected to the root) suitable for various parallel programming paradigms.

Chapter 5

Runtime System for Transputers

The runtime system, described in Chapter 3, consists of a high-level message-passing library. It selects the optimal communication primitive from a set of primitives supported by the underlying message-passing system. The process model, described in §3.3.1, and the port primitives, described in §3.3.2, are common to any type of multicomputer.
How-ever, their implementation-specific features are influenced by the underlying hardware architecture, the operating system and the communication support. The runtime system must hide these system level details and provide simple primitives without compromising performance. In this chapter, we describe the Parsec runtime system developed for the transputer-based multicomputer. The underlying process model, supported by a transputer, is described in §5.1. The Parsec runtime system is implemented on top of Logical C [7, 8] and Trollius [9]. These low level software tools are presented in §5.2. In §5.3, we describe the Parsec runtime system. The optimization techniques and implementation level details of different com-munication types are described. Finally, the performance of the Parsec primitives is measured in §5.4. 5.1 Transputer P r o g r a m m i n g M o d e l A transputer has four inter-processor communication links to support parallel processing at the hardware level. Two transputers can be connected by their communication links and thus can be used as the basic building blocks for constructing complex parallel 61 Chapter 5.- Runtime System for Transputers 62 architectures. Communication within a single transputer occurs through the internal channels, implemented as words of memory. Each transputer has a built-in process scheduler with a process context switching time of less than one microsecond (for the INMOS T800-25 model). The communication links operate concurrently with C P U and thus data transfer on those links takes place with little effect on C P U performance. Occam [14] is a parallel programming language that reflects the transputer architec-ture. The Occam programming model consists of a set of parallel processes commu-nicating on the channels. Communication between the processes is point-to-point and synchronous. A channel provides a one-way communication path between two processes. At any time, a process may be ready to communicate on one or more of its channels. Channels between two processes on the same transputer are implemented using shared memory, and channels between two processes on different transputers are implemented by the communication links. Non-blocking communication can be implemented by intro-ducing buffer processes on the links. 5.2 L o w . L e v e l Software Tools The Parsec runtime system is implemented on top of two software environments: Logical C and Trollius. It uses the communication functions provided by Logical C and Trollius. We describe below the key features influencing our runtime system. 5.2.1 Logica l C Logical C [7, 8] is a transputer toolset consisting of standard C library routines and transputer specific routines. The transputer specific routines are written in transputer assembly language and are very efficient. Functions are provided for process creation, process communication through message passing or shared data and semaphores, and Chapter 5. Runtime System for Transputers 63 determining the status of the channels which are essential to implement one-to-many and many-to-one communication. The features that are used in our runtime system are discussed below. 1. Logical C identifies two types of channels: soft channel and hard channel. The interprocess communication between two processes on different transputers takes place through a hard channel, formed by connecting communication links of the two transputers. 
Communication between two processes on same transputer takes place through a soft channel, which is basically a word of memory. However the routines used for input and output to the channel are the same for both types (e.g., Chanln/ChanOut, ChanlnChar/ChanOutChar, Chanlnlnt/ChanOutlnt etc.). 2. A soft channel is created using the ChanAlloc function and hard channels are pre-defined by the system. There are eight hard channels, one for input and one for output on every physical link of a transputer. They are LINKOIN, LINKOOUT, LINK1IN, LINK10UT, etc. 3. During a channel operation, two processes must perform operations on the same data size. Thus, if one process writes 2 bytes to a channel on one end and another process attempts to read 1 byte on the other end then one overwrites the other. The runtime system should guard against this type of mismatch. 4. At any given time, a process can be ready to read data from one or more of its channels. A set of function calls are provided to determine the status of the channels and to wait until a channel is ready for input. These functions are ProcAltList, ProcSkipAlt, ProcSkipAltList, and ProcTimerAltList. Chapter 5. Runtime System for Transputers 64 The basic channel communication functions (Chanln, ChanOut etc.) consist of only a few transputer assembly instructions and hence add little overhead. The type of chan-nel (hard channel vs. soft channel) used by these functions depends upon whether the communication is intra-processor or inter-processor. Unlike hard channels, soft channels must be declared and defined before they are used for communication. Hence the user must know the process-processor bindings (mapping information) and the channel infor-mation must be hand-coded into a'program depending on these bindings. This makes the program development mapping-specific. Also the channel functions support only synchronous, point-to-point, and non-multiplexed type of communication. 5.2.2 Trollius Trollius [9] is an operating system for multicomputers that provides the user with a wide range of functionality in a hierarchical form. Each trollius node consists of a multi-tasking local environment upon which the rest of the operating system is built. There is a local kernel at each node that synchronizes any number of local messages. Trollius provides three basic objects for communication and synchronization: node identifier, process, and event. The node identifier (nodeid) is a globally unique identifier chosen by the user and represents a single trollius node. The other two objects, process and event, are local to the node. Events are chosen by the user and are used to rendezvous (two processes with identical events always synchronize). Trollius adds a buffering mechanism to the transputer programming model (§5.1) It supports a wide range of communication services such as synchronization, buffering, flow control, packetization, routing, and data conversion. There are two levels of message passing in Trollius: the kernel level and the network level. The kernel level allows com-munication between two processes on the same node whereas the network level is used for communication between two processes on different nodes. The kernel level function ksend Chapter 5. Runtime System for Transputers 65 attempts to send a message to another process on the same node and krecv attempts to receive a message from another process on the same node. The Trollius kernel provides rendezvous services for the sender and receiver. 
The network level message passing has four sub-levels representing different functionality and overhead. They are the physical level, the data-link level, the network level, and the transport level in order of increasing functionality and decreasing efficiency. These levels are described below in more detail. 1. Transport Level : The transport level functions tsend and trecv provide flow control and process synchronization. The sending process blocks until it receives a "request to send" message from the receiving process. This pre-synchronization is followed by the actual message transfer using network level functions. Non-blocking functions are not available at this level. 2. Network Level : The network level functions nsend and nrecv do routing and packetization but do not. provide flow control. The sending process returns imme-diately once the message has left the local node. If the receiving process is not ready, the message is buffered at the receiving side. In that sense nsend is loosely blocking. The receiving process will block if there is no message with the same event in the local buffer. There are two non-blocking variants called ntry_send and ntryjrecv where the message is either immediately transferred or an error is returned. Virtual Circuits Each Trollius node runs multiple processes. They are responsible for buffer man-agement, memory management, file management, routing, and communication on the links. When a sending process sends a message at the transport level or at the network level, it passes through the buffer and link processes on both the source Chapter 5. Runtime System for Transputers 66 and the destination node, before being delivered to the receiving process. This ex-tra message copying incurs extra overhead. In order to obtain better performance, the trollius processes and other user processes can be bypassed by establishing vir-tual circuits between sending and receiving processes. A virtual circuit reserves the links between any two user processes on any two processors. No Trollius process or other user processes can access the reserved links. As a result, the processes on the intermediate nodes cannot communicate once a virtual circuit is set up between two non-adjacent nodes. Virtual circuits are also available locally as well. The kernel level functions can bypass the local kernel once an initial rendezvous has been made. 3. Data-link Level : The data-link level functions dsend and drecv are similar to network level functions when used for non-neighbour message passing except that they do not support packetization and routing. Hence the user must know information about the forwarding link process and perform packetization in the code. They are more efficient than the network level functions when used only for nearest-neighbour message passing using the virtual circuits. 4. Physical Level : The physical level message passing functions psend and precv directly use. physical links between the transputers (i.e., they are used only for nearest-neighbour communication). These functions can be used only after the virtual circuit is established using the data-link or network level functions. There is no protection against the sender/receiver length mismatch. The transport and the network level functions are simple yet powerful and together support several communication services (flow control, packetization, routing, etc., as mentioned above). However, performance requires functions at the data-link or physical Chapter 5. 
Runtime System for Transputers 67 level which in turn requires detailed understanding of the operating system and the hardware. Trollius and Logical C together provide a rich set of message passing primitives with varying semantics and efficiency. The transport level and network level functions are easy to use and their behaviour is deterministic. Multiple processes can access the transputer links simultaneously and hence debugging is possible with these functions. However they introduce considerable overhead. We refer to these primitives as the high-level primitives. On the other hand, the data-link level and physical level functions are efficient but difficulty lies in setting up the virtual circuits. We refer to them as the low-level primitives. 5.3 R u n t i m e System The Parsec runtime system is built on top of Trollius operating system. It is possible to include Logical C library with Trollius to make channel functions available from within the Trollius environment. In the remainder of this chapter, we refer to the channel functions as a subset of the Trollius functions. The Parsec message passing primitives, portsend and port.recv (§3.3.2), select the optimal communication functions from among the Trollius functions described in the previous section. The selection criteria depends on the following factors: • Communication scope (local/neighbour/non-neighbour) • Semantics (buffered/synchronous) • Debugging switch The mapping tool analyzes the routing information to identify the scope of each commu-nication binding. Local communication (i.e., all the processes running on same node) is Chapter 5. Runtime System for Transputers 68 specified by the user. The congestion over the physical links (i.e., the multiplexing over the links) and the process-to-processor assignment define the scope of non-local commu-nication. If the congestion is equal to one then the virtual circuit can be set up on the link. This allows one to use the low-level communication primitives to improve perfor-mance. If the congestion is greater than one (meaning more than one user processes read and/or write over the physical link) then the virtual circuits can not be set up even if the processes lie on the adjacent processors. In such case, the high-level primitives must be used. This is exemplified in §1.2. The semantics of a program are controlled by the user. The user can specify, through the Parsec graphical interface, whether the communication is buffered or synchronous. It is specified either at the port level or at the application level (i.e., specified for all the ports). The blocking time for every send and receive can.be controlled by the timeout mechanism. A timeout of zero indicates that the process will be blocked until the sending process or the receiving process returns successfully, and a positive timeout (t) indicates that process will be blocked for t micro seconds and return either with a success or an error. With the low-level primitives the debugging is difficult as the virtual circuit allows only two user processes (that established the virtual circuit) to access the link. Hence for debugging purposes one can use the high-level communication primitives. This is facilitated by Force_Debug switch. Thus the user can debug a program using high-level primitives and later remove the switch to use low-level primitives where possible. . 
In addition to the routing information, the runtime system uses the following information generated by the mapping tool: the physical address of each transputer (nodeid), the event (a positive integer), and the forwarding link number. For point-to-point communication, the mapping tool generates one event per port. For collective communication, there is one event for the port and one event for each route accessible from that port. These events are assigned to the Trollius events and thus used for rendezvous. Once an appropriate Trollius function is selected, the events, the nodeids, and the link numbers are required by the individual function to configure its internal control structures. For example, network level functions require the nodeid and event, whereas hardware level functions require the address of the forwarding link.

Decision Tree

Figure 5.1 shows the mapping between port_send/port_recv and the Trollius functions.

[Figure 5.1: Decision Tree — port_send/port_recv map onto the Trollius functions as follows: with Force_Debug set, buffered communication uses nsend/nrecv and synchronous communication uses tsend/trecv; with optimized communication, buffered semantics use nsend/nrecv and synchronous semantics use (ChanIn/ChanOut, ksend/krecv, dsend/drecv) or nsend/nrecv.]

With buffered semantics, only network level functions can be used, irrespective of whether the communication is optimized or not. For optimized and synchronous communication, the choice of Trollius function depends on the communication scope. It uses both the high-level (nsend/nrecv) primitives and the low-level (ChanIn/ChanOut, ksend/krecv, or dsend/drecv) primitives. The following section describes the implementation details of point-to-point and collective communication.

5.3.1 Implementation Details

The configuration information and the user-specified semantics information are made available to the runtime system by the loader tool as command-line parameters. During the initialization phase, this information is stored into the data structures corresponding to the ports. The process is attached to the Trollius kernel (using the kinit function), and the communication channels (hard channels and soft channels) are set up for neighbour-to-neighbour and local communication. The optimization is performed at runtime. The Parsec primitives, port_send and port_recv, access this information through the port data structures on which they communicate. The port information is then used to select the Trollius primitives.

The runtime system supports point-to-point and collective (one-to-many and many-to-one) communication. In either case, two communicating user processes can reside on the same node, on adjacent nodes, or on remote nodes. Similarly, the processes participating in a collective communication can be mapped onto one or more of these three locations. The configuration information required by the Trollius functions is different in each of these cases.

The high-level primitives are easy to use: port_send and port_recv directly call network level and transport level Trollius functions. However, to use the channel primitives, it is necessary to set up a virtual circuit (for neighbour-to-neighbour communication) and to broadcast the channel addresses to all processes (for local communication). The design decisions in the case of local and neighbour-to-neighbour communication are influenced by the following Trollius-specific features.
• For maximum efficiency, ChanOut/ChanIn are used instead of psend/precv (for neighbour-to-neighbour communication) and ksend/krecv (for local communication). At the hardware level there is no guard against message length mismatch, and hence the length of the message is sent first, followed by the message itself. Thus port_send and port_recv internally carry out two ChanOut and two ChanIn operations. The communication at the hardware level is synchronous and point-to-point. Upon receiving the message length in the first ChanIn, the receiving process blocks until it has read the entire message, since it is guaranteed to receive it. No packetization is required, and the communication is reliable.

• The virtual circuit must be set up using data-link level or network level functions. Port_send/port_recv use data-link level functions to set up the virtual circuit: the first message transfer takes place using dsend/drecv, with the KHOLD flag set. This is the notification to the Trollius kernel to bypass all system processes and reserve the physical link for the user processes. Once the virtual circuit is set up between neighbouring nodes, subsequent messages are transferred using ChanIn/ChanOut on hard channels. There is no limit on the length of a message transferred using ChanIn/ChanOut. However, data-link level functions do not provide packetization and the message length is limited to 4K bytes (a message larger than 4K bytes is lost). Hence the data-link level functions are modified to allow packetization, and these modified dsend/drecv are used to set up virtual circuits.

• The virtual circuits between any two nodes are unidirectional; the runtime system sets them up in one direction. This allows two pairs of processes to communicate, one pair in either direction, over the same physical link.

• Local communication takes place using soft channels. A soft channel is implemented as a word of memory. All the processes reading from and writing to a channel must know the address of that channel. This is accomplished as follows: initially each process sends the addresses of all its soft channels (created by ChanAlloc during the initialization phase) to all the communicating processes using ksend. Then each process receives, using krecv, the addresses of the channels to which its messages are sent. After this initial address transfer, message transfer takes place using ChanIn/ChanOut on the soft channels.

• The status of the channels is determined using the ProcAlt* functions (§5.2.1). A process can be ready for input from one or more channels. These functions block until one of the channels is ready for input or a specified time has elapsed (in the latter case, -1 is returned). In local communication, two soft channels are defined between every pair of communicating processes: one for input and one for output. Since they are implemented as shared memory, the input channel for one process is the output channel for the other. Each process creates a channel and sends its address to its peer. It is required that a process create its input channel and receive the address of its output channel from its peer, because only those channels that are created locally can be used in the ProcAlt* functions.
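The first point — sending the length ahead of the data because the hardware level has no length-mismatch protection — can be sketched in C as follows. This is an illustration of the protocol only, not the actual Parsec source: the wrapper names hard_channel_send and hard_channel_recv are hypothetical, and ChanOut/ChanIn are assumed to follow the conventional Logical C style signature ChanOut(Channel *c, void *buf, int nbytes), with the Channel type coming from the toolset headers.

    /* Two-step (length, then body) exchange performed by port_send and
     * port_recv on an established hard channel.  The explicit length
     * prefix substitutes for the missing hardware-level protection.   */
    static void hard_channel_send(Channel *out, void *buf, int len)
    {
        ChanOut(out, &len, sizeof(len));   /* first ChanOut: the message length */
        ChanOut(out, buf, len);            /* second ChanOut: the message body  */
    }

    static int hard_channel_recv(Channel *in, void *buf)
    {
        int len;
        ChanIn(in, &len, sizeof(len));     /* first ChanIn: learn the length    */
        ChanIn(in, buf, len);              /* second ChanIn: the message body   */
        return len;                        /* hardware-level transfer is synchronous */
    }

The same length-first convention is used on soft channels for local communication; only the channel setup (ChanAlloc plus the ksend/krecv address exchange) differs.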
Point-to-point Communication

Table 5.1 shows the various Trollius functions used by port_send and port_recv.

    Communication      Parsec Primitives                                        Synchronization
    Type/Scope         port_send                    port_recv                   Objects

    Synchronous        tsend                        trecv                       nodeid, event
    Buffered           nsend                        nrecv                       nodeid, event
    Neighbour only     dsend / ChanOut              drecv / ChanIn              event, hard channel address
    Local only         {ksend, krecv, ChanOut}      {krecv, ksend, ChanIn}      event, soft channel address
                       / ChanOut                    / ChanIn

                       Table 5.1: Trollius Function Invocation

The high-level primitives use the nodeid and event for rendezvous. For neighbour-to-neighbour communication, the first message is transferred using dsend/drecv (modified to allow packetization) to set up the virtual circuit; subsequent message transfers take place using channel primitives only. For local communication, ksend and krecv are used to broadcast the addresses of the soft channels, after which communication takes place using channel primitives.

Collective Communication

Collective communication is implemented using basic point-to-point communication. It supports one-to-many (scatter) and many-to-one (gather) communication. The port data structure stores the nodeids and the forwarding links of all the destination nodes (transputers). It is possible that some of the processes participating in the collective communication run on the local node while others run on adjacent and/or remote nodes. Each port_send and port_recv calls different Trollius functions depending on the location of the destination processes. Based on these locations (with respect to the given process) the following communication types are defined:

• Local only: All the processes are on the same node. As with point-to-point communication, the addresses of the soft channels are initially exchanged between every pair of communicating processes using kernel-level functions, followed by data transfer using channel primitives. However, the major difference lies in the events used by the Trollius kernel for rendezvous. With multiple local processes, different events are required by every pair of processes. Therefore, in the case of collective communication, the mapping tool generates one event for each port and one event for each route accessible from that port. We refer to these as the "port event" and the "route event".

• Neighbour only: All the destination processes are on adjacent nodes. First the virtual circuits are set up between all the neighbouring nodes, and then subsequent communication takes place using channel primitives. As in the "local only" case, "route events" are used in the data-link level functions for every adjacent node.

• Local and neighbour only: The destination processes are on the local and adjacent nodes (no process is on a remote node). As with "neighbour only", the virtual circuits are set up with the adjacent nodes, and the hard channels are used for communication. However, local communication uses kernel level functions only (no soft channels are used). Both events are used for local communication: messages are sent to all the local processes on the "route events", and they are received from the local processes on the "port event".

• Non-neighbour: Some destination processes are on remote nodes while others are on local and/or adjacent nodes. The local communication is similar to the local communication in "local and neighbour only". The non-local communication takes place using network level functions, in which case the "port event" is used for rendezvous.
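Because collective communication is layered on the point-to-point machinery, a scatter from one port amounts to looping over the stored destinations and dispatching each one according to its own scope and event. The sketch below is purely illustrative: the PortDest record, the scatter function, and send_to_dest are hypothetical names and do not reproduce the actual Parsec port data structure.

    /* Hypothetical per-destination record kept in a port's data structure;
     * the mapping tool supplies one entry per reachable route, including
     * the "route event" used for local and neighbour destinations.        */
    typedef enum { DEST_LOCAL, DEST_NEIGHBOUR, DEST_REMOTE } DestScope;

    typedef struct {
        DestScope scope;
        int       nodeid;       /* physical transputer address             */
        int       route_event;  /* per-route event (local/neighbour only)  */
        int       link;         /* forwarding link number                  */
    } PortDest;

    /* Stand-in for the kernel-level, channel, or network-level send path. */
    extern int send_to_dest(const PortDest *d, int event, void *buf, int len);

    /* One-to-many send: remote destinations rendezvous on the port event,
     * local and neighbour destinations on their own route events.         */
    static int scatter(const PortDest *dest, int ndest, int port_event,
                       void *buf, int len)
    {
        int i;
        for (i = 0; i < ndest; i++) {
            int event = (dest[i].scope == DEST_REMOTE) ? port_event
                                                       : dest[i].route_event;
            if (send_to_dest(&dest[i], event, buf, len) < 0)
                return -1;          /* propagate the first failure          */
        }
        return 0;
    }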
It is important to note that Trollius does not allow the same "port event" to be used in both the network level and the kernel level functions. For this reason, the value one is added to the "port event" (a positive integer) used for local communication.

The first two types ("local only" and "neighbour only") use low-level primitives. In both cases, an initial handshake is required between every pair of communicating processes. From the user's perspective this means that all the processes should call port_send or port_recv at least once before any data transfer takes place. The same is true for "local and neighbour only" since it involves setting up virtual circuits with the neighbours. This limitation is due to the handshake performed during the first invocation of the port_send and port_recv calls for each communication binding.

5.4 Performance Analysis

We now measure the performance of the Parsec primitives and compare it with that of the Trollius functions. Depending on the mapping done, port_send or port_recv invokes one of the Trollius functions described earlier. In this experiment the port primitives are forced to use different Trollius functions for communication between the same pair of processes. This is accomplished by changing the configuration information by hand (only the communication flag is changed, which selects the Trollius functions according to the decision tree given in Figure 5.1). Two processes are mapped onto two adjacent transputers which are connected by a direct link (i.e., no crossbar connection). In each case, 1000 messages of the desired size are sent from one process to another and received back. The RAW message type is used in the Parsec primitives and hence the cost associated with data conversion is avoided. The message-passing delays for the Trollius functions are calculated using the statistics presented in [12]. In both cases the round-trip timings are measured for the different communication levels and are shown in Table 5.2.

    message size    Parsec         time       Trollius      time
    (in bytes)      Primitives     (in Sec)   Functions     (in Sec)

    4096            Transport       7.850     Transport     Not
    4097            Level           8.834     Level         Available
    4096            Network         6.007     Network        5.968
    4097            Level           6.959     Level          6.520
    4096            Hard            5.135     Physical       2.230
    4097            Channels        5.153     Level          2.231
    4096            Soft            0.798     Kernel         0.640
    4097            Channels        0.799     Level          0.640

                    Table 5.2: Parsec: Message Passing timings

At the network and transport levels, both Trollius and Parsec use packets of 4K bytes each. Messages of two different sizes (4096 bytes and 4097 bytes) are sent to measure the cost of packetization. The communication cost of the Parsec primitives using hard channels is compared with the cost associated with the physical level Trollius functions. The Parsec primitives add considerable overhead because they use data-link level functions for the first message transfer to set up virtual circuits and channel functions for subsequent message passing; this cost (associated with the data-link functions) is included in the measurement. The second reason is that the Parsec primitives send the length of the message prior to the message itself (meaning there are two ChanIn and two ChanOut operations for each message), since the channel functions do not provide packetization. Finally, the two processes are run on the same transputer to measure the communication cost on the soft channels. This cost is compared with the kernel level Trollius functions.
For the Parsec primitives the cost includes the delay caused due to initial channel-address transfer between two processes. The Parsec primitives are built on top of Trollius and are bound to add extra over-head. Performance can be improved using efficient implementation mechanism. Our primitives use if-then-else construct to select the optimal Trollius functions. In another implementation of the Parsec runtime system [33] the function pointers are used with our optimization techniques to achieve better performance. Chapter 6 Programming A Network of Workstations In previous chapters we have seen the Parsec tools developed for the transputer-based multicomputer. Providing parallel programming tools for any one.type of multicom-puter restricts its wide-spread use as no one type is suitable for all kinds of parallel and distributed applications. Considerable effort is necessary to port these applications to another multicomputer. Hence it is desirable for a programming environment to support different hardware platforms and make the application-development process independent of the underlying architecture and its supporting operating systems. In this chapter, we describe the mapping tool and the runtime system implemented for a network of Sun Sparc workstations running the SunOS operating system. 6.1 Workstation Environment A workstation environment consists of a number of workstations, each with its own processor, local memory, and I/O device, connected on a local area network. It provides inexpensive shared computing resources to develop parallel and distributed applications. Networked workstations are easy to use and there are many software tools available for program development such as compilers, debuggers, and profilers. In recent years, many software environments have been developed to exploit these fea-tures [42, 41, 44, 40]. However, writing a parallel application in a shared and multiuser environment is still not a easy task. Parsec currently supports two different hardware platforms: a transputer system and networked workstations. Our main objective is to 77 Chapter 6. Programming A Network of Workstations 78 provide location transparency so that the user is able to run the same parallel application in either environment. Performance will, however, vary on different hardware architec-tures as each type is best suited for only a certain class of applications. By confining our scope to providing location transparency, the task of selecting the best hardware platform for a particular application is delegated to the user. A similar approach is pursued in Topsys [44] in that the programming environment, developed for iPSC multicomputer, has been ported to the Sun workstations. The design strategy and heuristics employed in the mapping tool and the runtime systems are hardware-dependent. We. discuss below the characteristics of a workstation environment that influence the design of these tools. Processor A l l o c a t i o n It is frequently observed that, with a number of hosts connected by a network, there is high probability that some hosts are heavily loaded while others are almost idle. This suggests that if the computation uses the available hosts in such a way that the load on a network is evenly balanced (i.e. the job is executed on a lightly loaded machine rather than on a heavily loaded one) then the system performance can be improved. This form of resource sharing by distributing the workload among the available hosts is known as load balancing or load sharing [62, 63, 64]. 
In a workstation environment, the mapping problem can be viewed as a processor allocation problem that tries to allocate processing power in an intelligent way. Networked workstations provide a multiuser, multitasking environment. Sharing the resources should not restrict other users in the environment. Therefore, the number of processors and the processors themselves may vary from one execution to another. This requires that the mapping tool keep statistics of the current load in the network and distribute the newly arriving jobs to optimize the processor utilization. Chapter 6. Programming A Network of Workstations 79 Idle workstations The computing power of the network can be fully exploited by utilizing idle workstations. It is believed that the majority of resources, including.powerful workstations, in current networks may lie idle for as much as 95% of the time [37]. The mapping tool should primarily use the idle workstations. Interprocess Communication Communication in a workstation environment is managed by the networking software which is a.part of the network operating system. In the case of Sun workstations, SunOS provides remote procedure calls (RPC), and socket-based and stream-based interprocess communication (IPC). These services follow well-defined, standardized protocols for dif-ferent communication characteristics such as flow control, retransmission, error recovery, and routing [16]. These are the lowest level (and hence the most efficient) services avail-able to the user. However, their primitives require a good understanding of the operating system. Therefore, they are not useful to the common user unless an easy-to-use and efficient mechanism is provided. In general, to develop a parallel application in a workstation environment, the user should address the above requirements. The mapping tool and the runtime system for IPC are designed to meet these requirements and are described in the next section. 6.2 Mapping Tool The mapping problem in a workstation environment differs from the one in a transputer-based multicomputer. In a transputer-based multicomputer, processes are assigned to processors in such a way that the total cost of communication is minimum. In a work-station environment, the problem is to assign processes to processors in such a way that Chapter 6. Programming A Network of Workstations 80 the load in the network is evenly balanced. In terms of graph theory, a network of workstations can be viewed as a complete graph where each node represents a workstation and each edge represents a communica-tion path between two workstations. Thus any two processes on different machines can communicate directly. The user perceives it as a neighbour-to-neighbour communication. In practice, the topology depends on how the network is configured. Typically, there are one or more subnets, each consisting of a few workstations connected by an ethernet, interconnected using gateways or routers. The gateways route data packets from one subnet to the other and communication protocols such as T C P / I P are responsible for the routing. The gateways should be included in the graph (they are like the other nodes except that no user process runs on them) as the processor allocation depends on the load on the subnets as well as on the gateways. However, to keep the allocation problem simple, gateways are not included in the graph which consists of the workstations only. 
The mapping tool is designed to assign processes to the workstations based on the following assumptions:

• In the case where the number of processes is greater than the number of processors, contraction is specified by the user.

• Only one process is mapped onto each processor unless contraction occurs.

• Only the load on an individual workstation is taken into account while allocating the processes. The cost of communication between two processes is not considered, because it is not possible to determine the communication cost statically. It depends on various factors such as the physical characteristics of a machine (e.g., CPU speed), the load on each machine (which changes dynamically in the multiuser environment), the location of the processes (i.e., whether they are on the same machine or on different machines), and the location of the machines (i.e., whether two machines are on the same subnet or different subnets).

A graph-oriented mapping strategy, similar to the one for a transputer system (§4.2), is adopted. A hardware database is maintained that stores the static characteristics of each workstation. These include CPU speed, memory, and internet address. This information is made available to the mapping tool by the resource allocation module. There were three phases in the transputer mapping tool: contraction, assignment, and routing. The mapping in a workstation environment takes place in two phases: contraction and assignment. Routing is done by the underlying software. The assignment module allocates each process to a processor in such a way that the load in the network is distributed. A simple greedy algorithm that maps the guest graph onto the system graph is presented below. For each node in the guest graph, the algorithm selects the workstation with the least load and assigns the process to it.

    begin /* mapping algorithm */
    1    G := { Guest Graph };
    2    S := { System Graph };
    3    UnMappedVg := { all vertices in G };
    4    UnUsedVs := { all vertices in S };
    5    while UnMappedVg <> {} do
    6        for Vg in UnMappedVg do
    7            for Vs in UnUsedVs do
    8                Load[Vs] := GetLoad(Vs);
    9            end do
    10           BestVs := GetBestNode(Load, UnUsedVs);
    11           map[Vg] := BestVs;
    12           Delete Vg from UnMappedVg;
    13           Delete BestVs from UnUsedVs;
    14       end do
    15   end do
    end

The various aspects of this algorithm are discussed below:

1. This is a static load balancing algorithm. All the processes are assigned to the processors before the job is started. The assignment decision is made deterministically and takes into account the average behaviour of the system.

2. The GetLoad function (Line 8) computes the load on each machine. The load index used to measure the load is the load average, which gives the average number of processes in the ready queue. The RPC function rstat is used to get the kernel statistics. The statistics include the number of processes in the ready queue averaged over 1, 5 and 15 minutes. The weighted average of these three averages is used as the load index.

3. One cannot isolate the physical characteristics of a machine (e.g., CPU speed, memory size) from its dynamically changing load. For example, when the load on two machines varies only slightly, the machine with the higher CPU speed may process the job faster even if it has a higher load. In general this is true, but it is difficult to identify the acceptable load variations in order to apply this heuristic.
It largely depends on the processor speed of the individualmachine as well as the job itself. However, we have divided the load into seven different ranges. The load thresholds for each range are: [0.000, 0.009], [0.0091, 0.001], [0.011, 0.09], [0.091, 0.1], [0.101, 0.5], [0.501, 1.0] and [loads greater than 1]. If two machines with different load fall within the same range then the machine with higher C P U speed, and large memory is selected for the node assignment. Given a set of machines and load on each machine, GetBestNode function (Line 10) finds the suitable machine as described above. In summary, the Parsec mapping tool for the workstation environment does a simple Chapter 6. Programming A Network of Workstations 83 processor allocation according to the load status. It differs from the programming envi-ronments which are primarily designed for networked workstations. In Framework [47], the user can specify the mapping constraints completely or partially using a graphical interface or may leave processor allocation entirely up to the system. An extensive load sharing facility is provided by Utopia [41] for a large heterogeneous distributed systems. Different algorithms are implemented for managing resource load information, for task placement, and for remote task execution. Some of the load indices used by Utopia are C P U speed, memory, disk I /O, and login sessions. A number of applications have been developed such as a load sharing command interpreter, a distributed batch facility, and a parallel make. The main emphasis of Utopia is on distributed computing, and it uses general purpose communication libraries (such as P V M ) for message passing between the processes in a parallel application. In other programming environments like P V M and Topsys, processor allocation is left to the user. This requires a detailed knowledge of the available resources, and their loads. Parsec provides a simple tool to relieve the user from this difficult task. 6.3 R u n t i m e System A l l participating Sun Sparc machines run under SunOS release 4.1.x. SunOS supports two types of IPC mechanisms: Socket-based and Remote Procedure Calls (RPC). Socket-based IPC provides the lowest level communication primitives. A socket is an endpoiht for communication (at transport level) and provides a descriptor for I/O over the network [6]. There are several types of sockets such as stream, datagram, raw, and sequenced packet sockets and each provides different services. For example, the stream sockets provide sequenced, reliable, two-way connection based on byte streams. A n explicit connection is set up between two processes prior to any message passing. The datagram Chapter 6. Programming A Network of Workstations 84 sockets support connectionless, unreliable communication. The user should check for flow control, sequencing, and length mismatch. Since it is a connectionless mechanism, the peer's address is required for all sendand receive operations. System calls are provided to control the socket operations for buffering, blocking/non-blocking, out-of-band message passing etc. RPCs are a communication paradigm built on top of low-level networking mechanism like T C P / I P and UDP/ IP . They present a client-server model. A client makes a procedure call to send a data packet to theserver, when the packet arrives, the server calls a dispatch routine, performs the requested service, and sends back the response after which the procedure call returns to the client. 
RPC services can be accessed at three different layers, with the highest layer entirely transparent to the operating system and the lowest layer directly using the network services. However, RPC is not an attractive choice for group communication as it cannot handle communication from one sender to multiple receivers [15]. We have chosen the socket primitives over the RPC mechanism because they are the most efficient and they are available on all Unix systems. This decision is critical to the future development of Parsec on heterogeneous machines.

6.3.1 Internal Features

As discussed in §2.5, the user specifies parallelism explicitly through the process, port, and channel objects. A channel specifies a communication path between any two processes. Ports are bound to each process and thus define the endpoints of a channel. Ports are therefore high-level abstractions for the sockets and encapsulate the configuration information generated by the mapping tool. The configuration information includes the event and the physical address of a machine. The port primitives are implemented for reliable communication using TCP sockets.

For TCP sockets, a two-way connection must be established between every pair of communicating processes. The initialization phase sets up the connection between two processes A and B as follows. One process initiates a connection to the other on its known network port address; both processes A and B must know the network port address in advance. On the transputer system, the events (positive integers) generated by the mapping tool are mapped onto the Trollius events. Similarly, in the workstation environment, the events are mapped onto the network port addresses used for naming the sockets. However, it is mandatory that only process A (B) initiate a connection and process B (A) accept it. Failing this requirement results in a deadlock, since both A and B are waiting to accept a connection from each other. The runtime system avoids the deadlock by having the mapping tool generate a unique event (network port address) for every process. Process A will either initiate a connection to a neighbour or accept one, depending on whether its event is less than or greater than its neighbour's event. In the case of more than one neighbour, the neighbours are sorted in ascending order of their events and a connection is established with each neighbour sequentially in that order. The function init_port performs this task. It must be called by each process prior to any message passing. Since the system calls accept and connect, used by init_port, are blocking, init_port returns successfully only when the TCP connection is established.

Once the initialization phase is over, port_send (the write system call) and port_recv (the read system call) are used for message passing. Currently, the sockets operate only in blocking mode and thus support only synchronous communication. With the TCP implementation, the kernel tries to batch up requests by waiting for some minimum amount of data to communicate. We have used the TCP_NODELAY feature to avoid this stream-oriented batching and thus reduce the overhead. Scatter and gather types of communication are simulated on top of basic point-to-point communication. Once the communication is over, all the socket descriptors are closed by calling exit_port before exiting the process. In the next section the performance of the runtime system is measured and compared with PVM3 [42].
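The deadlock-free connection ordering and the TCP_NODELAY setting can be illustrated with standard BSD socket calls. The sketch below is a simplified, hypothetical rendering of what an init_port-style routine might do for a single neighbour; it is not the Parsec source, and the Neighbour structure, connect_neighbour, and my_event are invented for the example.

    #include <netinet/in.h>
    #include <netinet/tcp.h>     /* TCP_NODELAY */
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    typedef struct {
        int            event;    /* mapping-tool event, doubles as the TCP port */
        struct in_addr addr;     /* internet address of the neighbour's host    */
    } Neighbour;

    /* Connect to (or accept from) one neighbour.  The process with the
     * smaller event listens and accepts; the one with the larger event
     * connects.  This fixed ordering is what prevents the mutual-accept
     * deadlock described in the text.                                    */
    static int connect_neighbour(int my_event, const Neighbour *nb)
    {
        int fd, one = 1;
        struct sockaddr_in sa;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;

        if (my_event < nb->event) {            /* passive side: accept  */
            int lfd = socket(AF_INET, SOCK_STREAM, 0);
            sa.sin_addr.s_addr = INADDR_ANY;
            sa.sin_port        = htons((unsigned short)my_event);
            if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) < 0) return -1;
            if (listen(lfd, 1) < 0) return -1;
            fd = accept(lfd, (struct sockaddr *)0, (int *)0);
            close(lfd);
        } else {                               /* active side: connect  */
            fd = socket(AF_INET, SOCK_STREAM, 0);
            sa.sin_addr = nb->addr;
            sa.sin_port = htons((unsigned short)nb->event);
            if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) return -1;
        }

        /* Disable the kernel's stream batching, as the runtime does. */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&one, sizeof(one));
        return fd;   /* descriptor subsequently used by port_send/port_recv */
    }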
6.4 Performance Analysis

Two Node Test

This example measures the communication time between two Sun Sparc workstations in a local area network using Parsec. Both machines run a network file system and are part of the Department's shared computing environment. A message of the desired size is sent from one node to the other and received back. The round-trip communication time is measured over a large number of iterations.

The same experiment is repeated using PVM3, and the communication times of Parsec and PVM3 are compared. PVM3 (release 3.2.6) supports two types of communication: TCP sockets between every two communicating processes, and local daemons which communicate using UDP sockets. The default is daemon-mediated communication. The TCP communication is set up by issuing a pvm_advise(PvmRouteDirect) call prior to communicating. Both types are used in the experiment. The following tables show round-trip communication times for message sizes ranging from 100 bytes to 1 megabyte, using 100 iterations.

                    Same Subnet                      Diff Subnet
    Msg. Size       Msg. Passing    Trans. Rate      Msg. Passing    Trans. Rate
    (Bytes)         Time (in Sec)   (Mbytes/Sec)     Time (in Sec)   (Mbytes/Sec)

    100               0.208076      0.048059           0.414514      0.024125
    400               0.279717      0.143002           0.527060      0.075893
    1000              0.382999      0.261097           0.782206      0.127844
    4000              0.948864      0.421557           1.419315      0.281826
    10000             2.369537      0.422024           3.835125      0.260748
    40000             8.879566      0.450472          14.671687      0.272634
    100000           22.012445      0.454288          35.786808      0.279433
    400000           88.589821      0.451519         137.850052      0.029017
    1000000         222.976151      0.448478         342.420776      0.292038

                    Table 6.1: Parsec: TCP communication

                    Same Subnet                      Diff Subnet
    Msg. Size       Msg. Passing    Trans. Rate      Msg. Passing    Trans. Rate
    (Bytes)         Time (in Sec)   (Mbytes/Sec)     Time (in Sec)   (Mbytes/Sec)

    100               1.265465      0.007902           1.449805      0.006897
    400               1.434604      0.027882           1.677073      0.023851
    1000              1.572177      0.063606           1.914482      0.052233
    4000              2.998988      0.133378           3.697648      0.108177
    10000             6.421205      0.155734           8.436946      0.118526
    40000            20.890930      0.191471          27.349926      0.146253
    100000           50.802238      0.196842          67.533371      0.148075
    400000          200.089066      0.199911         262.817474      0.152197
    1000000         493.379272      0.202684         639.039490      0.156485

                    Table 6.2: PVM3: daemon-mediated UDP communication

                    Same Subnet                      Diff Subnet
    Msg. Size       Msg. Passing    Trans. Rate      Msg. Passing    Trans. Rate
    (Bytes)         Time (in Sec)   (Mbytes/Sec)     Time (in Sec)   (Mbytes/Sec)

    100               0.487040      0.020532           0.640915      0.015603
    400               0.565463      0.070738           0.787222      0.050812
    1000              0.775981      0.128869           1.026842      0.097386
    4000              1.343468      0.297733           1.883596      0.212360
    10000             2.814400      0.355316           3.505754      0.285245
    40000             9.983309      0.400669          11.321166      0.353321
    100000           27.766409      0.360147          27.465364      0.364095
    400000          106.868057      0.374293         112.383034      0.355926
    1000000         267.402222      0.373968         287.563660      0.347749

                    Table 6.3: PVM3: TCP communication
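The figures in Tables 6.1–6.3 come from a simple ping-pong loop: one process sends a message of the chosen size, the peer echoes it back, and the exchange is repeated 100 times. Since the workstation runtime implements port_send and port_recv as write and read on the connected TCP socket (§6.3.1), the measurement reduces to the loop sketched below. This is an illustrative harness, not the actual test program; the full_read helper is added because read may return short counts, and the transfer rate is computed from the one-way data volume over the round-trip time, as in the tables.

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERATIONS 100

    static int full_read(int fd, char *buf, int len)
    {
        int got, total = 0;
        while (total < len) {
            got = read(fd, buf + total, len - total);
            if (got <= 0) return -1;
            total += got;
        }
        return total;
    }

    /* Time ITERATIONS round trips of msgsize bytes over a connected socket. */
    double ping(int fd, char *buf, int msgsize)
    {
        struct timeval t0, t1;
        double secs;
        int i;

        gettimeofday(&t0, (struct timezone *)0);
        for (i = 0; i < ITERATIONS; i++) {
            write(fd, buf, msgsize);          /* corresponds to port_send       */
            full_read(fd, buf, msgsize);      /* corresponds to port_recv (echo) */
        }
        gettimeofday(&t1, (struct timezone *)0);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%d bytes: %f s for %d round trips (%f Mbytes/s)\n",
               msgsize, secs, ITERATIONS,
               (double)msgsize * ITERATIONS / (1e6 * secs));
        return secs;
    }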
Parsec's communication times are consistently faster than both of the PVM3 options for all message sizes when the machines are on the same subnet. In the case of daemon-mediated UDP communication in PVM3, each message is copied twice, first at a daemon on the source node and again at a daemon on the destination node, before it is delivered to the destination process. This overhead is considerable for a communication-bound example like the one presented here. For the TCP sockets, there is the cost of establishing a TCP connection between every pair of processes. However, this cost is amortized when the total number of communications is large. In PVM3, two processes establish a TCP connection when they first communicate. This makes the first communication significantly costlier than subsequent communications. In Parsec, the start-up cost is hidden in port initialization, as all the TCP connections are established prior to any communication. PVM3 also differs from Parsec in that the communication buffers must be explicitly packed and unpacked; this cost is included in the round-trip communication time.

In all of the above cases, the transfer rate does not increase for large messages (of the order of hundreds of kilobytes). This is attributed to greater packetization cost and kernel overheads. For different subnets, Parsec is slower than PVM3 (TCP version) when the message size is greater than 10000 bytes.

In the following section an example is presented to show how program development is done in Parsec using the mapping tool and the runtime system.

6.5 Program Development in Parsec

This section describes the Parsec GUI required to create a parallel program structure and to run the mapping tool and the loader tool. The application, called Spring, takes as input any graph and "springs" it to output a new graph (hence the name Spring) which retains the node connectivity of the input graph. The example runs on a ring of eight nodes. One of the nodes, called the root node, is connected to a node, called the host node, outside the ring. A worker process runs on each node in the ring and the host process runs on the host node. The different phases in the processing are:

• Initially, the host process reads a graph (its vertices and edges) from a file and displays it on the GUI. It assigns the vertices to each node in the ring. The graph is then sent to the worker process on the root node. The other information sent to the root node includes the simulation parameters, the number of iterations (N) for simulation and communication, and the vertex assignments. The host process then waits for the result.

• The root node passes the entire data to the next node in the ring, from which it is propagated to all the nodes.

• Upon receiving the data, each node carries out simulation, followed by communication (data exchange with the neighbours). During simulation, each node computes new positions of the vertices assigned to it (based on the simulation parameters). The new positions are sent to the adjacent nodes during communication. Both steps are repeated N times.

• Finally, the new position of each vertex is sent to the host node, where the new graph is displayed in the GUI.

Figure 6.1 shows the process graph (an eight node ring) for the Spring application. A label associated with each node (or process) indicates the process type. All nodes of type Worker and Last Worker have two ports bound to them: "Prev" and "Next". The type First Worker has an extra port "Host" bound to the port "Root" of the Interface process type. The details about how to create this parameterized process graph are beyond the scope of the discussion and are contained in [32]. The software and hardware environments can be set up as shown in the Project Information window.

[Figure 6.1: Process Graph for Spring in Parsec]

Once the process graph is defined and code is linked with each process type, the following steps are carried out:
1. The code is compiled and executable images are produced using the Build Files and Make Binaries options under the Tasks choice in the Application menu (refer to the Project Tasks window in Figure 6.1).

2. The process graph is mapped onto the transputer network by invoking the mapping tool (Figure 6.2) from the Map Structure option in the Project Tasks window.

(a) As previously discussed in Chapter 4, the mapping tool can be run in two modes: Normal and No Backtracking (or Quick). For some process graphs, the time required for the mapping to complete can be unpredictable. Therefore, the user can set a time limit (in seconds) for the mapping tool to run before it terminates with or without a solution. The default time is zero and is ignored by the mapping tool; in this case, the mapping tool continues to run until it finds a solution. In the No Backtracking mode, the time limit is ignored and a solution (though not necessarily an optimal one) is guaranteed.

(b) The Network Root process is the process that is connected to the Host process. If this option is not set, the mapping tool will select the node with minimum degree from the process graph.

(c) Contraction is done manually and the mapping tool is notified by setting the permit contraction flag (this is necessary since the configuration information for the contracted processes is generated by the tool).

(d) The transputer network is divided into four components, with each component having one port connected to the host machine. The number beneath each component shows the total number of transputers in the component. In this example, component 1 is selected.

(e) Once these options are set, the mapping process is started by pressing the Map button. The mapping results (node assignments and edge assignments) are available in the output window.

[Figure 6.2: Graphical Interface for Mapping Tool in Parsec, showing the Mapping Options window and the mapper's output (node and edge assignments) in the Project Tasks window.]

3. Once the mapping is done, the loader tool generates a file to configure the transputer network (called the boot schema file) and to load Trollius, and also generates a script file to load the user processes onto the hardware and to pass the configuration information to the runtime system.
The loader tool (Figure 6.3) is invoked from the Load option in Figure 6.1.

[Figure 6.3: Graphical Interface for Loading Tool in Parsec, showing the files generated by the loader (network and switch configuration, startup and kill scripts) and the application execution options such as the default port semantics.]

Building makefiles and executable files (Step 1) is independent of the mapping and loading process (Steps 2 and 3). However, the loader must be invoked after the mapping is done, as the configuration information generated by the mapping tool is utilized by the loader.

The code fragments for each process type are shown below. The main objective is to show how the Parsec primitives are used in the program, and hence the computation part is omitted from these fragments. The following code, which runs on each node in the ring, uses the conditional macros ROOT and LAST to differentiate between the three process types: Worker, First Worker, and Last Worker.

    Param p;            /* data structure to send information between nodes */
    Port Next, Prev;
    #ifdef ROOT         /* macro for process type First Worker */
    Port Host;
    #define in Host
    #else
    #define in Prev
    #endif

    main (int argc, char *argv[])
    {
    #ifdef ROOT
        init_ports (argc, argv, &Prev, &Next, &Host, NULL);
    #else
        init_ports (argc, argv, &Prev, &Next, NULL);
    #endif
        while (TRUE) {
            set_initial_data ();    /* receive parameters from host/previous node */
            for (i = 0; i < N; i++) {
                simulate ();
                communicate ();     /* send/receive data from adjacent nodes */
            }
    #ifdef ROOT
            send_final_position (); /* send result to the host node */
    #endif
        }
    }

    void set_initial_data ()
    {
        port_recv (&in, &p, NULL, PARAM);
    #ifndef LAST                    /* macro for process type Last Worker */
        port_send (&Next, &p, NULL, PARAM);
    #endif
        /* initialize the adjacency matrix */
    }

    void communicate ()
    {
        pmsg sp;                    /* data structure to send vertices */

        if (myid % 2) {
            port_recv (&Prev, &sp, 0, POS);
            port_send (&Next, &sp, 0, POS);
        } else {
            port_send (&Next, &sp, 0, POS);
            port_recv (&Prev, &sp, 0, POS);
        }
    }

The code for the process type Interface is given below. This process runs on the host node.

    Param p;
    Port Root;

    main ()
    {
        /* GUI part to display input graph */
        /* read edges and vertices from a file */
        send_parameter ();            /* send data to the root node */
        receive_final_position ();    /* receive new graph info from root */
        /* send new data to GUI to display new graph */
    }

    void send_parameter ()
    {
        /* initialize p */
        port_send (&Root, &p, NULL, PARAM);
    }

    void receive_final_position ()
    {
        pmsg sp;
        port_recv (&Root, &sp, NULL, PARAM);
    }

In order to run the same application on a network of Sun Sparc workstations, the hardware and software environments are changed from the Project Information window (Figure 6.1).
Then the application is built again. Depending on the target environment, Parsec uses different system libraries and macros during compilation and linking. In Figure 6.2, the Host system is the workstation that is assigned as standard I/O for the application. All other options remain the same. The mapping tool assigns processes to different workstations and the loader tool then generates a script to load the processes onto those workstations. The application was tested in both environments. The performance of the application in the two environments is shown in Table 6.4.

    Hardware Platform    Programming Environment    Time (in Sec.)

    Transputers          Parsec                      0.8
    Sun (Sparc)          Parsec                      5.2
    Sun (Sparc)          PVM3                       20.8

                    Table 6.4: Spring: Performance

Chapter 7

Conclusions

In this thesis, we examined two major issues influencing communication in multicomputers:

• the mapping of a parallel program onto a parallel architecture

• a runtime system for optimized communication.

A mapping tool and runtime system were developed for a transputer system and for networks of Sun Sparc workstations. These tools are integrated with Parsec, a parallel programming environment for multicomputers being developed at UBC.

The mapping problem in a reconfigurable transputer network is a challenging problem. The mapping tool has been implemented on the 75 node transputer system, which supports a wide range of network topologies including binary tree, ternary tree, mesh, hypercube, and butterfly. A graph-oriented mapping strategy is proposed, in which a heuristics-controlled greedy algorithm is used to map various process graphs. The algorithm attempts to maximize the number of dilation-one edges. The mapping tool is a static compilation tool that converts a parallel program structure into detailed platform-specific configuration information. Once the mapping is done, the logical routes are analyzed to identify communication characteristics such as the multiplexing over physical links and the scope of each communication binding. The configuration information is then made available to the runtime system.

The Parsec runtime system on the transputer system is built on top of Trollius. Trollius offers numerous communication primitives of varying functionality and performance. Depending on the configuration information, the runtime system selects the most suitable Trollius primitives. If the communication is neighbour-to-neighbour or local and the physical links are non-multiplexed, then the runtime system uses the most efficient hardware-level communication primitives. Parsec provides two easy-to-use communication primitives: port_send and port_recv. They support point-to-point and collective communication, employing both synchronous and buffered semantics.

The Parsec primitives were compared with the raw Trollius functions. At the network level their performance is comparable to that of the Trollius functions. At the hardware level the Parsec primitives add extra overhead, but they relieve the user from setting up virtual circuits.

Both the mapping tool and the runtime system were ported to a network of Sun Sparc workstations to provide location transparency. A static load balancing algorithm has been implemented to allocate processes to processors such that the load is evenly balanced. The runtime system uses reliable TCP communication. The Parsec primitives are compared with the PVM3 primitives.
The Parsec timings were consistently faster (15% to 100% faster depending on the message size) than the P V M 3 timings except when the machines were on different subnets and the messages were large (of the order of tens of kilobyte). Finally, an example is presented to demonstrate how a parallel application is developed on a transputer network, and moved to a workstation environment without changing the code. Chapter 7. Conclusions 99 7.1 Future W o r k The work can be extended in each of the three areas: mapping problem, runtime system, and tools for different hardware-platforms. It is unlikely that an efficient algorithm will be found for the general mapping problem. Hence, different algorithms and heuristics can be employed for different topologies. A basic framework for the mapping process has been provided. Various algorithms can be implemented using this framework. By giving different choices for mapping, users can select the best achievable mapping for a wider class of topologies. Our transputer system has programmable crossbars switches in that multiple crossbar can be connected to one another giving better connectivity and thus increasing the range of topologies that can be mapped onto the system. The graph structures presented in Chapter 4 are used for a single crossbar connection between any two transputers. The algorithm can be modified to accommodate multiple crossbar architecture. This requires reconfiguration of the transputer network but has a potential for better mapping results. Our work can be extended to provide heterogeneity at an architectural level in a workstation environment. Currently, our tools are ported to Sun Sparc workstations and can easily accommodate different platforms like HPs, DECs, SGIs and RS6000s. Since different machines are best suited for different types of applications, incorporating them in Parsec can be an attractive choice to application programmers in various areas including graphics, image processing, numerical analysis and scientific computing. Developing parallel applications in these areas will justify the usefulness of our system. Bibl iography [1] Thomas H . Cormen, Charles E. Leiserson, and Ronald L. Rivest, "Introduction To Algorithms", The MIT press and McGraw-Hill Publication. [2] V . Kumar, A . Grama, A . Gupta, G. Karypis, "Introduction To Parallel Computing", Benjamin/Cummings Publishing Company. [3] M . Garey abd D. Johnson, "A Guide to the Theory of NP-Completeness", W.H.Freeman and Co., 1979. [4] F. Thomas Leighton. "Introduction to Parallel Algorithms and Architectures: Ar-rays.Tress.Hypercubes" 1991. [5] Swamy, Thulasiraman, "Graphs, Networks and Algorithms", A Wiley Interscience Publication. [6] W. Richard Stevens, "UNIX Network Programming", Prentice-Hall Publication, 1990. [7] Logical Systems, "Transputer Toolset, vol. 1 - Toolset Reference", Parallel Software Solutions, January 1992. [8] Logical Systems, "Transputer Toolset, vol. 2 - C Library Reference", Parallel Soft-ware Solutions, January 1992. [9] D. Fielding et. al, "The Trollius Programming Environment for Multicomputers", In A . Wagner, editor, Transputer Research and Applications 4 ( N A T U G 4), IOS Press, April 1990. [10] Ohio Supercomputer Center, "Trollius Command Reference", The Ohio State Uni-versity, March 1992. [11] Ohio Supercomputer Center, "Trollius C Reference", The Ohio State University, March 1992. [12] G. Burns, V . Radiya, R. Daoud, R. Machiraju, " A l l About Trollius", Occam User's Group Newsletter 1990. [13] D. 
Pountain and David May, "A Tutorial Introduction To O C C A M P R O G R A M -M I N G " , Inmos, BSP Professional Books, March 1988. 100 Bibliography 101 [14] Inmos Ltd., " O C C A M 2 Toolset User Manual", March 1991. [15] Andrew Tanenbaum, "Modern Operating Systems", Prentice-Hall International Edi-tion, 1992. [16] Andrew Tanenbaum, "Computer Networks", Prentice-Hall International Edition, 1991. Mapping related papers [17] Bokhari S., "On the Mapping Problem", IEEE Trans, on Computers, Vol. C-30, No.3, March 1981. [18] Berman F. and L. Snyder, "On Mapping Parallel Algorithms into Parallel Architec-tures", Proceedings of the 1984 International Conference on Parallel Processing. [19] F. Berman, "The Mapping Problem In Parallel Computation". [20] Soo-Young Lee, J . K . Aggarwal, " A Mapping Strategy for Parallel Processing", I E E E Trans. On Computers, Vol. C-36, No. 4, Apri l 1987. [21] Woei-Kae Chen, E. F. Gehringer, "A Graph-Oriented Mapping Strategy for a Hy-percube", A C M 1988. [22] J . Ramanujam, F. Ercal, P. Sadayappan, "Task Allocation By Simulated Anneal-ing". [23] F. Ercal, J . Ramanujam, P. Sadayappan, "Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning", Dept of Computer Science, The Ohio State Uni-versity. [24] B. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs", Bell System Technical Journal, 49(2), Feb. 1970. [25] H. S. Stone, "Multiprocessor Scheduling with the Aid of Network Flow Algorithms", IEEE Trans. On Software Engineering, Vol. SF-3', No. 1, Jan 1977. [26] V . M . Lo, " Heuristic Algorithm for Task Assignment in Distributed System", I E E E Trans. On Computers, Vol. 37, No. 2, Nov. 1988. [27] P; Sadayappan, F . Ercal, " Nearest-Neighbor Mapping of Finite Element Graphs onto Processor Meshes", I E E E Trans. On Computers, Vol. C-36, No. 12, Dec. 1987. [28] J . Boillat, N . Iselin, and P. Kropf, " M A R C : A tool for automatic configuration of parallel programs", University of Berne, Switzerland, from Transputing '91, Sunny-vale, California. Bibliography 102 [29] F. • Berman, "Why is Mapping Hard for Parallel Computer?", IEEE Parallel Dis-tributed Computing Networks Seminar, 1990. [30] F. Berman, "Experience with an automatic solution to the mapping problem", In L. Jamieson, D. Gannon, and R. Douglas,: editors, The Characteristics of Parallel Algorithms, MIT Press, 1987. [31] V . Lo, S. Rajopadhye et al, " O R E G A M I : Software Tools for Mapping Parallel Com-putations to Parallel Architectures", Dept. of Computer and Information Science, University of Oregon. [32] Feldcamp D., "A Hierarchical software development for performance oriented paral-lel programming", Masters Thesis, Univ. of British Columbia, Dec. 1992. [33] D. Feldcamp, A. Wagner, "Using the Parsec Environment to Implement a High-Performance Processor Farm", 1994. [34] H.V. Sreekantaswamy., "An Integrated Approach to Programming and Performance Modeling of Multicomputers", Ph.D. Thesis, Univ. of British Columbia, Oct. 1993. [35] U B C Transputer Group, "A tool environment for parallel and distributed comput-ers" Working paper, June 1992. [36] N . J . Goldstein, "A Topology Independent Parallel Development Environment" Mas-ters Thesis, Univ. of British Columbia, Apri l 1991. [37] Louis H . Turcotte, "A Survey of Software Environments for Exploiting Net-worked Computing Resources", a survey report available via anonymous ftp at unix.hensa.ac.uk [38] D. Feldcamp and A . 
Wagner, "Parsec - A Software Development Environment for Performance Oriented Parallel Programming", In S. Atkins and A . Wagner, editors, Transputer Research and Applications 6 ( N A T U G 6), IOS Press, May 1993. [39] A . Singh, J . Schaeffer, and M . Green, "A Template-Based Approach to the Gener-ation of Distributed Applications Using a Network of Workstations", IEEE Trans, on Paralle and Distributed Systems, Vol. 2, N o . l , January 1991. [40] Pok Sze Wong, "The Enterprise Executive", Department of Computing Science, University of Alberta, Technical Report T R 92-13, Oct 1992. [41] S. Zhou, J . Wang, X . Zheng and P. Delisle, "UTOPIA: A Load Sharing Facility for Large Heterogeneous Distributed Computer Systems", Technical Report CSRI-257, Computer Systems Research Institute, University of Toronto, Toronto, Canada, ' Apri l 1992. Bibliography 103 [42] A . Geist, A . Beguelin, J . Dongarra, W. Jiang, R. Manchek, and V . Sundaram, " P V M 3.0 User's Guide and Reference Manual", available on netlib. [43] A . Beguelin, J. Dongarra, A . Geist, R. Manchek, and V . Sundaram, "Graphical Development tools for network-based concurrent supercomputing", In Proceedings of Supercomputing 91, Albuquerque, 1991.. [44] T. Bemmerl, C. Kasperbauer, M . Mairandres and B. Ries, "Programming Tools for Distributed Multiprocessor Computing Environments", Report on TOPSYS and M M K , Technische Universitat Munchen, Germany, October 1991. available via anonymous ftp at topsys.informatik. tu-muenchen.de [45] F. Gregoretti, Z. Segall, "Programming for Observability Support in a Parallel Pro-gramming Environment", Department of Computer Science, Carnegie Mellon Uni-versity, Technical Report CMU-CS-85-176. [46] A . Kolawa, "Parasoft: a comprehensive approach to parallel and distributed com-puting", In Proceedings of the workshop on Cluster Computing, Tallahassee, F L , Dec. 1992. [47] A . Singh, J . Schaeffer, M . Green "A Template-Based Approach to the Generation of Distributed Applications Using a Network of Workstations", IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 1, Jan. 1991. [48] N . Carriero and D. Gelernter, "Linda in context", Communcations of the A C M , vol. 32(4) :444-458, Apri l 1989. [49] I. Foster and K . . M . Chandy, "Fortran M : A language for modular parallel program-ming" , Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, June 1992. [50], M . Rosing, R. Schnabel, and R. Weaver, "The DINO parallel programming lan-guage", Computer Science Department, University of Colorado at Boulder, Techni-cal Report CU-CS-457-90, Apri l 1990. [51] M . Rinard, D. Scales, and M . Lam, "Jade: A High-Level Machine-Independent Language for Parallel Programming", I E E E Computer, Vol. 26, No. 6, June 1993. [52] Message Passing Interface Forum, "Document for a Standard Message-Passing In-. terface", Technical Report No, CS-93-214, University of Tennessee, November 1993. available on netlib [53] A . Skjellum, S. Smith, N . Doss, A . Leung, and M . Morari, "The Design and Evolu-tion of Zipcode", Parallel Computing, 1993, (invited paper, to appear) Bibliography 104 [54] G. Wilson, "Via: A Structured Message-Passing System for Parallel Computers", Technical Report No. T R 91-16, Edinburgh Parallel Computing Centre, University of Edinburgh, 1991. [55] A . Skjellum, A . Leung, S. Smith, R. Falgout, C. Still, and C. Baldwin, "The Mul-ticomputer Toolbox - First-Generation Scalable Libraries", submitted to HICSS-27, June 1993. [56] A. Quealy, G. Cole, and R. 
Blech, "Portable Programming on Parallel/ Networked Computers Using the Application Portable Parallel Library ( A P P L ) " , N A S A Tech-nical Report, July 1993. [57] Chimp, "CHIMP Concepts", Edinburgh Parallel Computing Center, University of Edinburgh, U K , June 91, available on netlib. [58] T. Bolognesi, E . Brinksma, "Introduction to the ISO Specification Language LO-TOS", Invited Paper in F O R T E 88 at University of Sterling, Holland, Sept. 1988. [59] R. Duncan, "A survey of Parallel Computer Architectures", I E E E Computer, Febru-ary 1990. [60] Gordon Bell, "Ultracomputers: a terafldp before its time", Communications of the A C M , 35(8):27-47, August 1992. [61] S. Zhou, "A Trace-Driven Simulation Study of Dynamic Load Balancing" IEEE Trans. Software Eng., Vol. SE-14, No. 9, pp. 1327-1341, Sept 1988. [62] S. Zhou, "An Experimental Study of Load Balancing Performance," Proc. Winter USENIX Conference/Washington, D.C. , pp.73-82, January 21-24, 1987. [63] D. Eager, E. Lazowska, and J . Zahorjan, "Adaptive Load Sharing in Homogeneous Distributed Systems" IEEE Trans. Software Eng., Vol. SE-12, No. 5, pp. 662-675, May 1986. [64] D. Eager, E. Lazowska, and J . Zahorjan, "A Comparison of Receiver-Initiated and Sender-Initiated Dynamic Load Sharing." Performance Evaluation, Vol. 6, No. 1, pp. 53-68, Apri l , 1986. A p p e n d i x A Transputer System Layout The transputer system in the Department of Computer Science at U B C is shown below. There are four components residing in five physical boxes. Each component has one transputer connected to Xanadu, the host machine, on a data port. The transputers are either hardwired with one another or are connected to the crossbars. Figure A.6 shows the entire system as a 8x9 mesh with two extra nodes (T a and T&). Solid lines are hard-links between two transputers and dotted lines indicate intermediate crossbar connections. xbl i T b 0 2 3 T64 1 1 0 4 0 ~! 1 xanadu! 2 3 1 T 6 5 0 2 3 i T 6 6 o 2 3 i T 6 7 o J T 7 0 ° 2 3 2 1 3 Figure A . l : Component 1 (Box 1) xb6 105 Appendix A. Transputer System Layout 106 xb6 xb9 _30J -xb6 10 12 14 1 3 5 7 9 11 13 15 2 TO 3 2 T 1 3 2 j 2 3 2 T 3 3 1 0 1 0 1 0 1 0 i X 4 0 i X 5 0 i T 6 0 i T 7 0 2 T g 3 2 3 2 T 1 0 3 2 T 1 1 3 1 0 1 0 1 0 1 0 i T 1 2 0 1 T 1 3 0 2 3 l - f l 4 ° ' xanadu 0! • 1 Ta 2 0 3 1 | 10 12 14 1 3 5 7 .9 11 13 15 16 17 xb7 xb7 xb6 16 17 18 19 20 21 22 23 2 T16 3 2 T17 3 2 T l g 3 2 T 1 9 3 1 0 1 0 1 0 1 0 i T 2 0 0 i T 2 1 0 1 T 2 2 0 i T 2 3 o 2 3 2 3 2 3 2 3" i 24 I 2 5 126 |27 128 |29 130 !31 xb8 Figure A.2: Component 0 (Box 2) Appendix A. Transputer System Layout • xb2 22- —i xb8 2 T 2 4 3 2 T25 3 2 T26 3 2 T 2 7 3 1 0 1 0 1 0 1 0 i T 2 8 0 1 0 i T 3 0 0 i T 3 1 0 xb9 " • 3 xanadu, xb9 16 17 18 19 20 21 22 23 2 T 3 2 3 2 T33 3 2 T34 3 2 T35 3 1 0 1 0 1 0 1 0 1 T36 0 i T 3 7 0 i T 3 8 0 1 T39 0 2 3 2 3 2 3 2 3 24 25 26 27 28 29 30 31 xb2 Figure A.3: Component 3 (Box 3) Appendix A. Transputer System Layout 108 xb2 8 9 10 11 12 13 14 15 2 T40 3 2 x 4 1 3 2 T 4 2 3 2 T43 3 1 T 4 4 ° 1 T45 0 1 T 4 6 ° 1 T 4 7 0 2 3 2 3 2 3 2 3 8 9 10 11 12 13 14 15 xb3 xb4 xb3 3 T48 2 3 T 4 9 2 3 T52 2 3 T53 2 3 T50 2 3 T51 2 3 T54 2 3 TSS 2 • 2 xanadu • 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 xb3 Figure A.4: Component 2 (Box 4) Appendix A. Transputer System Layout 109 xb5 V T56 2 1 T57 2 1 T58 2 1 T59 2 16 17 18 19 20 21 22 23 xb4 16 17 18 19 20 21 22 23 1 T60 2 1 T61 2 1 T62 2 1 T63 2 • xbl Figure A.5: Component 2 (Box 5) Appendix A. 
Figure A.6: Transputer System as an 8x9 mesh
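A topology like the one above is naturally represented as an undirected graph that a mapping tool can read as its target machine description. The following is a minimal sketch in C of how the 8x9 mesh plus the two extra nodes might be encoded as an adjacency matrix; the node numbering, the link_nodes helper, and the attachment points of Ta and Tb are illustrative assumptions for this example, not the actual wiring tables or data structures used by Parsec or the mapping tool.

```c
/* Minimal, illustrative sketch: the 8x9 transputer mesh of Figure A.6
 * encoded as an undirected adjacency matrix. Node numbering and the
 * attachment points of Ta and Tb are assumptions made for the example;
 * the real wiring is given in Figures A.1 to A.5. */
#include <stdio.h>

#define ROWS  8
#define COLS  9
#define EXTRA 2                       /* the two extra nodes Ta and Tb */
#define NODES (ROWS * COLS + EXTRA)   /* 74 nodes in total */

static int adj[NODES][NODES];         /* adj[a][b] == 1 if a link exists */

static void link_nodes(int a, int b)
{
    adj[a][b] = adj[b][a] = 1;        /* transputer links are bidirectional */
}

int main(void)
{
    int r, c;

    /* Mesh edges: assume column-major numbering, so T0..T7 form the first
     * column; each node is linked to the node below it and to its right. */
    for (c = 0; c < COLS; c++) {
        for (r = 0; r < ROWS; r++) {
            int n = c * ROWS + r;
            if (r + 1 < ROWS) link_nodes(n, n + 1);     /* south link */
            if (c + 1 < COLS) link_nodes(n, n + ROWS);  /* east link  */
        }
    }

    /* Ta (index 72) and Tb (index 73): attachment points chosen
     * arbitrarily for illustration only. */
    link_nodes(72, 0);
    link_nodes(73, NODES - EXTRA - 1);   /* last mesh node, T71 */

    printf("encoded %d nodes\n", NODES);
    return 0;
}
```

An adjacency matrix is a reasonable choice here because the graph is small (74 nodes), so the quadratic storage cost is negligible and mapping heuristics can test for the existence of a link in constant time.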
