An Integrated Approach to Programming and Performance Modeling of Multicomputers by HALSUR V. SREEKANTASWAMY B.Engg., University of Mysore, India, 1981 M.Tech., Indian Institute of Technology, Kanpur, India, 1983 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF GRADUATE STUDIES DEPARTMENT OF COMPUTER SCIENCE We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA October, 1993 © Halsur V. Sreekantaswamy, 1993 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for refer ence and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not he allowed without my written permission. Computer Science The University of British Columbia 2075 Wesbrook Place Vancouver, Canada V6T 1Z1 Date: Il Oc€ T, 1?3 Abstract The relative ease with which it is possible to build inexpensive, high-performance multicomputers using regular microprocessors has made them very popular in the last decade. The major problem with multicomputers is the difficulty in effectively program ming them. Programmers are often faced with the choice of using high level program ming tools that are easy to use but provide poor performance or low level tools that take advantage of specific hardware characteristics to obtain better performance but are difficult to use. In general, existing parallel programming environments do not provide any guarantee of performance and they provide little support for performance evaluation and tuning. This dissertation explores an approach in which users are provided with programming support based on parallel programming paradigms. We have studied two commonly used parallel programming paradigms: Processor Farm and Divide-and-Conquer. Two runtime systems, Pfarm and TrEK, were designed and developed for applications that fit these paradigms. These systems hide the underlying complexities of multicomputers from the users, and are easy-to-use and topology independent. Performance models are derived for these systems, taking into account the computation and communication characteristics of the applications in addition to the characteristics of the hardware and software system. The models were experimentally validated on a large transputer-based system. The models are accurate and proved useful for performance prediction and tuning. Pfarm and TrEK were integrated into Parsec, a programming environment that supports program development and execution tools such as a graphical interface, mapper, loader and debugger. They have also been used to develop several image processing and numerical analysis applications. U Contents Abstract ii Table of Contents iii List of Tables viii List of Figures x Acknowledgements 1 2 xii Introduction 1 1.1 Motivation 2 1.2 Methodology 3 1.3 Synopsis of the Dissertation 5 Background and Related Work 2.1 8 Parallel Programming Approaches 8 2.1.1 High Level Approaches 2.1.2 Low level Approaches 10 2.1.3 Other Approaches 10 2.1.4 Parallel Programming Paradigms 11 9 111 2.2 3 14 2.2.1 Performance Measures and Models 15 2.2.2 Integration 18 Methodology 21 3.1 An Integrated Approach 22 3.2 System Model 23 3.3 Experimental Testbed 24 3.3.1 Hardware System 24 3.3.2 Software Environments 25 3.4 3.5 4 Performance Modeling Task-oriented Paradigms 26 3.4.1 Processor Farm 26 3.4.2 Divide-and-Conquer 29 Chapter Summary 32 Processor Farm: Design and Modeling 33 4.1 Pfarm: Design and Implementation 34 4.1.1 Process Structure and Scheduling 35 4.1.2 Task Scheduling 36 4.1.3 Buffering 37 4.1.4 Topology Independence 38 4.2 Performailce Modeling 38 4.2.1 General Analytical Framework 39 4.2.2 Balanced Tree Topologies 49 4.2.3 Communication Bound 57 iv 4.3 4.4 5 6 Discussion . 58 4.3.1 Optimal N and Topology 58 4.3.2 Problem Scaling 61 4.3.3 Granularity 62 Chapter Summary 63 Processor Farm: Experiments 65 5.1 Determining System Overheads 65 5.2 Arbitrary Topologies 67 5.3 Balanced Tree Topologies 70 5.3.1 Steady-state Validation 71 5.3.2 Start-up and Wind-down Validation 76 5.4 Robustness 77 5.5 Chapter Summary 79 Divide-and-Conquer: Design and Modeling 80 6.1 TrEK: Design and Implementation 80 6.2 Performance Modeling 85 6.2.1 Arbitrary Tree Topologies 86 6.2.2 Balanced Tree Topologies 93 6.2.3 Communication Bounds 98 6.3 6.4 Discussion 99 6.3.1 Optimal N and Topology 99 6.3.2 Problem Scaling 101 Chapter Summary 101 V 7 8 Divide-and-Conquer: Experiments 103 7.1 Determining System Overheads 104 7.2 Arbitrary Topologies 106 7.3 Balanced Tree Topologies 107 7.3.1 Steady-State 108 7.3.2 Start-up and Wind-down 108 7.3.3 k-ary Tasks on g-ary Balanced Topologies 111 7.3.4 Variable Split and Join Costs 112 7.3.5 Robustness 114 7.4 Comparison of Divide-and-Conquer with Processor Farm 115 7.5 Chapter Summary 117 System Integration and Applications 118 8.1 Parsec: An Integrated Programming Environment 118 8.2 Performance Tuning 122 8.2.1 Parameter Measurements 122 8.2.2 Performance Analysis Library 124 8.3 8.4 9 User Interface 124 8.3.1 Pfarm 125 8.3.2 TrEK 126 Applications 127 8.4.1 Cepstral filtering 127 8.4.2 Fast Fourier Transform (FFT) 128 Conclusions 132 vi 9.1 Future Directions . 135 Bibliography 137 Glossary 143 vii List of Tables 5.1 Comparison of predicted and measured results for 8 x 3 mesh 69 5.2 Comparison of predicted and measured results for 8 x 8 mesh 69 5.3 Range of processor farm experiments 70 5.4 Comparison of Predicted and Measured Total Execution Time for Pro cessor Farm running on Linear Chain 5.5 72 Comparison of Predicted and Measured Total Execution Time for Pro cessor Farm running on Binary Tree 5.6 73 Comparison of Predicted and Measured Total Execution Time for Pro cessor Farm running on Ternary Tree 5.7 74 Comparison of Predicted and Measured Total Execution Time for Pro cessor Farm running on Linear Chain under Communication Bound 5.8 75 . Comparison of Predicted and Measured Total Execution Time for Pro cessor Farm running on Binary Tree under Communication Bound 5.9 . Comparison of Predicted and Measured Total Execution Time ( . . 75 . Start-up & Wind-down) for Processor Farm running on Linear Chain, Binary Tree and Ternary Tree 76 5.10 Comparison of Predicted and Measured Total Execution Time for uniform task distribution for Processor Farm running on Linear Chain 7.1 Performance Comparison of three different BFSTs of the 8 x 3 mesh. 7.2 Range of Divide-and-Conquer steady-state Experiments vi” 77 . . . 106 108 7.3 Steady-state Performance Comparison for Divide-and-Conquer running on Binary Tree 7.4 109 Steady-state Performance Comparison for Divide-and-Conquer running on Ternary Tree. 7.5 109 Start-up and Wind-down Performance Comparison for Divide-andConquer running on Binary Tree 7.6 110 Start-up and Wind-down Performance Comparison for Divide-andConquer running on Ternary Tree 7.7 110 Comparison of Predicted and Measured Total Execution Time for Divideand-Conquer running on Binary Tree with M 7.8 = 1000. Comparison of Predicted and Measured Total Execution Time for Divideand-Conquer running on Ternary Tree with M =1000. 7.9 111 111 Comparison of Predicted and Measured Total Execution Time for Binary Divide-and-Conquer tasks running on Ternary Tree 112 7.10 Comparison of Predicted and Measured Total Execution Time for Divideand-Conquer Tasks with Variable Split & Join Costs 113 7.11 Comparison of Predicted and Measured Total Execution Time for Binary Divide-and-Conquer Under Split and Join Bound . . 114 7.12 Comparison of Predicted and Measured Total Execution Time for Binary Divide-and-Conquer Tasks with Subtasks of Unequal Size 115 7.13 Comparison of Total Execution Time for Binary Divide-and-Conquer Ap 8.1 plications with TrEK and Pfarm 116 Experimental results for FFT on a 16-node binary tree 131 ix List of Figures 3.1 Ideal Manager-Workers Architecture 27 4.1 Process Structure on a Worker Processor in Pfarm 36 4.2 An example of the steady-state analysis 41 4.3 (a) node graph (b) process graph (c) subtree decomposition 43 4.4 Binary and Ternary Trees with D 51 4.5 The affect of / 3 on efficiency 4.6 Plot of throughput curves for a linear chain (with Te !3e 4.7 = 4 and 3 55 = lOms, /3e = 482ts, 453is) 59 Comparison of processor farm throughput on linear chain, binary tree and ternary tree configurations 60 4.8 Measured speedup for processor farm on linear chain 62 4.9 The affect of granularity on speedup 63 5.1 Configurations for determining 5.2 Three breadth-first spanning trees of the 8 x 3 mesh 5.3 Error graph for processor farm on linear chain with tasks of bimodal 13e and j 3 1 66 68 distribution 78 6.1 Divide-and-Conquer Task Structure 81 6.2 TrEK Process Graph on an Intermediate Worker Processor 82 x 6.3 An example of the steady-state analysis 88 6.4 (a) node graph (b) process graph (c) subtree decomposition 89 6.5 Plot of throughput curves for Binary Divide-and-Conquer Tasks on Binary Tree 6.6 100 Comparison of divide-and-conquer throughput on binary tree and ternary tree topologies 101 6.7 Measured speedup for divide-and-conquer on binary tree 102 7.1 Configurations for determining /3e and 1 f 3 105 7.2 Three breadth-first spanning trees of the 8 x 3 mesh 107 8.1 Graphical Interface to Pfarm in Parsec 121 xi Acknowledgements First and foremost, I would like to thank both my supervisors Dr. Samuel Chanson and Dr. Alan Wagner for their guidance, support and encouragement throughout my thesis research. I was very fortunate to have Dr. Wagner as my supervisor, whose direction and continuous involvement made it possible for my thesis research to take this final shape. I thank the members of my supervisory committee Dr. Mabo Ito and Dr. James Little for their valuable input to my research. I thank Mandeep Dhami, David Feldcamp, Norm Goldstein, Kunhua Lin, Sameer Mulye and other past members of the parallel computation group for their cooperation during this research. I appreciate all the help provided by the system and administrative staff during my stay in the department. In addition, I would like to thank Chris Healey for proofreading parts of this thesis. My thanks go to many fellow graduate students (Don Acton, Ian Cavers, Parag Jam, Sree Rajan, Vishwa Ranjan to name a few) for their friendship, help and support. Many thanks are also extended to numerous friends in UBC and outside who made our stay in Vancouver enjoyable. I would like to acknowledge the financial support provided by the Canadian Com monwealth Scholarship and Fellowship Administration. I also thank the Government of India for sponsoring me for the fellowship. Finally, I thank my wife Latha for her love, endless support and patience during the last five years. xii Chapter 1 Introduction Parallel processing is becoming popular with the advent of inexpensive, powerful mi croprocessors made possible by advances in VLSI technology. Several kinds of parallel computer architectures [Dun9Oj have been proposed and built. These parallel computers have been used successfully to achieve remarkable performance for applications in several areas including scientific computing, and signal and image processing. The domain of parallel computer architectures includes Single Instruction Multiple Data (SIMD) ma chines, and Shared memory and Distributed memory Multiple Instruction Multiple Data (MIMD) machines {F1y72]. Distributed memory MIMD machines, generally known as multicomputers [AS88], consist of a number of processors each with their own local memory, connected by a message-passing network. Several research and commercial multicomputers such as Hypercubes[Sei85], Transputer-based systems{Lim88] and iWARP[K+90j have been available since the mid 1980s. Multicomputer architectures have several advantages. These machines are able to take advantage of the latest and fastest microprocessor tech nology making them cost-effective in comparison to other parallel architectures. They are easily scalable compared to other architectures. In the case of reconfigurable ma chines, it is possible to take advantage of the communication patterns of the problem to improve performance. With these advantages, multicomputers are gaining importance as general purpose parallel machines useful for applications in a wide range of areas [FJL+88]. The focus of this research is on multicomputers and their effective use. 1 Chapter 1. Introduction 1.1 2 Motivation The major stumbling block to the widespread use of multicomputers is the tremendous difficulty in effectively programming them. Software development for multicomputers has not kept pace with the advances in hardware technology [Kar87, KT88, CK91]. Message passing gives finer control over resources, but at the cost of added complexity. Users must address difficult problems such as partitioning, mapping, communication, synchronization, data distribution, and load balancing. There are a few high level languages based on functional and logic programming models [Kog85, Dav85, FT9O]. Those that are available provide high level abstractions with universal interfaces to all applications, but their overall performance is generally poor because of the difficulties in taking full advantage of the underlying structure of the application and the architecture. Most of the recent work on parallelizing com pilers [PW86, CK88, HKT91] has focussed on extracting loop level and lower levels of parallelism. They are restricted to exploiting parallelism in certain loop structures and thus can improve performance only for certain problems such as SPMD (single program, multiple data) type programs that are data-parallel. Most of the commercial multicom puters provide low level machine-dependent environments [Lim88, 1im87, Inc9l] that can be used to achieve high performance. These environments provide very low level programming abstractions that makes program design a complex process. This leads to higher software development costs and programs that are not easily ported to other machines. Difficulties in parallel programming do not end with the design and development of a working parallel program. The primary motivation for using parallel computers is to obtain higher performance for application programs. In general, existing programming environments do not provide any guarantee of performance, moreover they provide little support for performance evaluation and tuning. In the case of parallel systems, per formance depends on the computation and communication characteristics of a parallel program in addition to the characteristics of the hardware and software system. Users generally have very little knowledge about the performance of their programs until they Chapter 1. Introduction 3 are implemented and run. Even though one may think that using more processors will improve performance, this is not always the case. Simple models [5to88] have shown that using more than a certain number of processors for a given application may not improve performance. In practice, it may actually degrade performance. 1.2 Methodology Providing abstractions that are efficient and easy-to-use for programming multicomput ers is a difficult problem. One recent approach to reconciling ease of use and reuse with performance is the construction of software components (libraries, modules, templates, skeletons) based on the parallel programming paradigms that have appeared in the liter ature [K+87, Nel87, Co189, Pri9O]. These paradigms, taken together, represent the state of the art of parallel programming. Software components based on these paradigms can hide the complex distributed system code needed to implement the paradigm, thereby allowing the application programmer to concentrate on the computationally intensive code rather than parallelization and the coordination of the processors. Several projects such as Chameleon [A1v90], PIE [RSV9O] and VMPP [Gab9O] have looked at providing programming support for some of these paradigms on shared-memory machines. It is difficult to obtain a single performance model that can be used for all ap plications on a parallel system. Performance depends on the computation and com munication characteristics of the algorithm, in addition to the characteristics of the hardware and software system. There are some performance metrics, such as Amdahl’s serial fraction [Amd67], experimentally determined serial fraction [KF9O] and average parallelism [EZL89], which are based on simple characterizations of parallel systems. Although, they can be used to obtain rough bounds on performance, it is not as easy to use them for performance prediction and tuning. One approach to obtaining more accurate models that could be used for prediction and tuning is to model simpler and more restricted systems. Parallel programming paradigms are more restricted and suf ficiently general to be of more general use. Models based on paradigms can take into account the computation and communication characteristics of the applications and also the characteristics of the hardware and software system. Chapter 1. Introduction 4 Abstraction and added functionality that diminishes performance leads to constant re-design and re-implementation of the software component. Therefore, it is necessary to formalize these paradigms to better understand their expressiveness, their limitations, and most importantly their performance characteristics. It is important to understand the effect of scaling the component to execute on a larger number of processors. It must be possible to easily modify the behavior and performance characteristics of the compo nent in order to take advantage of application specific optimizations (e.g., fixed versus variable data packets). By understanding the behavior and performance characteristics of the paradigm, it may be possible to guarantee performance and provide guidance to the use and design of these paradigms on different topologies or systems with different primitives. The challenge is to construct a system based on paradigms that is reusable and achieves close to optimal performance. In this dissertation, we consider two task-oriented parallel programming paradigms, processor farm and divide-and-conquer. The processor farm paradigm is widely used in parallelizing applications in several areas [CHvdV88, BTU88, CU9O, CCT93j. Divideand-conquer is a well-known problem solving strategy in both sequential and parallel programming [AHU74, HZ83, GR88, Sto87J. The principal contributions of this dissertation research are: Development of performance models for two commonly used parallel programming paradigms: processor farm and divide-and-conquer. We have developed models that accurately describe the behavior and performance characteristics of processor farm and divide-and-conquer applications on multicom puters with the characteristics described in Section 3.2. These are realistic models that can help in understanding the capabilities and limitations of these paradigms. These models have been experimentally validated on a transputer-based system. They provide guidance for system design, and can be used for performance predic tion and tuning. • Design and development of execution kernels for processor farm and divide-and conquer applications. Chapter 1. Introduction 5 Execution kernels for both processor farm and divide-and-conquer have been de signed and implemented on a transputer-based machine. The systems are topology independent, i.e., they can be used on machines of any size and topology. They have been integrated into a programming environment that includes supporting tools such as a graphical interface, mapper, loader and debugger. Several applica tions have been developed using these kernels. 1.3 Synopsis of the Dissertation Chapter 2 provides an overview of the related literature that puts this dissertation work in context. It includes a discussion on various existing parallel programming approaches, highlighting their advantages and disadvantages, and comparing and contrasting our approach to these approaches. Existing performance measures and models for parallel systems are reviewed, emphasizing their applicability and limitations. Chapter 3 describes the integrated approach we have taken to address the program ming and performance modeling problems in multicomputers. This approach provides programming support based on parallel programming paradigms to the application pro grammers. We discuss the characteristics of processor farm and divide-and-conquer paradigms that are studied in this thesis. We describe how these paradigms can be used to parallelize several different applications. In Chapter 4, we describe the design of Pfarm, a topology independent processor farm runtime kernel, detailing the trade-offs involved to make it efficient. Pfarm implements a distributed dynamic task scheduling strategy. The affect of process structure, schedul ing, and buffering on performance has been investigated. We have developed a general analytical framework that can be used to derive performance models for processor farms on an arbitrary tree topology. For a fixed topology, we have shown that a breadth first spanning tree provides maximum performance, and the steady-state performance of all breadth-first spanning trees are equal. Since a processor farm system behaves like a pipeline, we have also analyzed start-up and wind-down. The ideal architecture for Pfarm is a balanced k-ary tree, where k is the number of links on each processor. Chapter 1. Introduction 6 Performance models for this case have been derived from the general framework. We also describe how the models can be used in performance tuning and restructuring of application programs. The Pfarm system has been implemented on a 75 node transputer-based machine. The performance models were experimentally validated as reported in Chapter 5. The models are sufficiently accurate that they can be used to predict performance of this design on any tree topology. The robustness of the model under our assumption of average task size was tested for uniform and bimodal distributions. The model was accurate for the uniform distribution. In the case of the bimodal distribution, we found that the model remained accurate as long as the arrival pattern of the two task types was mixed. In Chapter 6, we extend the design of Pfarm to provide runtime system support for divide-and-conquer applications. This system, called TrEK (Tree Execution Kernel), can execute divide-and-conquer computations of any degree and depth on an arbitrary tree topology. TrEK is designed to make use of intermediate processors for subtask processing in order to increase the overall performance. We expanded the general analytical frame work given for Pfarm to derive performance models for fixed degree divide-and-conquer applications on an arbitrary tree topology. Experimentally, we found that performance depends on the depth and number of leaves in the tree topology. Thus, on a fixed topol ogy, a breadth-first spanning tree with a maximum number of leaves achieves maximum performance. With a reconfigurable network, a g-ary balanced tree, where g is the num ber of links on each node, provides maximum performance. We derived models that can predict the performance of any fixed k-ary divide-and-conquer computation on any g-ary balanced tree topology. Chapter 7 describes the experiments conducted to validate the performance models for divide-and-conquer. The experiments show that our framework performs well even for applications that consist of a single large divide-and-conquer task in addition to those with a flow of tasks. In some cases, it is possible to use the processor farm strategy for divide-and-conquer applications. We found that TrEK outperformed Pfarm for applications with larger tasks and for those applications that consist of a smaller Chapter 1. Introduction 7 number of tasks. In order to make it easier for application programmers to use these paradigms on a multicomputer system, a programming environment that supports all phases of program development and execution is needed. In Chapter 8, we describe Parsec, an on-going project at the University of British Columbia in developing an integrated programming environment for the support of paradigms. Parsec provides Pfarm and TrEK with supporting tools such as a graphical interface, mapper, loader, monitor and debugger. We have also discussed applications that have been developed using Pfarm and TrEK. Chapter 9 provides a summary of the dissertation with a discussion of topics for future research. Chapter 2 Background and Related Work The relative ease with which it is possible to build inexpensive, high-performance mul ticomputers using commodity microprocessors has made multicomputers very popular. The major problem with multicomputers is the difficulty in effectively programming them. Programmers must either use a high level programming tool that is easy to use but provides poor performance or a machine dependent low level tool that can provide high performance but is difficult to use. To be successful, a parallel programming environment should address both the basic issues, namely programming and performance. First, programmers should be provided with easy-to-use programming abstractions that hide the complexities of the system. Second, the environment should be able to assist programmers in obtaining the maximum performance on a given parallel architecture for their applications. In this chapter, we provide a overview of the related literature. Section 2.1 describes various parallel programming approaches, emphasizing their advantages and disadvan tages. In Section 2.2, a review of the literature on existing performance measures and models for parallel computing is provided. 2.1 Parallel Programming Approaches In this section, various parallel programming approaches are reviewed, highlighting their advantages and disadvantages, and comparing and contrasting them with our approach. 8 Chapter 2. Background and Related Work 9 We also review the existing research on identifying parallel programming paradigms. 2.1.1 High Level Approaches Parallelizing Compilers Parallelizing compilers are aimed at extracting loop level and lower levels of parallelism in a sequential program. Considerable research work is being done in developing com pilers that automatically parallelize FORTRAN DO loops [PW86, CK88, HKT91J. The programmers write sequential programs in standard FORTRAN, and the compiler an alyzes data dependencies and uses parallelizing and vectorizing constructs to optimize the program for a given parallel hardware. With automatic parallelizing compilers, users need not be concerned with writing explicitly parallel code. In some cases, users can provide compiler directives for program partitioning and mapping. Compilers generally perform local optimizations which may not always lead to an overall improvement in performance. They have to use conservative values for data unknown at compile time. Parallelizing compilers have been successfully used on multicomputers for certain classes of problems such as SPMD (Single Program Multiple Data) programs. In comparison, we have studied task-oriented paradigms on multicompnters. Our approach concentrates on global optimizations that lead to overall improvement in per formance of application programs. This is done by considering classes of applications separately, and identifying their characteristics to decide on the necessary global opti mizations to efficiently run them on a particular hardware system. High-Level Languages There is a group of researchers that advocate the use of high-level languages based on functional, logical and data-flow models of computation [Kog85, Dav85, Den8O]. Using these languages, the programmer needs only to write a high-level declarative description of the algorithm, which is free of concurrency. The compiler and the runtime system produce code suitable for a specific parallel system. Chapter 2. Background and Related Work 10 The advantage of being able to write programs in a very high level is generally out weighed by the resnlting poor performance. It is very unlikely that the standard imple mentation decisions used by the compiler will be optimal for all situations. Programmers who understand the specific structure of their algorithms can always do better optimiza tions than the generalized transformations included in a compiler. In our approach, different classes of applications are considered separately, and good optimizations for each of them are obtained by understanding the structure of the underlying algorithms. 2.1.2 Low level Approaches In multicomputers such as the transputer-based [1im87] and C40-based [Inc9l, 1nc92j machines, the user is totally responsible for implementing parallelism. The program ming environments provided in these cases consist of languages such as occam [Lim84] or extended C that provide process creation and inter-process message passing. The pro grammer is responsible for partitioning the work into processes and mapping them to exploit the parallelism provided by the hardware. The programmer also has to manage communication between processes. It is possible to extract maximum performance out of the system if the programmer has good knowledge of the underlying hardware architecture and how well it can be used for a given application program. Even though high performance is achievable, it is difficult since the programming environments provide minimal support for these machines. In our approach, the programming system provided to the users efficiently imple ments parallelism on a given machine and manages communication among processors. The user has to concentrate only on the application dependent sequential code. 2.1.3 Other Approaches The Linda project advocates the use of a coordination language such as Linda in conjunc tion with computational languages like C or FORTRAN to make parallel programming easier [GC92]. A coordination language provides operations to create multiple execution threads (or processes) and supports communication among them. Linda[CG89, ACG91] Chapter 2. Background and Related Work 11 consists of a few simple tuple-space operations. Adding these tuple-space operations to a computational language produces a parallel programming dialect such as C-Linda. In this case, the programmers are responsible for creating and coordinating multiple execution threads. Linda provides a shared memory paradigm independent of the un derlying parallel architecture. The processes can communicate and synchronize using the tuple-space, which is in fact a shared memory. Implementation of the Linda tuple-space on a distributed memory machine is gener ally difficult since the tuple space has to be distributed and replicated, which can lead to poor performance. With our approach, the system takes responsibility for process creation and coordination, rather than the user. Foster and Overbeek[F090j propose an approach called bilingual programming. In this approach, the key idea is to construct the upper levels of an application in a highlevel language, while coding the selected low-level components in low-level languages. They argue that this approach permits the advantages of the high-level notation (ex pressiveness, elegance, conciseness) to be obtained without the usual cost in performance. They introduce a particular bilingual approach in which the concurrent programming language Strand [FT9O] is used as the high-level language and C or Fortran is used to code low-level routines. Strand provides a high level notation for expressing concurrent computations. With this approach, overall performance is determined by the decisions on how to partition concurrent processes into tasks and map them onto various nodes. The user is responsible for partitioning and mapping, although there are some tools which can provide guidance. Our runtime systems take care of creatin5g concurrent processes and communicating among them. Each system efficiently implements partitioning and scheduling for a particular class of applications. The performance models can be used for restructuring application programs to obtain better performance. 2.1.4 Parallel Programming Paradigms There are several well-known programming paradigms such as divide-and-conquer, branch-and-bound and dynamic programming techniques that are commonly used in Chapter 2. Background and Related Work 12 designing sequential algorithms. These paradigms are not exactly algorithms, but they are problem solving techniques or high level methodologies that are common to many efficient algorithms. We can find similar problem solving techniques that are commonly being used in designing parallel algorithms. Identifying and analyzing useful parallel programming paradigms will help the programmer in understanding parallel computation and in the difficult process of designing and developing efficient parallel algorithms. In general, programming paradigms encapsulate data reference patterns. In the case of parallel programming paradigms, they encapsulate underlying communication patterns. Since they identify useful communication patterns, they can help in designing architectures that can effectively support commonly used communication patterns. The analysis of these paradigms can provide guidelines for designing programming tools that can assist application programmers in obtaining better performance on a given parallel machine. The following paragraphs summarize the related research on identifying and under standing useful parallel programming paradigms. In 1989, Kung et. al. [Kun89j identified several computational models based on their experiences in parallel algorithm design and parallel architecture development. These models characterize the interprocessor commnnication and correspond to different ways in which cells in 1D processor arrays exchange their intermediate results during computation. The models are: 1. Local computation 6. Recursive computation 2. Domain partition 7. Divide-and-conquer 3. Pipeline 8. Query processing 4. Multi-function pipeline 9. Task Queue 5. Ring Fox [FJL88] and Karp [Kar87J have discussed SPMD paradigm for programming shared and distributed memory multicomputers. In the SPMD model, the same program is executed on all the processors. Processors communicate their intermediate results to Chapter 2. Background and Related Work 13 their neighbors and synchronize at a barrier point. Fox and others[FJL+88] have success fully used the SPMD model to solve a number of large applications on multicomputers. Nelson [Ne187] has studied compute-aggregate-broadcast, divide-and-conquer, pipelining and reduction paradigms for distributed memory parallel computers. He has discussed how these paradigms can be used to develop algorithms for solving many numerical and non-numerical applications. He has also studied the contraction problem, the problem arising when an algorithm requires more processors than are available on a machine, for algorithms based on these paradigms. Cole{Co189] advocates an approach in which the users are presented with a selection of “Algorithmic Skeletons” instead of an universal programming interface. Each skeleton captures the essential structure of some particular problem solving style or technique. To solve a particular problem, the user is required to select a skeleton which describes a suitable problem solving method. The procedures and data structures are added to the skeleton to customize it to the specific problem. Since each instance of these proce dures will be executed sequentially, they can be specified in any sequential programming language. He has discussed four different skeletons - Divide-and-conquer, Task Queue, Iterative Combination and Cluster. He proposes to embed suitable topologies for var ious skeletons on a grid architecture. In terms of performance, he has focussed on the asymptotic efficiency with which a large grid of processors can implement a system with respect to the performance of a single processor. Pritchard and Hey [Pri9O, Hey9O] discuss three useful paradigms for programming transputer arrays: Processor Farm, Geometric Array and Algorithmic Pipes. Processor Farm uses a manager-workers setup to solve an application that consists of a large number of independent tasks. Geometric Array is same as the SPMD model mentioned earlier and Algorithmic Pipes is similar to the pipeline approach. PIE project [RSV9O] uses parallel programming paradigms as an intermediate layer of abstraction, called implementation machine (IM) level, between the application level and the physical machine level for uniform memory access multiprocessors. Each TM has two representations: an analytical representation and a pragmatic representation. The analytical representation helps in predicting the performance of a class of applications Chapter 2. Background and Related Work 14 using the TM. The model predicts the upper bound and lower bound on performance of an application that uses this TM. A pragmatic representation of TMs is made available in the form of modifiable templates. All necessary communication and synchronization for the TM are correctly and efficiently implemented in the template. All the user needs to do is to insert the application dependent code. With the help of the TM layer, the user can write performance efficient parallel programs with relative ease. The aualytical models help the user to select the most appropriate or efficient TM for a given application and parallel machine. Two TMs, master-slave and pipeline, have been implemented on a Encore Multirnax, a bus-based shared memory multiprocessor. In this thesis, we explore an approach in which users are provided with program ming support based on parallel programming paradigms for multicomputers. This ap proach is similar to Cole’s proposal of Algorithmic Skeletons. In contrast to Cole’s theoretical study of how various skeletons can be implemented on a grid architecture, we have implemented runtime systems for two widely used paradigms, Processor Farm and Divide-and-Conquer, that are topology independent. Performance models derived in this thesis are analytical models, unlike Cole’s asymptotic models, and hence can be used in performance tuning. Our approach is similar to that followed by PIE. It differs in the underlying parallel architectures being considered and the apparent fact that we can obtain accurate models on multicomputers. 2.2 Performance Modeling In the case of sequential computation, performance can be adequately characterized by the instruction rate of the single processor and the execution time requirement of the software on a processor of unit rate. Predicting the performance of a parallel algorithm on a parallel architecture is more complex. Performance depends on the computation and communication characteristics of the algorithm, in addition to the characteristics of the hardware and software system. In order to use parallel systems effectively, it is important to understand the per formance of parallel algorithms on parallel architectures. This can help in determining Chapter 2. Background and Related Work 15 the most suitable architecture for a given algorithm. It can also help in predicting the maximum performance gain which can be achieved. In this section, we summarize the relevant research in understanding the performance behavior of parallel systems and highlight the applicability and limitations of each. 2.2.1 Performance Measures and Models It is a well known fact that the speedup for a fixed-size problem on a given parallel architecture does not continue to increase with an increasing number of processors, but tends to saturate or peak at a certain value. In 1967, Amdahl [Amd67j argued that if s is the serial fraction in an algorithm, then the speedup obtainable is bounded by 1/s even when an infinite number of processors are used. For an N processor system, speedup is given by 1 s + (1 — s)/N This observation, which is generally known as Amdahl’s Law, has been used to argue against the viability of massively parallel systems. In the recent years, researchers have realized that it is possible to obtain near-linear speedup by executing large problems. In 1988, Gustafson and others at Sandia National Lab [Gus88] were able to obtain near-linear speedup on a 1024-processor system by scaling up the problem size. Gustafson argues that in practice, users increase the problem size when a powerful processor is made available; hence, it may be more realistic to assume runtime as constant instead of problem size. He introduced a new measure called scaled speedup, defined as the speedup that can be achieved when the problem size is increased linearly with the number of processors. For an N processor system, scaled speedup is given by N + (1 — N) x s. Karp and Flatt [KF9O] have used experimentally determined serial fraction as a metric in tuning performance. The experimentally determined serial fraction, f is defined as 1/S 1/N 1—1/N — where S is the speedup obtained on an N-processor system. If the loss in speedup is only due to the serial component, that is, there are no other overheads, the value of f is Chapter 2. Background and Related Work 16 exactly equal to the serial fraction s used in Amdahl’s law. With the help of experimental results, they argue that this measure provides more information about the performance of a parallel system. If f increases with N, then it is considered an indicator of rising communication and synchronization overheads. An irregular change in f as N increases would indicate load balancing problems. Eager, Zahorajan and Lazowska[EZL89J use a simple measure called average paral lelism to characterize the behavior of a parallel software system. The software system is represented by an acyclic directed graph of subtasks with precedence constraints among them. Average parallelism is defined as the average number of processors that are busy during the execution time of the software system, given an unbounded number of avail able processors. Once the average parallelism A is determined, either analytically or experimentally, the lower bounds on speedup and efficiency are given by NA (N+A-1) and A (N+A-1) respectively. This measure can be used only if the parallel system does not incur any communication overheads or whenever these overheads can be easily included as part of the tasks. Kumar and Rao[KR87] have developed a scalability measure called the isoefficiency function, which relates the problem size to the number of processors necessary for an increase in speedup proportional to the number of processors used. When a parallel system is used to solve a fixed-size problem, the efficiency starts decreasing with an in crease in the number of processors as the overheads increase. For many parallel systems, for a fixed number of processors, if the problem size is increased then the efficiency in creases because the overhead grows more slowly than the problem size. For these parallel systems, it is possible to maintain efficiency at a desired value (between 0 and 1) for an increasing number of processors, provided the problem size is also increased. These systems are considered to be scalable parallel systems. For a given parallel algorithm, for different parallel architectures, the problem size may have to increase at different rates with respect to the number of processors in order to maintain a fixed efficiency. The rate at which the problem size is required to grow with respect to the number of processors to keep the efficiency fixed essentially determines the degree of scalability of Chapter 2. Background and Related Work 17 the parallel algorithm for a specific architecture. If the problem size needs to grow as fz(P) to maintain an efficiency E, then fE(p) is defined as the isoefficiency function for efficiency E. If the problem size is required to grow exponentially with respect to the number of processors, then the algorithm-architecture combination is poorly scalable since it needs enormously large problems to obtain good speedups for a larger number of processors. On the other hand, if the problem size needs to grow only linearly with respect to the number of processors, then the algorithm-architecture combination is highly scalable. Isoefficiency analysis has been used in characterizing the scalability of a variety of par allel algorithm-architecture combinations [GK92]. Using isoefficiency analysis, one can predict the performance of a parallel program on a larger number of processors after testing the performance on a smaller number of processors. Stone[Sto88] has used a simple model to determine how granularity affects the speedup on a multiprocessor. The model considers an application program that consists of M tasks and obtains the maximum speed with which this program can be executed on an N processor system. It assumes that each task executes in T units of time. Each task communicates with every other task at an overhead cost of T units of time when the communicating tasks are not on the same processor, and at no cost when the communicating tasks are on the same processor. The results of this model indicate that the speedup is proportional to N up to a certain point. After this, as N increases, the speedup approaches a constant asymptote which can be expressed as a function of the task granularity. This model gives a general picture of how granularity and over head affect the performance of a multiprocessor system. It also gives some indication of the importance of minimizing overhead and selecting a suitable granularity. Stone’s studies indicate that there is some maximum number of processors that is cost-effective, and this number depends largely on the architecture of the machine, on the underlying communication technology, and on the characteristics of each specific application. Flatt and Kennedy[FK89] have derived some upper bounds on the performance of parallel systems taking into account the effect of synchronization and communication overheads. They show that if the overhead function satisfies certain assumptions, then Chapter 2. Background and Related Work 18 there exists a unique value N 0 of the number of processors for which the total execution time for a given problem size is minimum. However, for this value, the efficiency of the system is poor. Hence they recommend that N should be chosen to maximize the product of speedup and efficiency and analytically compute the optimal values of N. A major assumption in their analysis is that the per-processor overhead grows faster than 0(N), which limits the applicability of the analysis. Performance metrics such as serial fraction (s and f) and average parallelism (A) are simple measures that can be used to obtain rough bounds on performance. These cannot be easily used for performance prediction and tuning, especially for multicomputers in part because they neglect communication overheads. Also, the values of these parameters often cannot be obtained easily. We have concentrated on considering the characteristics of both the system and the applications in order to obtain accurate performance models that can be used for performance prediction and tuning. Our models take into account all the communication overheads involved in implementing different classes of applications on multicomputers. The values of the parameters used in our models can be determined in a relatively easy manner, and we discuss some of the techniques for obtaining them. 2.2.2 Integratioti There has been very little work done in integrating performance tuning into program ming environments to provide performance-efficient parallel programming. To make best use of the underlying parallel architecture for an application program, in addi tion to programming support, users must be provided with performance models that can help in predicting how well their programs are going to perform. The environment should be able to assist the programmers in restructuring their applications to improve performance. PIE[SR85] addresses the issues of integrating performance tuning into the program ming environment through the support of specific implementation paradigms coupled with a performance prediction model. The model provides performance trade-off infor mation for parallel process decomposition, communication, and data partitioning in the context of a specific implementation paradigm and a specific parallel architecture. Chapter 2. Background and Related Work 19 Gannon[AG89] describes an interactive performance prediction tool that can be used by the user to predict execution times for different sections of a program. This perfor mance predictor analyzes FORTRAN programs parallelized by an automatic paralleliz ing and vectorizing compiler targeted for the Alliant FX/8. Programmers can use the predictor to estimate the execution time for a segment of the code produced by the compiler. The predictor uses a database to estimate the total number of CPU cycles needed for the segment. They also incorporate a simple model of memory contention into the predictor to include the effects of caching. This predictor can be used only to predict the execution time of a segment of the program; it does not give any specific information about the overall performance of a parallel program. Kennedy and Fox[BFKK91] have worked on an experimental performance estimator for statically evaluating the relative efficiency of different data partitioning schemes for any given program on any given distributed memory multiprocessor. The performance estimator is aimed at predicting the performance of a program with given communication calls under a given data partitioning scheme. This system is not based on a performance model. Instead, it employs the notion of a training set of kernel routines that test various primitive computational operations and communication patterns on the target machine. The results are used to train the performance estimator for that machine. This training set procedure needs to be done only once for each target machine, during the environment or compiler installation time. Although the use of a training set simplifies the task of performance estimation significantly, its complexity lies in the design of the training set program, which must be able to generate a variety of computation and data movement patterns to extract the effect of the hardware/software characteristics of the target machine on the performance. The authors argue that real applications rarely show random data movement patterns and there is often an inherent regularity in their behavior. They believe that their training set program will probably give fairly accurate estimates for a large number of real applications, even though it tests only a small (regular) subset of all the possible communication patterns. We have integrated the programming and performance tuning support into Parsec (described in Chapter 8). Parsec is an on-going project at the University of British Chapter 2. Background and Related Work 20 Columbia which is aimed at developing an integrated programming environment to sup port several parallel programming paradigms. It includes supporting tools such as a graphical interface, mapper, loader and debugger. The programmers can make use of performance models to predict the performance of their applications, and also can ob tain optimal values for system parameters such as the number of nodes and topology that can lead to maximum performance. The environment allows programmers to easily change the parameters and accordingly does the necessary mapping and loading. Some of the techniques that can be used to determine the values of the application dependent parameters are described in Chapter 8. Chapter 3 Methodology The diversity of parallel computing architectures and their underlying computation mod els makes it particularly difficult to find universal techniques for developing efficient parallel programs. Choosing an appropriate parallel machine for a given application is a difficult process. Furthermore, in most of the existing programming environments available on multicomputers, the user is responsible for managing both parallelism and communication. As explained in Chapter 2, identifying and analyzing useful parallel programming paradigms may help programmers in the difficult process of developing efficient parallel algorithms. In this chapter, we present an approach based on parallel programming paradigms for developing efficient programs for multicomputers. Section 3.1 describes an integrated approach we have taken to address the program ming and performance modeling problems on multicomputers. In Section 3.2, we ex plain the multicomputer system model used in this dissertation. Section 3.3 describes the transputer-based multicomputer system that is used as an experimental testbed in this research. This approach has been used to develop programming support and per formance models for two commonly used parallel programming paradigms, processor farm and divide-and-conquer. In Section 3.4, we discuss the characteristics of these two task-oriented programming paradigms and how these paradigms can be used for various kinds of applications. 21 Chapter 3. Methodology 3.1 22 An Integrated Approach This approach provides application programmers with abstractions based on commonly used parallel programming paradigms. The application programmers are provided with a set of Virtual Machines (VMs), where each virtual machine corresponds to a parallel programming paradigm. Each virtual machine consists of an analytical performance model, and an efficient runtime system that can be used to run applications that fit into the corresponding paradigm. The user has to choose one of the virtual machines that corresponds to the paradigm that can be used to solve his application problem. With this approach, the users are not responsible for implementing parallelism and communication. Each runtime system implements parallelism and all the necessary communication and synchronization needed for running the corresponding class of ap plications in an efficient manner. It also implements other system dependent aspects such as task scheduling and load balancing. Such a runtime system can be efficiently implemented by a systems programmer who understands the complexities of the under lying hardware and software system. The runtime system provides a simple interface to the user. The user has to write only the application dependent code and execute it with the runtime system. This approach eliminates the difficulties in programming multicomputers and reduces the software development cost. Also, as the user code does not contain any system dependent parts, it is portable across different machines and systems on which the virtual machine implementations are available. The analytical performance model helps in predicting the performance of application programs that use the particular virtual machine. Some of the parameters of the model are application program dependent and the others are dependent on the characteristics of the underlying physical machine and software system. The values of the system dependent parameters can be estimated once the rnntime systems are implemented. The application dependent parameters are either estimated or measured (either from a serial program or from a scaled down parallel program). Once these parameter values are known, the model can be used to predict the actual performance of an application program on a given parallel system. It can also be used in performance tuning, either to Chapter 3. Methodology 23 choose the optimal number of nodes to be used for a given application or to restructure the application to maximize its performance. Two virtual machines corresponding to commonly used parallel programming paradigms, processor farm and divide-and-conquer, were developed. In Section 3.4, we describe the characteristics of these two task-oriented paradigms. 3.2 System Model In this thesis, we consider distributed memory parallel computers (multicomputers). The system consists of processor nodes connected by an interconnection network such as a chain, tree, mesh, hypercube etc. with the underlying support for point-to-point communication. Each processor node consists of a CPU with its own local memory and hardware support for communication links. Execution kernel designs are based on the following assumptions on the characteris tics of the underlying system. The kernels assume a reliable point-to-point communica tion mechanism. Furthermore, it is assumed that the data can be transferred simulta neously on all the links and that data transfer can be overlapped with the computation. However, for each message, there is a CPU start-up cost that cannot be overlapped with the computation. This time may include hardware set-up costs, context switch times, and other system software overheads. Message start-up is an important overhead that significantly affects performance. In addition, it is assumed that the system supports concurrent processes on each processor with process scheduling and levels of priorities. This support could be available either in hardware (as in nhansputers) or in software (as in TI C40s). Performance models are developed assuming homogeneous processors and links. A linear cost communication model is assumed (i.e., for every message, there is a CPU cost for starting the communication, and a transfer cost proportional to the message size). The time for a processor node to send a message of length n-i to its neighbor is given by + rm, where j3 is the CPU start-up cost described in the previous paragraph and r is the transfer rate of the communication links. Chapter 3. Methodology 3.3 24 Experimental Testbed In this section, we describe the transputer-based multicomputer in the Department of Computer Science at the University of British Columbia that is used as an experimental testbed in this thesis. Performance models derived in Chapter 4 and 6 for processor farm and divide-and-conquer are applicable for any multicomputer system that satisfies the system model described in Section 3.2. In addition, the corresponding runtime system designs can be implemented on any similar multicomputer system. 3.3.1 Hardware System The transputer-based multicomputer consists of 75 T800 transputer nodes and 10 cross bar switches. The system is hosted by a Sun-4 workstation via VME bus, with 4 ports that connect the system to the host. There are 64 nodes with 1 MB of external memory and 10 nodes with 2 MB. There is a special node that has 16 MB, and is used as the manager node for both Pfarm and TrEK. The INMOS T800 transputer is a 32 bit microprocessor with a 64 bit floating point unit on chip. A T800 running at 20MHz has a sustained processing rate of 10 MIPS and 1.5 Mflops. It has 4 KB of on-chip RAM and four bit-serial communication links. These communication links allow networks of transputer nodes to be constructed by direct point to point connections with no external logic. Each link runs at an operating speed of 20 Mbits/sec and can transfer data bidirectionally at up to 2.35 MB/sec. The T800 processor has a microcoded scheduler that enables any number of concur rent processes to be executed together, sharing the processor time. It supports two levels of priority for the processes: high and low. A high priority process, once selected, runs until it has to wait for a communication, a timer input or until completion. If no process with a high priority is ready to proceed, then one of the ready low priority processes is selected. Low priority processes are time sliced every millisecond. A low priority process is only permitted to run for a maximum of two time slices before the processor desched ules it at the next descheduling point. Process context switch times are less than 1 s, as little state needs to be saved. The transputer has two 32 bit timer clocks. One timer Chapter 3. Methodology 25 is accessible only to high priority processes and is incremented every microsecond, and the other is accessible only to low priority processes and is incremented every 64 /ts. Each C004 crossbar switch has 32 data links and a control link. These are pro grammable through the control link and each can have 16 bidirectional connections. Thus, the system is reconfigurable and an appropriate interconnection network for an application can be chosen. As the system is not fully connected, there are some restric tions on the possible configurations. For Pfarm and TrEK, we statically configure the system into an appropriate interconnection network. 3.3.2 Software Environments Pfarm and TrEK have been implemented using C on two different software environ ments that are available on our multicomputer: Logical Systems and Trollius. The Logical Systems environment [Moc88, Sys9O] includes a library of process creation and communication functions. The environment has an utility called id-net that runs on the host and downloads the executable programs onto a network of transputers. In the Trollius environment [BBFB89, F+90], the programs are run on top of a kernel on each transputer node. The Trollius kernel manages and synchronizes any number of local processes. There are two levels of message passing in Trollius. The kernel level allows communication between processes on the same node. The network level allows communication between processes on different nodes, as well as between processes on the same node. There are four sub-levels of message passing within the Trollius network level, representing different functionality/overhead compromises. They are, in order of increasing functionality and overhead, the physical, datalink, network and transport sub-levels. Trollius provides both blocking and non-blocking communication functions. It also includes a set of utility programs that run on the host, which can be used to load, monitor and control the programs on transputer networks. Chapter 3. Methodology 3.4 26 Task-oriented Paradigms In this section, we describe the characteristics of two task-oriented parallel programming paradigms, processor farm and divide-and-conquer, that are considered in this thesis. 3.4.1 Processor Farm Processor farm is one of the most widely used strategies for parallelizing applications. There are a large number of scientific and engineering applications that consist of re peated execution of the same code, with different initial data. In addition, there is little or no dependency among the different executions of this code. These differ ent executions can be considered as a set of tasks and can be executed in parallel. When we use multiple processors to execute these tasks in parallel, even though all the processors are executing the same code, they may not be executing the same in struction at any given time as they execute different parts of the code depending on the initial data. Therefore, it is not possible to use SIMD (Single Instruction Multi ple Data) parallel machines to run these applications. These kind of applications can be run efficiently on MIMD (Multiple Instruction Multiple Data) parallel computers. Even though shared memory MIMD machines can be used for this class of applica tions {A1v90], distributed memory MIMD machines are being widely used because they are scalable and cost-effective. Processor farms are described in many transputer pro gramming books [Lim88, Ga190, Cok9l]. This strategy has been used in parallelizing applications in several areas [CHvdV88, BTU88, CU9O, SS91, CCT93, NRFF93]. The typical setup to run these applications is to use a “farm” of worker processors that receive tasks from, and return results to, a manager processor. Each worker pro cessor runs the same program and depending on the data received for individual tasks, it may execute the same or different parts of the program. No communication is re quired among the worker processors to execute the tasks. The manager processor has to communicate with the worker processors to distribute the tasks and to receive the results. The ideal architecture for this “manager-workers” setup is one in which each worker Chapter 3. Methodology 27 Figure 3.1: Ideal Manager-Workers Architecture processor is directly connected to the manager processor as shown in Figure 3.1. In practice, this can not be realized as the processor nodes in distributed memory ma chines generally have a fixed degree of connectivity (e.g., in the transputer case, each processor has four links and a network of transputers can be configured using crossbar links). It is possible to use crossbar switches to dynamically reconfigure the connection between the manager and the worker processors such that every worker will be connected to the manager directly for a certain period of time [Hom93]. The need for special hard ware in addition to not being scalable makes this configuration less useful. Also, many commercial machines use a fixed interconnection network such as mesh or hypercube for interconnecting their processor nodes. Thus, one has to use an interconnection net work in implementing processor farm applications, and the topology has to be chosen with a view of minimizing the communication costs in distributing tasks and collecting results. Since tree architectures that have a minimum possible number of hops to each worker from the manager incur minimum communication costs, they are best suited for implementing processor farm applications. There are several variants of processor farm that increase in applicability. In the generic description of a processor farm, the application program consists of a number of independent tasks that have to be executed on a network of worker processors. All the tasks may be of the same type in which case the workers execute the same code Chapter 3. Methodology or different parts of the code depending on the initial data. 28 The computation time requirements might vary from task to task depending on their data. The manager might have all the tasks readily available to start with, or there could be some computation involved in producing these tasks. Also, in the case of real-time applications, the data for individual tasks arrive in real time. Processor farms can also be used to execute applications that consist of multiple types of tasks. In this case, the worker processors have to be loaded with the code needed to execute each type of task with the appropriate code chosen, according to its type at runtime. Cramb and Upstill [CU9OJ discuss a processor farm implementation of such an application. It is also possible to use processor farms to execute application programs that consist of multiple phases of computation [Son93]. In these applications, the tasks might have some dependency. After a phase of computation, a new set of tasks for the next phase are generated based on the results of the current phase. Also, some applications might consist of a number of distinct phases of computations out of which some phases could be executed efficiently using a processor farm [BTU88]. In this case, the processor farm acts as a computational server that is called at different points in the application program to execute a set of tasks. Most of the processor farm designs discussed in the literature are overly simplistic and hide the issues that affect their performance and reuse. Poor reuse of code is a major problem in parallel programming and this remains the case in processor farm applications. There is very little work done in understanding how the various parameters such as the hardware topology, computation and communication requirements of the tasks affect overall performance. In addition to making it reusable and topology independent, the Pfarm system de scribed in Chapter 4 was designed considering all the factors that affect the performance. Performance models were derived from these realistic implementations of Ffarm. The models have been experimentally validated and are found to be accurate. Thus, they can be used in performance tuning to maximize performance. Chapter 3. Methodology 3.4.2 29 Divide-and-Conquer Divide-and-conquer is a well-known problem solving strategy used in deriving efficient algorithms for solving a wide variety of problems on sequential machines. Efficient divide-and-conquer algorithms have been used for solving problems in several areas such as graph theory, sorting, searching, computational geometry, Fourier transforms and matrix computations [AHU74, HU79, Sed83, Ben8O]. It is also a useful strategy in hardware design as mentioned by Uliman in [U1184]. The divide-and-conquer strategy can be briefly stated as follows: A large instance of a problem is divided into two or more smaller instances of the same problem. The results of smaller instances, called sub-problems, are combined to obtain the final result for the original problem. Sub-problems are recursively divided until they are indivisible and can be solved by a non-recursive method. On uniprocessor systems, after dividing the original problem, sub-problems are solved sequentially. Sequential divide-and-conquer results in a tree structure of sub-problems. Several researchers have discussed the usefulness of the divide-and-conquer strategy in parallel processing [HZ83, GM85, GR88, Sto87]. On parallel systems, sub-problems can be solved concurrently provided that the system has sufficient parallelism. Problem splitting and combining of results can also make use of the available parallelism. These operations require interprocessor communication for distributing the data and for receiv ing the corresponding results. As the sub-problems are independent, there is no need for communication among the processors working on different sub-problems. In its most general setting, a divide-and-conquer algorithm can be described as a dynamically growing tree structured computation, where initially there is a single prob lem and sub-problems are created as the problem is recursively divided. The number of subproblems and the depth of the tree may depend on the data and thus known only at runtime. For example, evaluation trees of functional and logic programs have the above characteristics. In the case of applications such as matrix multiplication and FFT, the degree of division and the depth of the computation tree is fixed. In some applications such as Quicksort, the problems are not equally divided. Chapter 3. Methodology 30 There are real-time applications in areas such as vision and image processing, in which there is a continuous stream of real-time data and each data set has to be processed with a divide-and-conquer algorithm. Also, there are some non real-time applications in areas such as numerical analysis that consist of a set of problems, each of which can be solved using a divide-and-conquer algorithm. In chapter 6, we describe the design of a runtime kernel called TrEK(Tree Execution Kernel) that provides runtime system support for divide-and-conquer applications on multicomputers. TrEK is designed such that it can execute divide-and-conquer compu tations of any degree and depth on an arbitrary tree topology. To improve the overall performance, TrEK makes use of the intermediate processors to process subproblems in addition to splitting and joining. The task-oriented framework chosen here assumes that there is a flow of divide-and-conquer tasks. This framework is well suited for ap plications that consist of a number of divide-and-conquer tasks. It can also be used for applications in which the problem consists of a single divide-and-conquer computation tree with large degree and depth. In this case, when we run such an application on a processor tree of smaller degree, the flow of tasks approximate the computation entering the subtrees. This framework allows us to derive performance models that accurately describe the behavior and performance characteristics of TrEK. In the following paragraphs, we discuss how this work contrasts with the related work in using divide-and-conquer for parallel processing. Horowitz and Zorat [HZ83] have outlined how appropriately designed multiproces sors whose logical structure reflect the tree structure of divide-and-conquer can be used to efficiently execute divide-and-conquer algorithms. They discuss the data movement problem in hardware configurations that have local memory, global memory via common bus and global memory augmented by local caches. Their proposal of a local memory architecture consists of processors with local memory connected as a full k-ary tree for executing k-ary divide-and-conquer problems. In contrast with their model, TrEK makes use of intermediate processors in addition to leaf processors for processing of subproblems to improve overall performance. Also, TrEK can execute divide-and-conquer problems of any degree and depth on distributed memory machines with any given topology. Chapter 3. Methodology 31 Stout [5to87] has discussed the usefulness of the divide-and-conquer strategy ou dis tributed memory parallel computers for solving image processing problems. He has presented a divide-and-conquer algorithm for the connected components problem. He has analyzed some of the requirements of this problem, and outlined some of the implica tions for machine architectures and software. Nelson [Nel87] has studied how the divideand-conquer paradigm can be used in designing parallel algorithms. He has discussed and presented parallel algorithms based on sequential divide-and-conquer algorithms for Batcher’s Bitonic Sort, Matrix Multiplication, and Fast Fourier Transform(FFT) prob lems. Contraction of these algorithms on a binary n-cube show that different algorithms required different contractions to obtain good results. Our work focuses on developing efficient programming support for divide-and-conquer applications on multicomputers to hide the underlying complexities of the parallel machine from the programmer. McBurney and Sleep [M588] have done experimental work on implementing divideand-conquer algorithms on a transputer-based system. Their ZAPP architecture is a virtual tree machine that is capable of dynamically mapping a process tree onto any fixed, strongly connected network of processors. Each processor runs a ZAPP kernel that implements the divide-and-conquer function. Each processor performs a sequential depth first traversal of the process tree, constructing sub-problems. Parallelism is introduced by allowing immediate neighbor processors to steal sub-problems. It is difficult to obtain any performance model for this framework. The task distribution strategy in TrEK differs from that used in ZAPP. Unlike ZAPP, TrEK does not allow a node to grab subtasks from its output queue for processing. This restriction allows us to model the system and does not degrade performance. Cole, in his algorithmic skeletons [Co189], has studied how well a divide-and-conquer skeleton can be implemented on a grid architecture. He proposes to use an H-tree layout to map tree processors to mesh processors, but has not implemented the system. In terms of performance, he has focussed on the asymptotic efficiency with which a large grid of processors can implement the system with respect to the performance of a single processor. In contrast, the TrEK design can work on multicomputers with any topology and has been implemented on a transputer-based multicomputer. Performance models Chapter 3. Methodology 32 derived in this thesis are analytical models, unlike Cole’s asymptotic models, and hence can be used for performance tuning. 3.5 Chapter Summary In this chapter, we have described an integrated approach for addressing programming and performance modeling problems on multicomputers. We have discussed the charac teristics of two task-oriented paradigms, processor farm and divide-and-conquer, that are considered in this thesis. The following chapters describe the design and implementation of runtime systems, development of performance models and their experimental valida tion on a 75-node transputer based multicomputer. An integrated programming envi ronment that includes programming tools such as graphical interface, mapper, loader, monitor and debugger in addition to virtual machines would make programming these machines much easier. Such an environment is described in Chapter 8. Chapter 4 Processor Farm: Design and Modeling In this chapter, we describe a processor farm and detail the trade-offs involved in its design. We derive models that accurately describe the behavior and performance char acteristics of the system. We give the limitations and assumptions on which the models are based and describe how the models were used in the design process. The models are sufficiently general that they can be used to predict performance of our design on any topology. Providing independence from both size and topology while maintaining the ability to tune the performance strongly supports reuse. In Section 4.1, we classify different types of processor farms and present our pro cessor farm system, Pfarm. We compare and contrast our design with that of others at appropriate places within this section. In Section 4.2, we derive general analytical models that describe the start-up, steady-state, and wind-down phases of the execution on any tree topology. As balanced tree topologies provide maximum performance among all topologies of the same size, we apply this modeling technique to derive expressions for balanced tree topologies. We close by discussing how our models can be used in performance tuning and restructuring of application programs. 33 Chapter 4. Processor Farm: Design and Modeling 4.1 34 Pfarm: Design and Implementation We begin by listing some of the important design issues and goals that must be addressed while designing and developing a system to execute applications efficiently using the processor farm strategy. 1. The system should fully exploit all the available parallelism in the hardware, such as the ability to simultaneously communicate on all links. 2. System overheads should be minimized. 3. The system should be topology independent, that is, it should scale and run on any processor topology. 4. The system should support reuse and provide an easy-to-use interface to the ap plication programmer. In a processor farm, control may either be centralized or distributed. In the case of centralized control, requests for work are sent to a central manager processor that assigns the tasks. Usually, as in Hey [Hey9Oj, these requests are routed through the network back to the manager. However, it is also possible to dynamically reconfigure the links, with the use of crossbar switches, so that the manager can load and drain tasks directly from each worker [Hom93]. In the case of distributed control, there is a manager on each processor. In this case, the tasks flow into the system and the local manager either schedules an incoming task to the local worker process or forwards it to a child processor. Distributed control processor farms are common and have been described by many authors (for example see Cok [Cok9lJ). Processor farms may be control driven or demand driven. A control driven scheme is useful when the work can be statically partitioned and assigned to the workers. A demand driven scheme, however, has the advantage that it can dynamically adjust to different sized tasks. Also, in a distributed processor farm, only neighbor to neighbor communication is necessary. Pfarm implements a distributed demand-driven processor farm. Chapter 4. Processor Farm: Design and Modeling 4.1.1 35 Process Structure and Scheduling When implementing a distributed demand-driveu processor farm, each worker processor consists of at least two processes: a task mauager process and a worker process. Since intermediate processors in the network distribute tasks and collect results in addition to processing, from the performance point of view, it is important to overlap communication with computation. It affects the rate at which tasks can be forwarded, which, as shown in Section 4.2, limits the performance of the system. Although the processor farm implementations that have appeared in the literature [Cok9l] mention the importance of overlapping communication with computation, most do not fully implement it. In order to overlap communication with computation, it is necessary to have multiple processes on each worker processor. As a result, Pfarm has one InLink and one OutLink process for each hardware link in the processor, in case of transputers, there will in total be 4 InLink and 4 OutLink processes. The process structure of a worker with three children is shown in Figure 4.1. This figure depicts a single worker in the system. The entire system consists of a collection of these workers, organized in a tree structure, which logically corresponds to the ideal processor farm structure given in Figure 3.1. The result manager process in Figure 4.1 coordinates the collection and forwarding of results to the manager processor (or root) of the system. Pfarm takes advantage of the transputer’s ability to use the links and the CPU simultaneously, and makes it possible to overlap computation with the transfer of tasks and results. Note that there is still a non-overlapped message start-up time associated with each communication. Another important design consideration is the scheduling of local processes. In order to keep the worker processors busy processing tasks, it is important to forward the tasks as quickly as possible. Therefore, all the processes that distribute tasks should be run at high priority. Pfarm uses the hardware scheduler provided on the transputer chip. In Pfarm, all the task communicating link processes and the task manager process are run at high priority, whereas the worker process is run at low priority. In addition, link processes that forward results and the result manager are also run at high priority. As described in Section 4.2.3, this is important whenever system throughput is bound by the rate at which the manager receives the results. Also, this returns the results to the Chapter 4. Processor Farm: Design and Modeling 36 Figure 4.1: Process Structure on a Worker Processor in Pfarm manager as quickly as possible which is important when there are dependencies among the tasks. 4.1.2 Task Scheduling Since Pfarm distributes the control, there is a task manager process on each worker processor. The local manager gives priority to the local worker process while allocating an incoming task. If the local worker process is busy, the manager attempts to forward the task to one of its children using a round robin strategy among the free OutLink processes. The order in which the tasks are assigned to the OutLink processes depends on the underlying topology and affects the start-up time of the system. An analysis of the affect of round robin scheduling of tasks to OutLink processes during start-up is Chapter 4. Processor Farm: Design and Modeling 37 given in Section 4.2. In Pfarm, we initially flood-fill the system with tasks. However, once fnll, the system is demand-driven with new tasks entering the system as tasks are completed. This scheme works well for start-np and steady-state bnt is not as effective for the wind-down phase. Dnring wind-down, there are no longer any incoming tasks and workers closer to the root may idle since remaining tasks are still being forwarded towards the workers farthest from the root. 4.1.3 Buffering The nnmber of task bnffers to be allocated to each processor is an important design issne. In Pfarm, each process in the task distribntion path shown in Fignre 4.1 can have only one task. This provides snfficient buffering so that a process never idles waiting for a task. For example, when the worker process finishes, there will be another task available at the manager process. If the manager process was not there and the worker received the task directly from the InLink, then it would have to wait whenever the InLink process was in the midst of a communication. We call the task bnffer in the manager process, an additional bnffer since it is there for synchronization purpose only. It is important to minimize the number of buffers because the wind-down time increases prop ortiojially with the number of buffers. This occurs because more tasks end up on the workers at the leaves of the tree, resulting in workers closer to the root idling (a complete analysis of wind-down is given in Section 4.2.1). The number of buffers adversely affects start-up as well, as we show in Section 4.2.1. However, the number of buffers does not affect steady-state performance. At any given time, each processor can be viewed to have four tasks assigned to it, one each to the task manager process, local worker process, local InLink process and OutLink process of the parent processor. Thus, in total there are 4N active tasks in any N processor system, irrespective of the topology. We dO not restrict the number of result buffers as this does not affect the overall performance. But, at any given time, the number of active result buffers on any processor is small as the result forwarding communication processes and the result manager are Chapter 4. Processor Farm: Design and Modeling 38 run at high priority. In contrast, the amount of buffering required in a centralized scheme depends on the task size and message latency for receiving a new task. Unlike Pfarm, as N increases, message latency also increases and therefore the number of buffers per processor has to increase. In the centralized scheme, there is also the extra overhead of sending and receiving task request messages. 4.1.4 Topology Independence Pfarm system is designed to be topology independent. For processors and links that satisfy the assumptions described in Section 3.2, the observations about the Pfarm design remain true, independent of the topology. Besides transputers, there are other machines such as iPSC hypercubes [Arl88] and TI C40 [Inc9l] with similar behavior. Moreover, as we show in the next section, accurate performance models can be derived for Pfarm on any tree topology. For a given fixed interconnection, by taking a spanning tree, it is possible to use Pfarm and derive a model to predict its performance. 4.2 Performance Modeling In summary, as a consequence of the design, Pfarm has the following characteristics: 1. The hardware system is a distributed memory message passing architecture with linear message cost model. 2. There is a continuous flow of tasks into the farm. 3. All tasks originate from a single source and results are returned to the source. 4. Tasks are dynamically distributed to the workers. Our objective is to find a distribution of the load to all the worker processors so as to minimize the total execution time, where load consists of both the computational requirements of the tasks and the associated overheads for forwarding and executing the tasks. Chapter 4. Processor Farm: Design and Modeling 39 The system can be either computation bound or communication bound. When it is computation bound, the system acts as a pipeline, with three phases to be analyzed: start-up, steady-state and wind-down. The start-up phase begins when the first task enters the system and ends when all the worker processors have received at least one task. After start-up, the system is in steady-state where it is assumed that the processors do not idle. Finally, the wind-down phase begins when the last task enters the system and ends when all the results have reached the source. The total execution time is given by, Ttotai where is the start-up time, TSS = (4.1) s+Twd 8 Tsu+T is the steady-state time and Td is the wind-down time. For a sufficiently large number of tasks, steady-state time dominates the remaining two phases. However, to better understand the limitations of processor farms with a smaller number of tasks, it is important to analyze start-up and wind-down. When the system is communication bound, the total execution time is determined by the rate at which either the tasks can be distributed to the worker processors or the results can be received from them. In Sections 4.2.1 and 4.2.2, we derive performance models for the case in which the system is computation bound. In Section 4.2.1, we present a general analytical frame work to analyze the steady-state performance of processor farms on any tree topology. We argue that, under reasonable assumptions, Pfarm obtains optimal performance on any topology. Later in this section, upper bounds for start-up and wind-down time are also derived. In Section 4.2.2, we derive steady state, start-up and wind-down models for balanced complete trees using the general analytical framework. Balanced complete trees are interesting because they provide maximum performance among all topologies of the same size. In Section 4.2.3, we analyze the performance of processor farms when they are communication bound. 4.2.1 Let {j General Analytical Framework T be a tree architecture with processors p1,... ,PN. pj is a child of p} denote the children of p in T. Let a = Let C(i) = Te + lie be the processing Chapter 4. Processor Farm: Design and Modeling 40 time plus associated overhead to execute a task locally, and let /3 be the processor over head for every task forwarded to a child processor. [ r includes all the CPU overheads 3 involved in receiving a task, forwarding it to a child processor, receiving the correspond ing result and forwarding it to the parent processor. Let d and r be the average data and result size per task, respectively, and let T be the communication rate of the links. Steady-state Analysis The steady-state phase begins once all the processors have a task to execute and ends when the last task enters the system. It is assumed that no processor idles during steady-state; a processor is either processing a task or busy forwarding tasks and results. Suppose that M tasks are executed during this phase and let V denote the number of tasks that visit p. Then, the following condition holds for all the processors in T, 88 T — c(Vj— = (4.2) f 3 /)+/ jEC(i) + (/3 jC(i) (4.3) - jEC(i) That is, the steady-state execution time (T ) equals the processing time with the asso 5 ciated overhead (cr) for all the tasks executed locally plus the overhead (/3) for all the tasks that were forwarded. .. For a fixed T , c, and 1 8 f, these N conditions form a system of linear equations on N 3 unknowns; Vi, V , 2 VN. By ordering the equations so that the parent of a processor in T appears before its children, it is easy to see that the system is in an upper triangular form. Thus, the N equations are linearly independent and there is a unique solution. At the root of T, V = M. Furthermore it is easily verified, by back substitution, that V 1 is a linear function of T 88 can be expressed in terms of M, a and / . Thus T 8 f, 3 which implies that given M, a and solve for fj, , we can solve for T 55 and V 1 to VN. We can also the fraction of the M tasks executed by the ith processor, jEC(z) ). Chapter 4. Processor Farm: Design and Modeling 41 In addition, we can obtain the steady-state throughput, the rate at which tasks can be processed by the system. An example of this analysis for an arbitrary architecture is given in Figure 4.2. a f—a 3 / f—a 3 / a a / f 3 —a a f—a 3 / a 88 T — 2 5a — 1 V V 2 3 V 4 V 5 V T 8 88 T 88 T 33 T 3M a 6af + Figure 4.2: An example of the steady-state analysis This analysis gives the execution time of the steady-state phase in terms of parame ters that can be determined prior to execution. The total number of tasks, M, is usually known. a can be estimated or measured by using the techniques that are described in Chapter 8. The Pfarm system is designed such that the two overheads 13e and j’ 3 / are application and topology independent. Therefore, they need only be determined once for a particular implementation of Pfarm. In Section 5.1, we describe how the values of these overhead parameters can be determined. As explained above, we can determine 8 T 8 for a given arbitrary tree, T. However, for an arbitrary topology there remains the question of which spanning tree to use for Pfarm. We show that all shortest-path, demand-driven distribution schemes with the same overheads (/3e and / ) are equivalent. 3 Let T (S) 5 8 be the execution time of a shortest-path, demand-driven distribution scheme S. Chapter 4. Processor Farm: Design and Modeling 42 Theorem 1 For any topology G, T (S) 5 5 equals T(Pfarm(T)) where Pfarm(T) is Pfarm executing on T, a breadth-first spanning tree of G rooted at the source of the tasks. Proof: Let L(i) of G denote the set of processors that are at distance i from the source of the tasks. Since S and Pfarm(T) are both shortest-path distribution schemes, tasks executed at a processor in L(i) must have been forwarded from processors in L(i—1). Let si(S) and s(T) denote the combined throughput of all the processors in L(i) for scheme S and Pfarm(T), respectively. We claim that for all i, si(S) = s(T). It is initially true for n, the last level, since in both schemes the processors in L(n) do not idle and can only execute tasks. In general, by induction on the level, the fact that processors do not idle and si(S) = s(T) implies that both schemes must execute the same number of tasks on processors in L(i — 1). Thus 1 s_ ( S) = s_ ( 1 T) and, in particular, se(S) so(T). Therefore for a fixed M, since the throughput for both the schemes are the same, (S) 5 T 8 = (Pfarm(T)). 8 T 5 0 It follows from Theorem 1 that for a fixed M, there is only one value of 8 T 8 that ensures that processors do not idle. This does not exclude the possibility that there exists some non-shortest path scheme or a scheme that introduces idle time that would perform better. However, since this either increases the overhead to forward a task to a worker or reduces the computational power of a processor, it would be surprising if it outperformed the work efficient scheme we have analyzed. Start-up and Wind-down Analysis Start-up and Wind-down costs depend on the task distribution strategy and the hard ware topology. For analyzing start-up and wind-down costs, we consider the structure of the underlying process graph. Given a tree architecture, the process graph of the system can be constructed by replacing each processor by the process structure given in Fig- Chapter LProcessor Farm: Design and Modeling 43 ure 4.1. We remove the processes that gather the results and only consider the processes that are involved in task distribution, that is, the worker process, the manager process and the link processes. The process graph of the architecture given in Figure 4.3(a) is shown in Figure 4.3(b). Notice that there are two types of edges in Figure 4.3(b): edges that represent the inter-processor communication and the intra-processor communica tion. 0 O O (a) e OutUnk InUnk Manager Worker split (b) (c) Figure 4.3: (a) node graph (b) process graph (c) subtree decomposition Let T be the process graph of an N node architecture. We will add, as part of T, an initial OutLink process which we take as the root of T. In total, T has 4N processes or alternatively T can be viewed as consisting of N subtrees (or nodes) of the type depicted in Figure 4.3(c). Start-up Start-up begins when the first task enters the system and ends when all the processors Chapter 4. Processor Farm: Design and Modeling 44 have received at least one task. The duration of the start-up phase depends on how the tasks are distributed to the processors. As mentioned in Section 4.1, the manager pro cess gives priority to the worker process over the OutLink processes while allocating the tasks. Thus, when a worker process becomes free, it will receive the next available task. The OutLink processes, when free, receive tasks from the manager process; if more than one OutLink process is free, the manager allocates the tasks in a round robin ordering of the OutLink processes. Thus, as the tasks start entering the system, a processor keeps the first task it receives and then, as long as the worker process has not completed the current task, distributes the incoming tasks to its children. In order to obtain an upper bound on start-up time, we discretize the start-up into a sequence of steps. On each step, a task is transferred from an OutLink process of one processor to either the worker process, or an OutLink process, or, when they all have tasks, to the last process along the path without a task. Furthermore, it is assumed that during the start-up phase, no worker process finishes its first task. This is a reasonable assumption since the task forwarding link processes run at high priority while the worker process runs at low priority. Assuming that there is a continuous flow of tasks into the system, we seek to bound the number of steps required before every worker process has a task. For analysis purposes, we use a collapsed process graph like the one shown in Fig ure 4.3(c). This graph is identical to the original architecture graph except that each node consists of the InLink, the manager, and the worker process of a processor plus the corresponding OutLink process of its parent processor. In this tree, all the tasks are initially at the root and on each step every node with more than one task forwards, in round robin order, the last task it received to a child. In order to analyze the worst case, we assume that the tasks can be buffered at the manager process and thus each subtree has an infinite capacity to absorb tasks. We first obtain an upper bound on the number of steps to distribute at least one task to every node. This is equivalent to the procedure described above, except that it takes one additional step at the end to ensure that a leaf node forwards its task to the worker process. Given a rooted oriented tree T, with child nodes numbered from left to right starting Chapter 4. Processor Farm: Design and Modeling 45 at one, let c(v) be the child number of node v with respect to the parent of v, p(v). Let p(v) denote the uth ancestor of node v in T, and let deg(v) be the down-degree of node V. Definition 1 Let d(v) equal 0 deg(p(v)). fl Lemma 1 For any rooted oriented tree T, the first task received by node v is the n—2 1 + c(p (v)) + 1 j=o task at the node p(v), the root of T. Proof: Let s(v, i) be the number of the ith task received by node v. Using the fact that the child c(v) receives every deg(p(v)) task arriving at p(v), except for the first which is kept by p(v), we obtain the following recurrence s(v, i) = s(p(v), 1 + c(v) + (i — 1)deg(p(v))). In general, s(p’(v), 1 + c(p°(v)) + (i s(v, i) s(v,i) — 1)do(p(v))) = s(p(p(v)), 1+ c(p(v)) + [1+ c(v) + (i = (v), 1 + c(p’(v)) + c(v)do(p 2 s(p (v)) + (i 2 = (v),1 + c(p(v)) 2 s(p = s(p(v)),1 +c(p(v)) At the root, s(v,i) = s(v, 1) - 1)do(p(v)) — - 1]do(p(p(v))) (v))) 2 1)do(p’(v))do(p + (i — 1)d(p(v))) + (i — 1)d_i(p(v))) i. Thus, when p(v) is the root, = 1 + c(p’(v)) (4.4) D Chapter 4. Processor Farm: Design and Modeling 46 Example: Let us derive the task number of the first task that arrives at node 5 in the graph shown in Figure 4.3(a) using equation (4.4). For node 5, n s(5, 1) = 2. Thus, = 1 + c(p (v)) + c(p°(v))do(p 1 (v)) 2 = 1+2+2x2 =7. The time step at which the worker process on node v receives task s(v, 1) is n + s(v, 1). This follows from the fact that task s(v, 1) leaves the root after s(v, 1) — 1 steps, and it takes n + 1 more steps for this task to reach the worker process on node v as this task gets forwarded in every step because s(p”(v), 1) < s(v, 1). The next theorem follows from these remarks. Theorem 2 For any tree structured process graph, after max {n+s(v,1)} v a leaf steps, every worker process has a task to execute. The time required for each step is determined by the communication cost to transfer a task from a processor to its child and the associated CPU overhead. The average communication time needed to transfer a task from one processor to its child is given by Td = d/T. The associated overhead is 2 f/ since /3 includes the overheads for both 3 / transferring a task to a child node and returning the corresponding result to the parent. Therefore, the time required for each step is given by Td + 3 j/2. From Theorem 2 it follows that 8 T = (Td+/3f/2) max {m+s(v,1)} (4.5) v a leaf For the tree shown in Figure 4.3(a), the start-up time is determined by the time required for node 5 to receive its first task. As derived earlier, s(5, 1) = 7 and m = 2. Chapter 4. Processor Farm: Design and Modeling 47 Thus, the start-up time for this example is given by = 9(Td + 2 f/ 3 / ) (4.6) The upper bound on the number of steps required for every worker process to have a task to execute can be exponential in N. Consider for example a tree with a long path where each node on the path has large degree but all nodes off of the path are leaves. The sum of products in s(v, 1) grows exponentially with respect to N. This is a result of our assumption that subtrees can always accept tasks. In practice, the upper bound cannot exceed the number of buffers in the tree, 4N. But it is possible to come arbitrarily close to 4N. As the upper bound is proportional to the number of buffers, for start-up, ideally the number of buffers should be minimized. This is one of the reasons we chose to have only one additional buffer on every worker processor. Although the degree and depth of the tree is fixed, it is possible to change c(v). Theorem 2 provides a means for determining an orientation of the tree that minimizes start-up time. Example: Earlier, for the tree shown in Figure 4.3(a), we found that s(5, 1) 7. If we change the orientation of the tree by interchanging nodes 2 and 3, the start-up time is still determined by the time required for node 5 to receive its first task. s(5, 1) = Now, however 6. In summary, to minimize start-up, the tree should be oriented so that the longest path appears on the left (that is, on start-up, tasks are forwarded first along the longer paths). Wind-down Wind-down begins when the last task enters the system and ends when the last result leaves the system. The Wind-down phase can be broken into two parts: the time to complete the remaining tasks in the system and the time to return results that are in the system after all the tasks have been executed. Consider the state of the process graph when the last task enters the system. As given in Section 4.1, there are 4N tasks. In order to derive an upper bound, let us Chapter 4. Processor Farm: Design and Modeling 48 assume that all the remaining tasks have just started to execute. We will bound the maximum number of tasks executed by a single processor during the wind-down phase. This gives an upper bound on the time taken to complete the 4N tasks. If all of the tasks have just begun execution, then after time c, each processor has executed one task. Since priority is given to distributing the tasks, some of the tasks buffered in each processor will be forwarded to the child processors. In the worst case, when c greatly exceeds , the tasks will be forwarded as far as possible towards the leaves. A worker process at a leaf can only execute those tasks available at its ancestor processes in the process graph. Therefore, we should consider the longest path from the root to a leaf in the tree (this is at most 3N + 1, for example see Figure 4.3(c)) to derive an upper bound. Let m be the number of ancestor processes of the leaf process in the longest path, each of which initially contains a task. Starting at the root, every third process along the path is adjacent to a worker process. Therefore, after time o, when each of the worker processes have finished executing a task, each worker process adjacent to the path will receive a task from a manager along the path. The remaining tasks on the path shift down towards the leaves filling as many buffers as possible. This results in the following recurrence for p(m), the length of the path, p(O) p(m) = p(n-l). Solving this recurrence for p(m) 1 shows that after log 312 ml + 1 steps, all tasks have been processed. Thus, the time taken for the first part of the wind-down phase is given by c(log m1 + 1). 312 The second part of the wind-down cost is determined by the time required to forward the last result from the leaf process to the root. If r is the average result size per task, the communication time needed to transfer a result from one processor to its parent is given by Tcr = r/-r. The associated overhead per transfer is / 2 since f3 includes the 3 overheads for both transferring a task to a child node and returning the corresponding result to the parent node. Therefore, the second part of the wind-down cost is given by Chapter 4. Processor Farm: Design and Modeling 49 m/3 x (Tc,. + 2 f/ since there are m/3 processors in the path. If m is the length of the 3 / ) longest path, the wind-down cost is given by Td = rn1 +l)+(Tcr+f/2). 312 (log (4.7) In the tree shown in Figure 43(a)), there are two longest paths, the path from the root to node 4 and the path from the root to node 5. In this example m = 9 and the wind-down time is given by Td (4.8) 7a+3(Tcr+/3f/2). This analysis depends only on the depth of the tree and therefore gives the same bound for all breadth-first spanning trees of the topology. Note, however, that the analysis is overly pessimistic since subtrees along a path from the root to a leaf also receive tasks from managers on the path. The actual wind-down time also depends on the number of nodes along the path with down degree greater than one. The fewer the number of nodes of degree one, the smaller is Td. Because of our round robin scheduling policy, any node of degree greater than one can only forward at most one task towards the leaf of the chosen path out of every three tasks that arrive at the node. Therefore, the number of tasks on the path decreases more quickly and if every processor on the longest path has at least two children, then the number of tasks executed by a leaf is bounded by max{ log 3 ml + 1, 4}. Leaves must execute at least 4 tasks, namely those in their local buffers. For the example topology shown in Figure 4.2, the total execution time for processing M tasks is given by Ttotai 9(Tcd + ) 2 f 3 / / + — 4.2.2 20) + 7c + 3(Tcr + 2 f/ 3 / ) . 6c3 + 2i3 (M 3 a 52 — (4.9) Balanced Tree Topologies In this section, we analyze the performance of processor farms on balanced tree topologies using the framework given in the previous section. Balanced tree topologies are of interest because (a) a k-ary balanced tree topology, where k is the number of links on each node, Chapter 4. Processor Farm: Design and Modeling 50 provides optimal performance, and (b) for balanced trees, it is possible to obtain closed form solutions for system throughput and speedup. For processor farms, a k-ary balanced tree topology provides optimal performance among all the topologies of the same size for the following reasons: 1. From Theorem 1, we know that the best topology for processor farms is a breadth first spanning tree. As a k-ary balanced tree is a spanning tree with minimum possible depth among all degree k graphs, it provides a steady state performance that is as good as any possible spanning tree. 2. As explained in Section 4.2.1, start-up cost is proportional to the length of the longest path in the graph. It also depends on the ordering of the children, and can increase when the graph is unbalanced. The length of the longest path in a k-ary balanced tree topology is minimum among all the graphs with the same number of nodes. The symmetry in balanced tree implies that the orientation of the tree does not affect the start-up cost. Thus, a balanced k-ary tree topology has minimum possible start-up time among all trees with the same number of nodes. 3. As shown in Section 4.2.1, wind-down cost is proportional to the length of the longest path in the graph. The wind-down cost also decreases as the number of children at processors in the longest path increases since these children steal tasks from the path. As a result, this decreases the number of tasks forwarded to the leaf processor in the longest path. As mentioned earlier, the length of the longest path in a k-ary balanced tree is minimum among all trees of the same size. In addition, all the non-leaf nodes in a balanced k-ary tree have the maximum possible number of children and thus reduce the number of tasks forwarded to any particular leaf. Thus, a complete k-ary tree has minimum possible wind-down time among all the trees with the same number of nodes. As defined in Section 4.2.1, let o be the average processing time plus the overhead to execute a task locally and let / f be the overhead for every task forwarded to a child 3 processor. Consider a D level balanced k-ary tree with k processors on level i. Figure 4.4 shows binary and ternary tree topologies with D = 4 and 3 respectively. Chapter 4. Processor Farm: Design and Modeling 51 (a) Binary Tree (b) Ternary Tree Figure 4.4: Binary and Ternary Trees with D = 4 and 3 We begin by analyzing the steady-state phase, followed by an analysis of start-up and wind-down. Steady-state Analysis Let M be the total number of tasks processed during the steady-state phase. Assuming that no processor idles during the steady-state, all the processors at a particular level of the tree execute the same number of tasks. We can then express the steady-state time ) for each processor in terms of the number of tasks processed and forwarded and 85 (T their associated costs and overheads. From the general analysis in Section 4.2.1, we have 8 T ci(Vj = ) + /3 — jEC(i) (4.10) jEC(i) Since all the processors on a level execute same number of tasks, by summing equa tion (4.10) over all the processors on level i, we have )+3 = i i i jC(i) jEC(i) (4.11) = i+1 Let L, = Vj Chapter 4. Processor Farm: Design and Modeling 52 be the number of tasks that visit a processor on level i. Therefore, 58 T oL+(/3f—o)kL+l. = By rearranging the above, we have k(c and L 0 oL = — — 55 T (4.12) M. The above recurrence is of the following form described in Chapter 2 of Knuth’s Concrete Mathematics [GKP89]. an k(c—/3f) = = = 8 —T Sn Solving the recurrence, we obtain i L M [k f)] Let [ 1— 1— k(-f) a= Then L (i 1/a 1—1/a — = Ma = MazT/1. — c SinceL=Ofori=D, MaD mID — - TSS — — — (1-1/a) MQ(1 1/a) (1_i/aD) — (4.13) Chapter 4. Processor Farm: Design and Modeling 53 By substituting for a, we obtain M[a—k(o—3 ) 1 ] — 5 T — (4.14) (k(f))D — Discussion From the steady-state execution time given by equation (4.14), we can derive expressions for throughput and speedup. The steady-state throughput of a D level balanced k-ary tree is given by SD M = I 1 a—k(ci—/3f) (k(_f) 1 ) (415) The steady-state speedup of a D level balanced k-ary tree is given by SPD=I k(a — = D — k(3) (4 16) Here, speedup* is defined as the ratio of the execution time of the parallel algorithm on a single processor (execution time includes the associated overhead in addition to task processing time) to the total execution time of the algorithm on the parallel system. We can also determine the fraction of the total number of tasks that are executed on each processor. For a processor on level i, it is given by L—kL 1 M — 417 By substituting equation (4.14) for T 5 in equation (4.13), Maz L = — M(a a —(a —1) aD — — 1) [a 1]’ aD aD_i a = *This is different from the usual theoretical measure of speedup, the time of the most effective serial algorithm divided by the time of the parallel algorithm. Chapter 4. Processor Farm: Design and Modeling 54 By substituting the above in equation (4.17), aD a 3 — 1 a = — — k a’ — a’ — 1 aD(l_k)+ai+l_ai — 418 aD_i — Now consider the affect of 13 on the overall performance. Let g be equal to the ratio /3f/a. This is the inverse of granularity which is generally defined as the ratio of computation to communication overhead. The speedup given in equation (4.16) can now be expressed in terms of g as follows: SPD = a a—k(a—/3f) 1j a — 1—k(1—g) = [k(1 When g —* - 0, we have 1 SPD = — kD 1-k =N, the total number of processors (i.e., when overhead j 3 I is negligible compared to a, speedup is proportional to the number of processors in the system). When g 0 < 1 — —* 1 (i.e., when overhead /3 is almost equal to a), SPD g < 1, (1 — = 1. For the case g) is the factor by which the processing capacity of a processor is reduced due to the overheads involved in dynamically distributing the work. Figure 4.5 shows how overhead / f affects the efficiency. 3 In case of a linear chain of N nodes, speedup is given by N N= 1 g _N 1 g (.) as the second term in equation (4.19) becomes negligible. Thus l/g gives an upper 1—(1—g) For large N, SPN= 9 bound on speedup on a linear chain. Chapter 4. Processor Farm: Design and Modeling 55 0. 0. Efficiency 0. 0. 0.2 0.4 0.6 0.8 granularity Figure 4.5: The affect of /3 on efficiency The analytical results derived here for k-ary balanced trees are similar to those ob tained in [Pri87, Pri9O, TD9Oj. The difference is in the way the overhead parameters are included in the model. With proper substitutions, it is possible to obtain their through put expressions from our model. Pritchard has analyzed processor farm on a linear chain and his model [Pri87, Pri9O] is an abstract one which uses the characteristics of the machine as the overhead parameters to the model. As a result, it does not take into account the scheduling strategy and the associated software overheads. Tregidgo and Downton [TD9O] have extended Pritchard’s analysis for balanced binary and ternary trees. Again, these models only considered the hardware characteristics as the overhead parameters. Tregidgo and Downton have validated their model using a simulator. Con trary to statements in [TD9O], the model they derived also holds for distributed farms as well as small centralized farms. In comparison, we have provided a general framework to derive the performance models for processor farms on any topology. These models assume a realistic dynamic scheduling strategy, and account for all the associated software overheads. Also, we have analyzed start-up and wind-down phases, which are significant for applications consisting Chapter 4. Processor Farm: Design and Modeling 56 of a smaller number of tasks. The models have been experimentally validated and are accurate as discussed in Chapter 5. Start-up Analysis The start-up analysis presented in Section 4.2.1 for arbitrary tree topologies can be used to obtain start-up costs for balanced tree topologies. The start-up cost for an arbitrary topology is given by equation (4.5), which is reproduced below, 8 T f/2) max {m+s(v,1)}. 3 (Td+1 v a leaf = (4.20) In a balanced tree, the rightmost leaf node will be the last among all the nodes to receive its first task. Thus, the start-up time is determined by the number of steps required for the rightmost leaf node to receive its first task. For a D level, k-ary tree, n. = D — 1. For the rightmost leaf node, s(v, 1) can be obtained using equation (4.4), and is given by kD — 1 k—i the total number of nodes. Thus, for a balanced k-ary tree of D levels, the start-up cost is given by 8 T = (N + D — 1)(Tcd + /3/2). (4.21) Wind-down Analysis In this section, we use the wind-down analysis presented in Section 4.2.1 for arbitrary tree topologies to derive the wind-down time for balanced trees. The wind-down time is given by equation (4.7) which is reproduced below: 312 a(rlog Td ml + 1) + + ). 2 / (4.22) where m is the length of the longest path in the process graph. For a N node linear chain topology, the wind-down analysis presented in Section 4.2.1 holds with m = 3N. The wind-down time is given by Td = 312 (log 3N1 + 1) + N(TCr + 2 f/ 3 / ) . (4.23) Chapter 4. Processor Farm: Design and Modeling 57 As explained in Section 4.2.1, in a balanced tree with degree greater than one, the number of tasks in the longest path decreases more quickly. For this case, the number of tasks executed by the leaf node on the longest path is given by max{1og 3 For a D level k-ary balanced tree, m Td = = ml + 1, 4}. 3D, and wind-down cost is given by 3D1 +1)+D(Tcr+/3f/2). 3 (rlog (4.24) The total execution time for processing M tasks on a D level k-ary balanced tree is given by Ttotai = = 88 + Td +T (N + D - + (M 1)(Tcd + - 1 3 3D1 + 1) + D(Tcr +€(rlog 4.2.3 4N)[ — - k(- )] (k(a_/3f)) + /3/2). Communication Bound The performance models derived in Sections 4.2.1 and 4.2.2 are applicable only when the system is computation bound. If the system is communication bound, it may never reach steady-state and processors may idle. In this section, we analyze the performance of processor farms, when the system is communication bound. There are two cases in which the performance of a processor farm system might be communication bound. The first corresponds to the actual transfer portion of the com munication; the second corresponds to the CPU overhead required for communication. Here, we assume that the links are homogeneous. Case (1) In the first case, throughput is limited either by the the rate at which the farm can receive tasks or the rate at which it can transfer results to the manager, whichever is smaller. Let T = max{Tcd,Tcr}. The system throughput corresponding to this limit is Scorni = T±’ (4.25) Chapter 4. Processor Farm: Design and Modeling 58 where [3 is the processor overhead required to receive a task from a parent or to send a result to the parent. [ c includes the overheads required to make the newly arrived task 3 available for processing (or forwarding), to allocate a new buffer for the Inlink process to receive the next task and to initiate the communication. It also represents the corre sponding time required to initiate the transfer of a new result after the communication of a previous result to the parent node is completed. In addition, overhead /3 can be estimated by /3 /3/4 as / f is the total processor overhead for receiving a task, for 3 warding it to a child processor, receiving the corresponding result and forwarding the result to the parent node. Case (ii) The second factor that limits throughput is the CPU overhead in transferring the tasks and results. Since the first worker processor in the farm has to incur an overhead of at least /3 for every task received from the manager and forwarded to a child processor, the overall throughput is limited by 2 Scorn < , (4.26) irrespective of the number of workers in the farm. The communication bound on the throughput of the system is the smaller of the two bounds obtained from equations (4.25) and (4.26). 4.3 Discussion In this section, we discuss how the performance models derived in Section 4.2 can be used in performance tuning. 4.3.1 Optimal N and Topology Our models can be used to determine the optimal N and topology to maximize perfor mance for a given application program. If j 3 < (Tc + J3c), then the intersection of equations (4.15) and (4.25) gives the optimal number of processors to use to maximize throughput. Beyond this optimal Chapter 4. Processor Farm: Design and Modeling 59 number, overall performance of the system does not increase. For this case, ignoring start-up and wind-down, the optimal level of a k-ary processor tree, — _(I3f)] log [i 0 D is given by ( — — rk(—!3f) bogL .2 ) If /3 > (T + j3), then the optimal number of processors is given by the intersection of equations (4.15) and (4.26). For this case, D 0 is given by — log [i — c_k_I3f)] 0 D — ( . log .28) 3000 2500 Comm Iimit2 Steady-state 2000 Throughput 1500 Comm limiti 1000 500 10 20 30 40 50 60 Number of processors, N Figure 4.6: Plot of throughput curves for a linear chain (with Te 13e = 453its) = lOms, /3e Us, 1 2 8 4 For a linear chain of N nodes, steady-state throughput is given by = SN 1[((cf))N] (4.29) In Figure 4.6, we have plotted the three throughput equations (4.25), (4.26) and (4.29) for a linear chain with a set of typical parameter values. As /3 < (T + /3) in this example, optimal N can be obtained from equation (4.27) which gives a value of 20. Chapter 4. Processor Farm: Design and Modeling 60 Equation (4.28) is not applicable for linear chain even if [3 > (T + i3). For this case, As N —÷ oo, throughput reaches the limit l/3. But from Figure 4.6, we can observe that after a certain value of N, the increase in throughput with an increase in the number of processors is very small. This value of N can be determined by iteratively evaluating the throughput for increasing N using equation (4.29). 6000 Throughput Ternary Tree E Number of processors, N Figure 4.7: Comparison of processor farm throughput on linear chain, binary tree and ternary tree configurations In Figure 4.7, we have plotted throughput as a function of the number of nodes for three different topologies: linear chain, binary and ternary tree. As expected, we can observe that for any particular N, the ternary tree configuration gives better throughput than the other two, as long as the system has not reached one of the two communica tion bounds. This shows that with a k-ary tree topology, it is possible to achieve the same throughput with fewer number of nodes. It also shows the dramatic increase in throughput possible by using a binary tree rather a chain. The increase in throughput from binary to ternary tree is much smaller. If the total number of processors available in the system is less than the optimal number and it is not possible to have a complete k-ary tree, it is better to use a topology Chapter 4. Processor Farm: Design and Modeling 61 in which all the levels except the last are complete k-ary with the remaining nodes balanced at the last level. If it is not possible to obtain a k-ary tree due to system configuration restrictions, it is better to use a tree with the next possible larger degree. If the total number of processors available in the system is larger than the optimal number of nodes for a given application program, then one can use multiple k-ary tree topologies to improve performance. This is possible only if there are multiple links from the manager to the worker farm. The manager could be a host workstation or one of the multicomputer nodes. In the first case, the number of k-ary tree topologies one can use is limited by the number of available host links. In the second case, it is possible to have up to k k-ary tree topologies. For both cases, overall throughput of the system is given by the sum of the throughput of each of the individual tree topologies, provided the manager can keep up a continuous flow of tasks to all the worker trees. In practice, throughput can decrease as N exceeds the optimal value. ‘vVe have observed this phenomenon in the validation experiments and the reasons are discussed in Section 5.3. 4.3.2 Problem Scaling Our models can be used to determine how well the speedup scales with problem size. In the processor farm case, an application can be scaled in two different ways. One way to scale the problem is to increase the total number of tasks, M. It follows from equa tion (4.16) that steady-state speedup is independent of M. Thus, scaling the problem size by increasing M does not lead to any increase in the steady-state speedup. However, it may lead to a small increase in the overall speedup because wind-down and start-up now represents a smaller portion of the total execution time. •The second way to scale a problem is to increase the granularity of the tasks by increasing Te. In Figure 4.8, we have plotted steady-state speedup of a linear chain topology for different values of Te. Figure 4.8 shows that steady-state speedup increases with increasing values of Te as long as the system does not reach either of the two communication bounds. Thus, to increase speedup, it is better to increase Te rather than M. Chapter 4. Processor Farm: Design and Modeling 62 4’ Te = 20 ms 3’ Speedup 2 Te=lOms 1 Te=5ms Te 10 20 30 40 50 = 1 ms 60 Number of processors, N Figure 4.8: Measured speedup for processor farm on linear chain 4.3.3 Granularity There are many applications in areas such as nnmerical analysis and image processing in which it is possible to decompose a problem of fixed size in several ways. The com putation requirements of the tasks and the total number of tasks may vary from one case to another. Also, some programs may be easily restructured to produce tasks of different computation requirements. Granularity of the tasks is given by the computa tion requirements of the tasks. Performance models can be used to determine the best granularity to be used for an application to obtain maximum performance. It can be observed from Figure 4.8 that for a fixed N, steady-state speedup increases with an increase in Te. Also, the optimal value of N increases with increasing Te. For a problem of fixed size, increasing the granularity reduces the total number of tasks, M. In addition, it may increase data and result sizes which leads to increased communication costs. Both small M and larger communication costs will add to start-up and wind-down costs. Figure 4.9 shows a graph of speedup, with the effect of start-up and wind-down included, as a function of granularity and N for a processor farm running on a linear Chapter 4. Processor Farm: Design and Modeling 4, 63 40 Speedup 2’ Granularity Number of processors, N Figure 4.9: The affect of granularity on speedup chain for a problem of fixed size. Here, granularity is defined as the ratio of new Te to an original smaller Te. Notice that, up to certain point, speedup increases with granularity, and then starts decreasing. Therefore, one can not use an arbitrarily large granularity, rather the optimal operating point of the system must be calculated as a function of N, Te and the values of the overhead parameters. 4.4 Chapter Summary In designing an efficient processor farm system, many trade-offs have to be considered. In this chapter, we have described the design of Pfarm, detailing the factors that af fect the overall performance of the system and how they have to be addressed in the processor farm design. We have presented a general analytical framework that can be used to determine the performance of a processor farm system on any topology. The Chapter 4. Processor Farm: Design and Modeling 64 interaction between the design and modeling phases have been discussed throughout the chapter. We have outlined how the models can be used in restructuring applications and in determining the optimal number of nodes and topology to be used to maximize performance. In Chapter 5, we experimentally validate the performance models derived in this chapter on a large transputer-based multicomputer. The research results described in this chapter along with the experimental validation presented in the next chapter were published in [SCW92, WSC93]. Chapter 5 Processor Farm: Experiments In this Chapter, we validate the performance models derived in Chapter 4 for processor farms. The models are experimentally validated using a Pfarm implementation on the multicomputer described in Section 3.3. Pfarm was validated using the Logical Systems version of the software. We used a syllthetic workload in all of the validation experiments. The application program consisted of a set of tasks, each of which executed an empty loop. The number of iterations of this loop determines the task execution time Te for the particular exper iment. By running the ioop at high priority, it was possible to determine the number of iterations necessary to produce a task size of 1 ms. Multiples of this value were then used to obtain the different Te’5. In Section 5.1, we describe the experiments conducted to determine the values of the system overhead parameters. In Section 5.2, we validate the performance models for arbitrary tree topologies, and in turn show that Pfarm works on an arbitrary topol ogy. Performance models for balanced tree topologies are validated in Section 5.3. We describe the results of the experiments to test the robustness of our models in Section 5.4. 5.1 Determining System Overheads In order to compare the analytical model with the actual execution, it is first necessary to determine the values of the system overhead parameters, / e and / 3 f As explained in 3 65 Chapter 5. Processor Farm: Experiments 66 Section 4.2, / e is the processor overhead to execute a task locally and / 3 3 is the processor overhead for every task forwarded to a child processor. These overheads depend only on the implementation of Pfarm and are independent of the application program. Also, they do not depend on the underlying topology because the software paths in Pfarm that constitute these overheads are the same for any topology. Therefore, these overheads have to be determined only once for a particular Pfa’rm implementation. In some ap plications, it may be difficult to distinguish between what constitutes the computation time of a task (Te) and the associated overhead (/ e). For these cases, we can measure 3 o (the sum of Te and /3e) and use it in the models. The techniques for measuring the values of o are discussed in Section 8.2. The values of the overhead parameters, /3 and , can be determined by running a few experiments on configurations with one and two worker processors (see Figure 5.1). These experiments were first conducted with the Logical Systems implementation of Pfarm. (a) (b) Figure 5.1: Configurations for determining 13e and /3f For the configuration shown in Figure 5.1(a), the total execution time is given by Ttotai M(Te+13e). Experiments were run on this configuration with Te large value of M 10000. The value of /3e (5.1) 1, 5, 10, 20 and 40 ms and a was obtained by substituting the measured Chapter 5. Processor Farm: Experiments 67 execution time in equation (5.1) for each of the five cases. As expected, for different Te’S, e 3 / remained constant, varying by less than 3is. The average value of 13 e was 482 is. For the configuration shown in Figure 5.1(b), we can express the total execution time in terms of the number of tasks processed 1 (M and forwarded (M ) ) by Workerl. 2 Workerl spends (Te + /3e) for every task it processes, and /3 for every task it forwards. Thus, Ttotai Mi(Te+/3e)+M 1 2 3f. (5.2) The same set of experiments were run on configuration 5.1(b). We measured Ttotai, M 1 and M 2 and solved for / . by substituting these values into equation (5.2). Again, the 3 variation between the values obtained for r 3 1 for different cases was within 3ps. The average value of / us. 3 was 453 1 The same set of experiments were run with the Trollius version of Pfarm, using physical layer communication. For this case, e 3 / was 570 is and / 3 was 1.3 ms. Even though the design is the same for both Logical Systems and Trollius versions of Pfarm, the implementations are slightly different. The overheads are higher in Trollius because of the higher costs of memory allocation and deallocation, and process management. For the validation experiments, the Logical Systems version of Pfarm was used. 5.2 Arbitrary Topologies The analytical framework described in Section 4.2.1 was used to determine the perfor mance of Pfarm on arbitrary topologies. To test the general model, we conducted several experiments on three different breadth-first spanning trees of the 8 x 3 and 8 x 8 mesh topologies. Table 5.1 shows the predicted and measured total execution time for the different breadth first spanning trees (shown in Figure 5.2) of an 8 x 3 mesh. For each case, the total number of tasks is 10000 and time is given in seconds. Note that because of the large number of tasks, steady-state time dominates the execution time, so the experiments are generally testing the accuracy of the steady-state model. Although the distribution of tasks to processors is different for different breadth first spanning trees, Chapter 5. Processor Farm: Experiments BFST1 BFST2 68 BFST3 Figure 5.2: Three breadth-first spauuiug trees of the 8 x 3 mesh. as predicted by Theorem 2, the overall executiou time remaius the same (neglecting the very small experimental errors). Results for two differeut breadth-first spanuiug trees of the 8 x 8 mesh topology (similar to the first two BFSTs of 8 x 3 shown in Figure 5.2) are given in Table 5.2. Again, the total number of tasks used is 10000 and time is in seconds. As Tables 5.1 and 5.2 show, the maximum error between the predicted and measured total execution time is less than 1.5%. For comparison purposes, in Tables 5.1 and 5.2 we have included the results of using a chain rather than a breadth first spanning tree. Note that the speedup obtained by using the spanning trees is significantly higher than that obtained by using the chain. The minimum value of Te used in the experiments is 10 and 30 ms for’ the 8 x 3 and 8 x 8 mesh, respectively. For smaller Te, system throughput reaches the second communication bound given by equation (4.26). This scenario can be easily identified by using the analytical technique described in Section 4.2.1. If, in solving the equations, I. CR I’ —4 C) p CR —4 -C)00 cccc C) CI) Is) C) CR Is) cc Ci) C) CRCRccc- -i 00 C) p-C)c CR (DC ) C) cL H C) Ci) C/) Ci) C/). — —. H z H 00 H I Cl) C) EC) H Ci) BC) cz x 00 Ci) C) Ci) -i C) Ci) C) C) C) C) I-i C C Ci) -i E C) C C) C) C 1 C) Cl) CR CR : 00 Is) CR ) Is) C?)C cc cc © cz C) i— I : i I’ I’ (C—CRcc- oo i cc Eic 00C) c -i C)91-D00)1 00C)ccCi ©00cc 00CiCRt00 :. 00 C) CR CR cccc CR ©ccc.)1 ©©©© x 00 Ci) C) Ci) C) Ci) C) C) C•) C) C C Ci) c) C C 1 C) pppp©© Cl) H — — ——— — C) C) © CR 00 C) - --- :‘ cc — 00 CccCR CR Ci) : H Cl) C) —. — Cl) )- © C) F—i 00 :‘ C cc CR CR —ci — CRCiCi)CR CR t3 00 C) 91C) — cc-4 ccc H - ) - C Cl). — —. H H Cl) E C) H i C)’)--4 I) Is) : --cc00 I C) cz cc -cccCR __i cc C --- - C)C)cccc-C) i4- --------B. ccC) CR CiZ CRC)©00C)- C) 00 P-9191 H- —-- ——— cc oo CR c 0000C)00—ccc - ©©© CX) —4 C) CR C) CZ Chapter 5. Processor Farm: Experiments 70 a processor executes a negative number of tasks, the system is communication bound. In this case, it is better to use a smaller topology. An optimal subtree can be found by removing, one by one, the leaf nodes that are farthest from the root and solving the system of equations until a feasible solution is obtained (i.e., all processors execute a positive number of tasks). The models for start-up and wind-down were validated by running a separate set of experiments on both of the configurations. The start-up model was validated by running a number of experiments with different Te and M and observing the task number of the first task processed by each processor. In all the cases, the task number of the first task executed at a node was same as that predicted by the model given in Section 4.2.1. The wind-down model was validated by running experiments with M = 4N and observing the number of tasks executed by the leaf node in the longest path. For all the cases, this leaf node executed as many or fewer tasks compared to that predicted by the wind-down analysis given in Section 4.2.1. These experiments were performed with a varied Te on both the 8 x 3 and 8 x 8 mesh. 5.3 Balanced Tree Topologies In this section, we validate the performance models derived in Section 4.2.2 for balanced tree topologies. Table 5.3 gives the range of experiments conducted to validate the steady-state, start-up and wind-down models. Model Steady-state Start-up and wind-down Topology Chain Binary tree Ternary tree Chain Binary tree Ternary tree 1 1 1 1 1 1 N to 64 to 63 to 40 to 64 to 63 to 40 M 10000 100000 Te(ms) 1,5,10,20,40 4N 1,5,10,20,40 Table 5.3: Range of processor farm experiments To validate the steady-state model, experiments were run using a large M so that, Chapter 5. Processor Farm: Experiments 71 in comparison start-up and wind-down was negligible. The minimum value of T 6 chosen for these experiments is 1 ms because in order for Pfarm to make use of more than one worker, T must be greater than /S, where /3 453jts. We validated the start-up and wind-down analysis separately by performing experiments with M = 4N. For this value of M, the total execution time consists of only the start-up and wind-down phases since the system never reaches steady state. 5.3.1 Steady-state Validation First, we present the results of the validation experiments in which data and result sizes were small, and (T + /3) < / . This ensured that the communication bound given 3 in equation (4.25) was not reached for any of the experiments. Table 5.4 shows the percentage error between the predicted and measured execution time for a linear chain configuration. Table 5.4 shows that the percentage errors are within 3%. Also, for a fixed 7’s, the total execution time continues to decrease up to a certain value of N. After this point, there is no considerable decrease in the execution time as the throughput approaches the asymptotic communication limit of l//S. For example, for T = 1 and 5 ms, the decrease in execution time is small after 8 and 32 nodes, respectively. Tables 5.5 and 5.6 show the percentage error between the predicted and measured execution time for binary and ternary tree topologies, respectively. From the tables, observe that the percentage error again does not exceed 3% until the system reaches the asymptotic bound, i//3. Unlike in the linear chain case, the measured execution time may begun to increase as N increases after the optimal point. For example, in the ternary tree case, for Te = 1 ms the optimal N is 13 corresponding to the number of nodes in a 3-level tree. However, performance degrades significantly when another processing level is added. For N larger than the optimal number, the total processing capacity of the system exceeds the rate at which tasks can flow into the system, which is i//Sj. But, any demand-driven dynamic scheduler continues to forward tasks to the workers farther down the tree as long as these workers have free buffers and there are unprocessed tasks. Thus, the workers closer to the manager execute fewer tasks since most of the time they are busy forwarding tasks. This in turn, leads to poor utilization cj C —. C) CO C) C) C) C) C)C a C’D — 00 Q) -4 © © CR cc — © —3 © © C t 03 CR : C CR cc —3 :‘ 01 © 03 C —4 cc c —3 —4 3 4 I— CR C) —. C) -t- C) - cc cc c —4 00 CR © i— 00 00 - - c C) - C) - © :• CC :‘ — CC — I © 03 I © © I’ © — Co - c - CO 3 © II C) —. H- Co —. —C) © © © © C © 01 01 00 00 © CR © © cc 00 cc 00 C C) C) d C) I3 CR © © ccCRcc —. © F— 03 -4 CC C cc cc t3 C) —3 03 Cc 03 CR 03 C) © : 00 CC © © —1 -4 — © © — -4 II -Z 03 © C 00 cc - b © 00 ----S t’-D 00 & 01 : cc01c- : 00)— 1 01 CZ © 01 03 © 03 01 —4 CO cc CC —4 01 © c 00 —. C) - C) C) 00 © © I 00 00 CR 00 CR cc cc CR 03 cc CR © —4 cc © 00 03 C —4 01 03 F— CR C © 03 cc 00 CR —1 c CR CC C © © © CR 4- C © © CR 1-,3 © © —4 03 00 00 CR 03 —3 —3 © 3 © :- c © I3 - C © CO 03 03 © 00 03 01 CX CR © CR © © © oI-3 -4 —1 I 01 00 © CR —4 Is © 13 © cc © 01 00 —4 © - - C) —. C) —. - C) C) - C) H C) —- CO CR CO >4II --------— C © C) C) Chapter 5. Processor Farm: Experiments 73 of these workers in terms of the number of tasks locally processed, especially the root worker which only ends up processing the first task it receives. When this phenomenon occurs, the time spent by the root processor to forward all the tasks, except the first one, generally exceeds the total time taken by this processor for the cases in which the processor topology had not reached the asymptotic limit. This causes the total execution time to increase when the size of the processor topology is increased. Even though this phenomenon occurs in the linear chain case, it does not lead to any appreciable increase in the total execution time because the flow of tasks down the topology is smaller. The affect of this phenomenon for tree topologies would becomes even worse when the number of buffers on each processor was increased. N Predicted Exec Time Te10ms Measured Exec Time % Error Predicted Exec Time Te20ms Measured Exec Time %Error 1 3 7 15 31 63 104.881 36.014 15.970 7.753 4.829 4.776 104.868 36.027 15.974 7.742 4.828 4.632 0.012 -0.036 -0.025 0.142 0.021 3.015 204.941 69.369 30.261 14.431 7.158 4.846 204.915 69.370 30.267 14.410 7.138 4.706 0.013 -0.001 -0.020 0.146 0.279 2.889 N 1 3 7 15 31 63 Te4Oms Predicted Measured Exec Time Exec Time 405.061 136.102 58.876 27.820 13.663 6.864 405.029 136.096 58.855 27.788 13.618 6.820 % Error 7.900e-3 4.408e-3 0.036 0.115 0.329 0.641 Table 5.5: Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Binary Tree. In Tables 5.7 and 5.8 we have tabulated the percentage error between the predicted and measured total execution time on linear chain and binary tree configurations for experiments with larger data and result sizes. Both the data and result size used in these experiments are 1000 bytes per task. This leads to a larger communication time Chapter 5. Processor Farm: Experiments 74 N Predicted Exec Time = 1 ms Measured Exec Time % Error 1 4 13 40 14.827 4.811 4.793 4.749 14.808 4.809 4.555 6.016 0.128 0.042 4.966 -26.679 N Predicted Exec Time Measured Exec Time % Error Predicted Exec Time Measured Exec Time %Error 1 4 13 40 104.881 27.116 8.682 4.803 104.868 27.133 8.677 6.324 0.012 -0.063 0.058 -31.668 204.941 52.135 16.383 5.471 204.915 52.134 16.376 5.460 0.013 0.000 0.043 0.201 Te 5 ms Measured Exec Time %Error 54.832 14.633 7.080 6.730 0.035 -0.034 -45.859 -41.001 Te Predicted Exec Time 54.851J 14.628 4.854 4.773 TelOfflS Te=2Oms Table 5.6: Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Ternary Tree. for forwarding tasks and results, and / f < (T + /3). In this case, the rate of task 3 processing will be limited by the communication latency time. The system reaches the communication bound given by equation (4.25) after an optimal value of N. This value of N depends on the computation size of tasks, Te. From the tables, we can observe that for Te = 5 ms, the system reaches its communication bound at N 12 and 15 for linear chain and binary tree respectively. While the system remains in steady-state, the error is within 3%, however, once the communication bound is reached, the error is around 10% and remains almost constant as N increases. In this case, measured execution time does not increase with an increase in N. This is because it takes longer to forward a task to a child node and thus tasks do not get forwarded to the nodes farther from the manager for both chain and tree topologies. The errors obtained when the system is communication bound are larger compared to the steady-state error because of the difficulties in obtaining an accurate value for T. The value of r changes depending on the utilization of the link in both the directions. We have used an optimistic value for r that leads to a slightly larger value for optimal N. This is reasonable as the performance of the system does not decrease in this case even when a larger N is used. Chapter 5. Processor Farm: Experiments 75 N Predicted Exec Time 5 ms Measured Exec Time 1 2 4 8 12 16 24 32 54.845 28.604 15.532 9.092 7.464 7.464 7.464 7.464 55.183 28.873 15.795 9.341 8.327 8.320 8.312 8.301 Te = %Error -0.616 -0.940 -1.693 -2.739 -11.562 -11.468 -11.361 -11.214 Table 5.7: Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Linear Chain under Communication Bound N Predicted Exec Time = 5 ms Measured Exec Time %Error 1 3 7 15 31 54.845 19.347 8.844 7.464 7.464 55.183 19.585 9.009 8.172 8.169 -0.616 -1.230 -1.865 -9.485 -9.445 Te Table 5.8: Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Binary Tree under Communication Bound — i—h CD -CD Ci) CD H — ‘l CD cC) CD CD ‘-C CD CD CD CD CCD p ‘-C CD Ci) CD CD ‘-i-C Ci) cc S— C) CR 00 Cs) P P Cs:) CR CR e p CR 9) 00 9) I- C--i 9) cc Cs) p 01 C) Cs:) I E CR C) cC’ CD -C ‘-C C ‘-C -C CI) CD CD CD ECi) C— 00 C) — cc b )— © cc i cz — - C p 00 C) 1 c ‘-C ‘-C L:Ij - CD CD CD —. - i.1 Li CD C) C) C) cc cc CR CR -4 C) C) CR Cs) —4 Cs) cc , C) Ic-) 229)9)9)9) 00 cc - CD —. -C 0 Li -C ‘- -C CD CD Li CD CD CR -4 CR 90 9) 00 -4 cc c’i CD-C -4 : C) CD 00 CR —4 Ci) CD — CD 9)9)9)9)9)9) H F- 01 —. CD-C CD ‘-C -C C ‘-C :rj ‘-C Cl) CD CR cc Cs C) CCC 1’ 9)1 CD - CD CD CD —:1 Cs:) 90 I Cs:) C) 00 H e P He p p Ci) Cl) cc 9) CD ‘ CD >CD :‘ie <CD CD ‘CD Ho ‘- CD CD ‘-C C ‘-C ‘-C SCD CD - CD CD CD CD 0000 CD Ci) cC’ CD CD cC’ cC’ C)) CD c CD ‘-C cC’ CD — cC’ c ‘-C CD CD CD CI) CD CD — cC’ —CI) CD ‘-C - ‘-C CD cC’ 0 CD ‘-C ‘ cC’ , —. ‘- Ci) o Cl) CD , C , ‘-C cC’ CD . ‘- CD —. cC’ CD C1 CD ‘-C CD ‘-C P CD c+ Ci) ce, cC’ Cs:) C’) CD -• -. CD CD 0 ‘-C CD CD cC’ CD CD ‘-C CD CD CD CD CD cC’ CD ‘-C CD CD Ci) cc — H cC’ C-li c-C ci) CD CD Ci) CR CD 1 CiD Cl’ Chapter 5. Processor Farm: Experiments 77 N Predicted Exec Time TelOms Constant %Error Uniform %Error Predicted Exec Time Te= O 2 ms Constant %Error Uniform %Error 1 2 4 8 16 32 48 64 104.880 53.602 27.986 15.225 9.020 6.047 5.186 4.828 0.010 0.000 0.000 -0.110 0.610 0.496 0.174 -0.249 -0.400 -0.396 -0.382 -0.512 0.455 -0.050 -0.116 1.263 204.941 103.625 52.978 27.677 15.235 9.018 7.019 6.068 0.013 0.000 -0.028 -0.188 0.407 0.044 -0.527 -0.906 -0.453 -0.458 -0.525 -0.755 -0.217 0.011 -1.225 -1.269 Table 5.10: Comparison of Predicted and Measured Total Execution Time for uniform task distribution for Processor Farm running on Linear Chain 5.4 Robustness In all the experiments discussed so far, we have used a constant value for Te. However, in practice, Te may vary from one task to another. In order to test the robustness of using average values for prediction under this condition, we experimented with two common distributions for task sizes: uniform and bimodal. Experimental results are compared with those predicted by the model using the average value for Te. We ran several sets of experiments with uniform distribution of task sizes. For all the experiments, the total number of tasks used was 10,000. Table 5.10 shows the percentage error between the predicted and measured total execution time for two sets of experiments on a linear chain configuration. The task execution time varies from 1 to 19 ms (average Te = 10 ms) for the first set, and from 1 to 40 ms (average Te = 20 ms) for the second set. As Table 5.10 shows, the errors are all within 3%. Several sets of experiments were conducted with bimodal distribution of task sizes. In these experiments M was 10000, and the values of Te were 1, 5, 10, 20 and 40 ms. Here, we describe a set of experiments in which we used 5,000 tasks of 1 ms duration and another 5,000 of 20 ms duration. In the bimodal distribution case, the order of arrival of the tasks into the system also affects the performance. Experiments were conducted with four different arrival patterns: Chapter 5&ocessor Farm: Experiments 78 1. Both 1 ms and 20 ms tasks arrive with equal probability. 2. 20 ms tasks arrive at a probability of 0.75, until all 5,000 of them are processed. 3. 1 ms tasks arrive at a probability of 0.75, until all 5,000 of them are processed. 4. All the 1 ms tasks arrive before any 20 ms tasks. 10 0 Casel Case 2 -a- Case 3 0- Case 4 -0- -- % Error -10 -20 0 20 40 60 80 Number of Processors, N Figure 5.3: Error graph for processor farm on linear chain with tasks of bimodal distri bution In Figure 5.3, we have plotted the percentage error between the predicted and mea sured total execution times for these experiments on a linear chain topology. For predic tion, we have used an average value of T = (1 x 5000 + 20 x 5000)/10000 10.5 ms in the model. As we can observe from the figure, the errors are within 3% for all the four cases, when N is less than 8. For larger N, the prediction is accurate for the first case but the error increases for the other three cases. The maximum error observed varies from around 6.5% in the second case to around 10.25% in the third case, and is highest at around 15.0% for the fourth case. The errors depend on the extent to which the average Te reflects the actual com putation requirements of the tasks. As long as N is smaller than the optimal value corresponding to the smaller Te of the two, the average value works well for all the cases. Chapter 5. Processor Farm: Experiments 79 However, for larger values of N, the average value works well only wheu the two kinds of tasks are well mixed with respect to arrival order, as in the first case with equal probability. Among the other three cases, tasks are mixed for a larger portion of the total execution time in the second case compared to the third case, aud there is no mix at all in the fourth case. In these cases, there is a correspondiug increase in the error, with the largest errors occurring for the fourth case. Obviously the average value is not appropriate for the fourth case since there is no task mixing at all. One can view the execution as occurring in two distinct phases of computation, the first consisting of all the 1 ms tasks and the second with all the 20 ms tasks. For this case, it is better to use the model twice, predicting the execution time for 1 and 20 ms tasks separately and adding them together for the total time. There is a theoretical possibility of finding a distribution of tasks with a particular arrival pattern that could lead to arbitrarily poor performance. This is due mainly to the possible failure of the task scheduling strategy to balance the load among all the processors. However, it is very difficult to come up with this distribution as the system is dynamic and events happen in a nondeterministic order. We believe it to be unlikely for any application program to consist of tasks with such a distribution. 5.5 Chapter Summary In this Chapter, we experimentally validated the performance models derived in Chapter 4 using a Pfarm implementation on a large transputer-based system. Experimentally we showed that on a fixed topology, the performance obtained by Pfarm is the same for any breadth-first spanning tree, as predicted by Theorem 2 in Chapter 4. We also discussed the experiments conducted to determine the values of the system overhead parameters, /e and /l. Chapter 8 describes how Ffarm can be integrated into a programming environment that includes other programming tools such as a graphical interface, mapper and debug ger. We describe the user interface for Pfarm and discuss how the models can be used for performance tuning. Chapter 6 Divide-and-Conquer: Design and Modeling In this chapter, we describe the design of TrEK (Iee xecution kernel) that provides runtime system support for divide-and-conquer applications. TrEK is designed such that it can execute divide-and-conquer computations of any fixed degree and depth on any tree topology. We derive models that accurately describe the behavior and performance characteristics of TrEK or any similar system that satisfies the assumptions outlined. Section 6.1 describes the design and implementation of TrEK. In Section 6.2, models that describe the start-up, steady-state, and wind-down phases of the computation on any tree topology are derived. We use this modeling technique to derive performance models for balanced tree topologies. We close this chapter by discussing how these models can be used in performance tuning and restructuring of application programs. 6.1 TrEK: Design and Implementation As described in Section 3.4, TrEK assumes that there is a flow of tree structured com putations that enter the root processor. Each of these tree structured computation corresponds to a divide-and-conquer task and tasks are assumed to have a known fixed degree and depth. TrEK is a runtime kernel that runs on each worker node. In order to execute an application, TrEK has to be provided with three application dependent functions, split, 80 Chapter 6. Divide-and-Conquer: Design and Modeling 81 join and compute. The split function takes a task as input, splits it and outputs two or more subtasks. The join function takes two or more results as input, joins them and outputs a single result. Finally, the compute function takes a task as input, processes it and returns the result. An example of the task graph corresponding to an instance of a divide and conquer task is shown in Figure 6.1. Figure 6.1: Divide-and-Conquer Task Structure Processor farm can be viewed as a degenerate case of divide-and-conquer. Thus, it is possible to extend the Pfarm design to that of TrEK. In addition to the design issues and goals addressed by Pfarm, TrEK should also be able to execute divide-and-conquer tasks of any fixed degree and depth on any arbitrary processor topology. In this section, we explain the design modifications and extensions for TrEK from that of Pfarm. As in Pfarm case, TrEK is designed as a set of cooperating processes in which each link is controlled by a separate process. Figure 6.2 shows the process structure of TrEK Chapter 6. Divide-and-Conquer: Design and Modeling ——4— 82 — — Result Forwarding — Task Distribution I —I— — — Figure 6.2: TrEK Process Graph on an Intermediate Worker Processor on an intermediate worker node in the processor tree. In the task distribution path, there is an InLink process that receives tasks and a number of OutLink processes that forward subtasks to the children. In the result forwarding path, there are InLink pro cesses that receive results of subtasks from the children, and an OutLink process that forwards results onto the parent. There is a task manager process that controls the task distribution, and there is a result manager process that controls the collection and for warding of results. In addition to these system processes, there are three user processes on each intermediate node. The split process receives tasks from the task manager and calls the split function to split the tasks into two or more subtasks. The join process receives the results of subtasks from the children and calls the join function to com bine the results. There is a local worker process on each leaf processor as well as on intermediate processors that receives tasks from the task manager and processes them Chapter 6. Divide-and-Conquer: Design and Modeling 83 to completion. As explained in the design of Pfarm, to overlap communication with computation, the communication processes and both manager processes execute at high priority, whereas the worker process executes at low priority. As the split process is in the critical path of task distribution, it must also execute at high-priority. The join process is also run at high priority as it decreases the response time which is important if there is any dependency among tasks. In an idealized parallel implementation of divide-and-conquer algorithms on tree processors, such as those discussed in [HZ83, Col89], intermediate processors execute only split and join functions. This leads to an inefficient use of intermediate processors as they idle while waiting for the results. In TrEK, we allow the intermediate processors to do the processing of tasks in addition to executing split and join functions. This is possible only if the application consists of either a flow of divide-and-conquer tasks or a single divide-and-conquer task of large degree. TrEK uses a distributed demand-driven scheduling in which children processors, whenever they have free task buffers, greedily steal subtasks from their parents. When an intermediate processor gets a new task, there are two scheduling choices. The first choice is to split the task and put the subtasks on the output queue from which children processors get their tasks. The second choice is to allocate the task for local processing. A task allocated for local processing is executed until completion by recursively solving subproblems and joining the results as in the case of uniprocessor execution of divideand-conquer. At intermediate processors, priority is given to splitting the task and forwarding the subtasks over allocating it for local processing. The scheduling is demand driven since all the subtasks are stored in a single output queue from which the children with a free task buffer compete for tasks. This task scheduling strategy is similar to the one used in the ZAPP [MS88] system. However, in the case of TrEK, there is an important difference, a processor cannot grab subtasks from its own output queue. As shown in Section 6.2, this restriction allows us to model the system and does not degrade performance. Chapter 6. Divide-and-Conquer: Design and Modeling 84 In TrEK, we initially flood-fill the system with tasks. However, once full, the system is demand-driven with new tasks entering the system as the current tasks are completed. As in the case of Pfarm, the advantage of this scheduling strategy is that it opportunistically takes advantage of varying loads. As in the Pfarm case, the number of additional task buffers to be allocated is an important design issue. On a worker processor, each link process in the task distribution path holds an active task (or subtask) and the local worker process has an active task that is being executed. Aside from these active tasks, there is an additional task buffer in the task manager for the reasons explained in the Pfarm case. In addition, each intermediate processor in TrEK has an output queue that holds the subtasks produced by the split process before forwarding them to the children processors. The output queue consists of as many buffers as the number of subtasks produced from a task. If this output queue was not present, the children processors could wait for subtasks when the parent is busy doing a split. As in Pfarm, we only restrict the number of task buffers, whereas we freely allocate result buffers. As results are also collected, joined and forwarded at high-priority, at any given time, the number of result buffers on a node is small. In order to be topology independent, TrEK should be able to execute divide-andconquer tasks of any degree on any topology. In TrEK, a parent processor does not predetermine the child to which a particular subtask is going to be forwarded. Subtasks produced by splitting a task are put in a single queue, and all the children processors get their tasks from this queue. Thus, the number of children that execute the subtasks of a particular task depends on the load, and it is possible for all the subtasks of a task to be forwarded to the same child. The Result manager on the parent node joins the appropriate subresults of a task. Therefore, it is possible to execute tasks of any degree on any topology. Each TrEK kernel has to be provided with the number of children of the processor on which it runs and the degree of task either at runtime or at compile time. Chapter 6. Divide-and-Conquer: Design and Modeling 6.2 85 Performance Modeling In this section, we derive performance models for application programs running with TrEK or any other system that satisfies the following assumptions. The main charac teristics of the system are: 1. The hardware system is a distributed memory message passing architecture de scribed in Section 3.2 with a linear message cost model. 2. There is a flow of fixed degree divide-and-conquer tasks into the system. 3. The depth of the tasks is equal to or greater than the depth of the underlying processor topology. 4. Tasks originate at a single source and the results are returned to the source. 5. Intermediate processors are also allowed to process the tasks in addition to exe cuting split and join. 6. Tasks are dynamically distributed to the worker processors. We assume that at each step, tasks are split into subtasks of equal computation. With a fixed degree divide-and-conquer task in which at each step, the work is divided into k equal parts, we have W(n) = split(n) + join(ri) + kW(n/k), (6.1) where W(n) is the total amount of work of a task with an input data size of n, split(n) is the work of splitting a task of size m and joiri(ri) is the work of joining the corresponding k subresults. Later in Section 7.3.5, we show experimentally that the models derived with this assumption also work well for applications in which tasks are split into subtasks of unequal computational requirements. Our objective is to find a distribution of the load to all the processors so as to minimize the overall execution time, where load consists of both the computational requirements of the tasks and the associated overheads for splitting, forwarding and Chapter 6. Divide-and-Conquer: Design and Modeling 86 executing them. As in the processor farm case, the system can be either computation bound or communication bound. When it is computation bound, the system acts as a pipeline with three phases to be analyzed: start-up, steady-state and wind-down. In the case of TrEK, start-up phase ends when all the leaf processors have received at least one subtask. At the end of the start-up phase, intermediate processors may or may not have any tasks for local processing, but they will be busy with splitting and forwarding. The definitions of steady-state and wind-down phases are same as that in the processor farm case, and the total execution time is given by, Ttotai (6.2) u+T 8 T +Td 38 First, we derive performance models for the case in which the system is computation bound. In Section 6.2.1, we present a general analytical framework to analyze the steady-state performance on arbitrary tree topologies. We also derive upper bounds for start-up and wind-down costs on arbitrary tree topologies. In Section 6.2.2, we use this analytical approach to derive models for the special case of fixed degree divide-andconquer computations running on balanced tree topologies. We discuss the limits on performance for the communication bound case in Section 6.2.3. 6.2.1 Arbitrary Tree Topologies Let T be {i I pj is a child of p} denote the children of p in T. a tree architecture with processors pi,•• ,PN And let C(i) Let k be the degree of the divide-and-conquer tasks to be processed. Let = Te(j) + e, where Te(i) is the time required for processing a subtask locally by the ith processor and /3e is the associated overhead. Te(i) is given by W(n) in equation (6.1), where n is the input data size of a task that arrives at the ith processor. Let O = (i) + Tj(1) + / 8 T (i) and Tj(i) are the split and join time at the 5 f, where T 3 ith processor, and are given by split(n) and join(ri) respectively. /3 is the associated overhead for every task split and forwarded to the children processors. / f includes all 3 the CPU overheads involved in receiving a task, splitting and forwarding the subtasks to the children processors, receiving the corresponding subresults and joining them, and 87 Chapter 6. Divide-and-Conquer: Design and Modeling forwarding the result to the parent. Unlike in the processor farm case, / f is not a 3 constant because the overheads involved in forwarding subtasks and receiving results is proportional to the number of subtasks. / f can be expressed as 3 f 3 I = f1 3 ! , 2 + k/3f (6.3) where / f 1 and /3f2 are constants. / 3 f includes the overheads required to receive a task 3 from the parent and to send the result back, and thus, it is independent of the number of subtasks. / f 2 is the overhead required for forwarding a subtask to a child processor 3 and to receive the corresponding result. Thus, Oj = (i) + Tj(i) + 5 T i 3 / . 2 + k/3f Steady-state Analysis The steady-state phase begins once all the leaf processors have a subtask to execute and ends when the last task enters the system. It is assumed that no processor will be idle during the steady-state. Leaf processors will be busy processing the subtasks, and the intermediate processors will be busy either splitting the tasks (or subtasks), joining the results or processing the tasks (or subtasks). Suppose that M divide-and-conquer tasks of degree k are executed during this phase and let V denote the number of tasks (or subtasks) that visit pj. Then, we can express steady-state execution time (T ) in 8 terms of the number of tasks executed locally and the number of tasks forwarded along with their associated costs. T 38 is given by the following equation that holds for all the processors in T, 35 T = cj(Vj— y)+8j jEC(i) = > (6.4) jEC(i) Vj+(O-cj) (6.5) jC(i) The factor 1/k appears because of the fact that for every k subtasks that are forwarded to the children, there is only one original task, and split and join are done only once for these k subtasks. This system of equations is similar to that of Pfarm and, for tree topologies, also has a unique solution. The major differences from the processor farm case are, we have cj’s Chapter 6. Divide-and-Conquer: Design and Modeling and O’s 88 that vary from one processor to another unlike in the processor farm case, and the degree (k) of divide-and-conquer tasks appears in these equations. Given M, oj’s and O’s, we can solve for T 5 and Vi to VN. An example of this analysis is shown in Figure 6.3. [ (a) Task Graph ( T 1 a — ai) (i — ) 1 a 2 a 3 a 3 2aaa - 3) (8 - I Q5 — — (b) Architecture Graph 223&1 — ) 3 c c 1 M 2 k 4 o i 123 + 2281&3 + 12 2a k+4 4 a aa 1 a k \ 1 v \ 88 /T V21 ITSS1 I V 3 4 V ) 5 V — - I I I I T 8 T \\Tss) k 1 & 4 2 o — — Figure 6.3: An example of the steady-state analysis This analysis gives the execution time of the steady-state phase in terms of parame ters, whose values can be determined prior to the execution. The total number of tasks (M) and the degree of division (k) are usually known. Te(i),Ts(i) and Tj(i) can be estimated or measured experimentally by the techniques described in Section 8.2.1. The processor overheads e, f 1 and / 3 / f 2 are dependent on the TrEK implementa 3 k 4 3 o 2 a Chapter 6. Divide-and-Conquer: Design and Modeling 89 tion and hardware processor characteristics, but are independent of the application and the hardware topology. Therefore, they need only be determined once for a particular implementation of TrEK. We have experimentally found that on a fixed topology, a breadth-first spanning tree with maximum number of leaves provides maximum performance. The experiments are discussed in Section 7.2. Start-up and Wind-down Analysis As in the processor farm case, start-up and wind-down analysis depend on the underlying topology and the task scheduling strategy. We construct the process graph of the system by replacing each processor by the process structure given in Figure 6.2. Again, we ignore the processes that are not in the task forwarding path. The process graph of the topology given in Figure 6.4(a) is shown in Figure 6.4(b). OutLink o • Manager (a) • Worker (b) (c) Figure 6.4: (a) node graph (b) process graph (c) subtree decomposition Start-up and wind-down time depends on the number of task buffers in the system. Chapter 6. Divide-and-Conquer: Design and Modeling 90 Notice that the tasks in processors at different levels are different in terms of their work. In TrEK, on every intermediate processor, there is one task on each of the following processes: the task receiving InLink, the task manager, and the worker. The split process produces /c subtasks per task, which are put into a single output queue that can hold only k subtasks. Each task forwarding OutLink process has a single buffer to hold an outgoing subtask on that particular link. If we count this subtask towards the particular child processor to which it is getting forwarded, then every intermediate processor has five task buffers. Each leaf processor has four task buffers since the leaves do not have a split process. Thus, the total number of active tasks at any given time is given by where m is the number of processors at the ith level when the levels are numbered 1 to D from the root. In case of k-ary divide-and-conquer tasks running on a k-ary D level balanced tree, the total number of active tasks at any time is 5D — 1. Start-up Start-up begins when the first task enters the system and ends when all the leaf processors have at least received one subtask. As explained in Section 6.1, the scheduling strategy in TrEK gives priority to splitting the task and forwarding the subtasks to children processors over allocating the task for local processing. When the first task enters a processor, the manager passes it on to the split process which splits the task into k subtasks that are kept in a single output queue controlled by the manager. Task forwarding OutLink processes, when free, receive a subtask from the manager process. For analysis purposes, we use the collapsed process graph like the one shown in Figure 6.4(c). Here, we are interested in obtaining an upper bound for the start-up time. In an arbitrary tree, start-up time is given by the maximum time taken for a leaf to receive its first task. First, we will obtain an expression for the task number of the first task received by a processor. Then, we will determine the time required for a leaf processor to receive its first task. The technique used here is similar to that described in Chapter 4 for the start-up analysis for the processor farm case. Given a rooted oriented tree T, with children nodes numbered from left to right Chapter 6. Divide-and-Conquer: Design and Modeling c( v ) p(v) v. starting at one, let p(v). Let of node v v be the child number of node denote the nth ancestor of node 91 with respect to the parent of v, in T and let deg(v) be the down-degree Let k be the degree of the divide-and-conquer tasks. Definition 2 Let d(v) equal Ti 0 fl k deg(p(v)) Lemma 2 For any rooted oriented tree T, the first task received by node v is the (r(i) rc(pn—1(v))1 + k 1 ( d__ ( 2 v)) p — task at node (v), Ti the root of T. p s(v, Proof: Let i) be the number of the ith task received by node v r receives a subtask for every deg(p(v))1 tasks arriving at k Thus, we obtain the following recurrence s(v,i)=s(p(v), +(i-1) E p(p(vv)).. The first task to arrive F1th task that arrives at the parent node at a child node v is the subtask of From then onwards, node v. deg(p(v)) 1 k In general, s(v,i) = s(P1(v), 1 rci k = = ‘ (p2(v) = Ec(Pl(v) 1 k s(PTi(v)E At the root, s(v,i) s(v, + (i - 1)do(P(v))) c(p(v)) k + rrc(p(v))1 1 rc(p(v)) k k + s ((p(v)), (p2(v) s(v,i) Ec(pO(v)) = (F U + Ti (v)) c(p 1 + (i - (v)) + (i 2 do(p (FC1 1+H 1)do(p(v)) _i) do(p(p(v)))) = - 1)dl(P(v))) 72—2 i. Thus, when (v) 72 p k rc(pn—1(v))1 1)dO(p1(v))dO(p2(V))) 2 d ( 2 (p v)) 2 + (i is i) 2 ( d ( v)) p + (i k + j=O L - the root n—2 1) - k — 1 1] 2 7 d ( __ ( v)) p 2 (6.6) 92 Chapter 6. Divide-and-Conquer: Design and Modeling The time required for the task s(v, 1) to arrive at node v is given by (v) 8 T = s(v, 1) (TCd(D) + T (D) + 8 )) + (Td(i) + T 2 (i) + 8 + kf )), 2 + kf where the first part of the equation represents the time after which the root node p(v) sends a subtask to p’(v), and Td(D) and T (D) represent the communication time 8 needed to receive a task and the split time respectively at the root node. The second part of the equation gives the time it takes for the node v to receive its first task after node (i) represent 8 p(v) starts forwarding the corresponding task to its child. Td(i) and T the communication time needed to receive a task and the split time respectively at node p (v). Start-up time for any arbitrary tree topology is given by 8 T — max {T (v)} 8 v a leaf In an arbitrary topology, if the down-degree of every node is less than the degree of divide-and-conquer tasks (k), then s(v, 1) = 1 for every node. For this case, start-up time is determined by the longest path in the topology and is given by T ( i) + [TCd(i) + 8 (i )] 2 + kf (6.7) where n is the length of the longest path. If the topology has nodes with down-degree greater than k, then s(v, 1) has to be evaluated for every leaf node to calculate the start-up cost. For this case, s(v, 1) is proportional to the number of buffers present on each node. If the topology is an unbalanced one, start-up time increases as the number of buffers increases as in the case of Ffarm. Start-up costs for balanced tree topologies are discussed in the next section. Wind-down The wind-down phase begins when the last task enters the system and ends when the last result reaches the manager. In comparison to Pfarm, the wind-down analysis of TrEK is complicated by the fact that the computation requirements of tasks at different levels are different. Tasks at the root processor have maximum computational requirements. Chapter 6. Divide-and-Conquer: Design and Modeling 93 Here, we derive an expression for the wind-down time for an arbitrary topology using an estimate for the number of tasks executed by the root processor. In the following section, we derive an upper bound for the more interesting case, k-ary divide-and-conquer tasks on k-ary balanced tree topology. The total number of tasks at the beginning of the wind-down phase in a tree ar chitecture is given by the number of active tasks at any given time. As derived in the beginning of this section, it is given by mj = mD + (6.8) where D is the number of levels in the topology and m is the total number of processors at ith level. Assume that the processors are all identical and that they all do the same amount of work. Also, we neglect the overhead in forwarding tasks. Given these assumptions, an estimate on the number of tasks executed by the root processor is given by FM M = 1 -where N is the total number of processors. In any tree architecture, the value of this varies from 1 to 4 based on the number of processors. Thus, an estimate for the wind-down time is given by Td 6.2.2 Mi(T(D) + /3e) Balanced Tree Topologies In this section, we analyze the performance of balanced divide and conquer computations on balanced tree topologies using the general framework. We hypothesize that a g-ary balanced tree topology (where g is the number of links on each node) achieves optimal performance for divide-and-conquer applications for the following reasons: 1. Experimentally, we have found that on a fixed topology, a breadth-first spanning tree with maximum number of leaves provides maximum steady-state performance. Chapter 6. Divide-and-Conquer: Design and Modeling 94 As a g-ary balanced tree is a breadth-first spanning tree with maximum number of leaf nodes among all the topologies with the same number of nodes, it provides maximum steady-state performance. 2. As explained in the previous section, start-up cost is proportional to the length of the longest path in the topology. Balanced tree topologies have minimum length longest path among all topologies with the same number of nodes. As a result, by the analysis given in Section 6.2.1, it also minimizes start-up time. 3. Wind-down cost is also proportional to the length of the longest path in the topol ogy. Once again, the minimal path length and the symmetry of the balanced tree also minimizes wind-down time. First, we analyze the steady-state performance of divide-and-conquer tasks of any degree and depth on balanced tree topologies of any degree and depth. Then, we derive corresponding start-up and wind-down costs using the analyses given in the previous section for arbitrary topologies. Steady-state Analysis Consider a flow of 1 level k-ary divide-and-conquer tasks. Let g be the degree and D the number of levels of the balanced tree topology. Notice that the levels are numbered from 1 to D starting from the leaves rather than the root since this simplifies the derivation of the recurrence formula. Assuming that there is no idle time, steady state time (T ) can be expressed in terms 88 of the number of tasks processed and the number of tasks split and forwarded along with their associated costs and overheads. From the general framework, the steady-state execution time is given by equation (6.4), which is reproduced below. 88 T (6.9) = jC(i) jC(i) where V is the number of tasks (or subtasks) that visit a processor at the ith level, and T(i) + /3e and eu = . 2 T(i) + Tj(i) + / f 1 + k/3f 3 Let M be the number of divide-and-conquer tasks processed during the steady-state phase. Let f represent the fraction of the total number of tasks (M) processed by all 95 Chapter 6. Divide-and-Conquer: Design and Modeling the processors at the ith level (assuming that the corresponding splits and the joins for these tasks are executed by the nodes from levels i + 1 to D). Then, the number of subtasks processed by a processor at the ith level is given by (E7) fM, since there are gD_i processors at the ith level, and they have to execute k’ 3 subtasks to finish a single original task. The number of tasks forwarded by a processor at the ith level is given by i—i g D_i)fiM. j=i By substituting the above in equation (6.9), () = () fM+ (fi) MO. Let = j=1 where F represents the fraction of the total number of original tasks (i.e., tasks entering at the root) executed by all the processors in levels 1 through i. Rewriting T 88 in terms of F and (:::) (.) (:::) (::) (F = )Ma 1 F M6 1 F + FM(O - + = By rearranging the above, = 1 F_ ( 0i) + ). (6.10) Let Si MF = -i——, (6.11) Chapter 6. Divide-and-Conquer: Design and Modeling 96 where S represents the throughput of a subtree consisting of the processors from levels 1 to i (again assuming that the corresponding splits and joins were executed at levels i + 1 to d), and S 0 = 0. By substituting (6.11) into (6.10) and solving for Si, we obtain, s ( / ) + 1 k 1 (—:_:T) 9 cE ) (6.12) We are unable to obtain a closed form solution to this recurrence. Therefore, steady-state throughput of a D level balanced tree (SD) is obtained by recursively evaluating the Sj’s up to level D. In evaluating the Si’s, if Si > i+1, 5 then the intermediate processors at levels i + 1 to D can not split and forward the tasks at the rate at which the processors in levels 1 to i can process them. The throughput limit corresponding to this case is given by equation (6.18). Once we obtain the steady-state throughput using equation (6.12), we can derive other performance metrics such as steady-state execution time and speedup. Steadystate execution time for M tasks is given by 88 T — M D Steady-state speedup is given by SFD= MaD QDSD, where c is the execution time plus the associated overhead for each task at the root processor. Start-up Analysis Start-up time for a balanced tree is derived from the analysis given in Section 6.2.1 for arbitrary tree topologies. In the case of a balanced tree topology, start-up cost is given by the time required for the last leaf (the rightmost) to receive its first task. For a D level g-ary balanced tree topology, the task number of the first task received by the last leaf node is given by equation (6.6) with n 1) ri + = D : (Eii — — 1. i) dD_3(p (last)) 2 (6.13) Chapter 6. Divide-and-Conquer: Design and Modeling 97 k, s(last, 1) = 1. For this case, start-up time is given by If g D [TCd(i) (i) + 8 +T )] 2 + kf (6.14) If g> k, s(last, 1) > 1 and start-up time is given by s(last, 1) [TCd(D) + T (D) + 8 (i) + 8 [TCd(i) + T )] 2 + kf + kf )] 2 (6.15) + For example, given a 6-level 4-ary tree topology executing binary divide-and-conquer tasks, s(last, 1) = 2+ 2 (last)) _ (p+ 3 d 2+16+8+4+2 rr32 Wind-down Analysis We derive an upper bound on the wind-down time Td for the case of k-ary divide-andconquer tasks executed on balanced k-ary topologies. The total number of tasks in the system at the start of the wind-down phase (Mmd) is 5D — 1, where D is the number of levels of the topology. We discretize the wind-down phase into a number of steps, where a step is the time to execute a subtask at a leaf in the tree. Note that each leaf has 4 subtasks and all the remaining processors have 5 tasks (or subtasks). An upper bound is obtained by determining the number of steps required for each processor to contain at most one task. Let us derive the number of steps required to reduce the number of tasks at the root to one, the task being executed. Consider a D-level tree (D > 2), which consists of a root and two D — 1 level subtrees. Assume that at the end of a step, tasks at the root are split and forwarded as far as possible towards the leaves. This is an optimistic assumption since when the tasks are not transferred to the leaves, there is less overhead and the load is better balanced. In particular, at the end of the first step, once the Chapter 6. Divide-and-Conquer: Design and Modeling 98 leaves have finished a task, it indirectly reduces the number of tasks at the root by one. At the end of the second step, two tasks at the bottom two levels finish, and only two tasks remain at the root. After the third step, the root contains only one task (the task being executed). Thus, after three steps, there are Ic subtrees each with a maximum of (5(D — 1) — 1)/k tasks (this is an upper bound on the total number of tasks as some tasks at intermediate levels in the tree have been partially processed). Note that the root is still executing a task. By recursively applying the above argument to the roots of each of the Ic subtrees, after 3(D — 2) steps, there remain only 2-level trees. In the trees that remain, the root has 5 tasks and each leaf has 4 tasks. After one step, the leaves finish one task which reduces the number of tasks at the root to 4. At the end of the second step, in addition to the leaves, the root also finishes a task reducing the total number of tasks to two. After three steps, only the leaf processors will be left with four complete tasks each. Thus, after 3D steps, each processor has only one task that it is executing or no task at all. The wind-down time (Td) also depends on the time it takes to execute a single task at the root. Therefore, Td max{(3D + 1)(Te(1) + /3e), (Te(D) + /3€)} (6.16) where Te(1) is the execution time of a subtask at the leaf level and T(D) is the execution time of a subtask at the root. For larger D (D > 5, small /3), the execution time of a single task at the root dominates the wind-down time. 6.2.3 Communication Bounds The performance models derived in Sections 6.2.1 and 6.2.2 are applicable only when the system is computation bound. In this section, we discuss the performance of the system when it is communication bound. In this case, processors may idle as the system never reaches steady-state. There are two factors that may cause the system to be communication bound. Chapter 6. Divide-and-Conquer: Design and Modeling 99 Case (1) As in the processor farm case, overall performance of the system can be bound by the transfer costs whenever the data and result sizes are sufficiently large. In a divide-andconquer task, the data and result sizes generally decrease towards the bottom of the tree. Thus, overall throughput of the system is bound by Scorni = where T = T±’ (6.17) max{Tcd(D),Tcr(D)}. As in the processor farm case, /3 is the processor overhead to receive a task from a parent or to send a result to the parent. As the link processes are identical in both Pfarm and TrEK, the value of / 3 is also the same. Case (ii) In the second case, CPU time in transferring the tasks and results can limit the overall throughput. An intermediate processor at the ith level has to incur a CPU cost of (i) + Tj(i) + 1 8 T 2 for every task that is split and forwarded. In any divide.R + k/3f 3 and-conquer application, the sum of split and join costs is maximum at the root of the computation. Thus, overall throughput of the system is limited by the rate at which the root processor can split and forward the tasks independent of the number of processors. This bound is given by 2 Scorn = 6.3 (D) + (D) + 8 T f1 2 + kf (6.18) Discussion In this section, we discuss how the performance models derived in Section 6.2 can be used for performance tuning. 6.3.1 Optimal N and Topology Performance models can be used to determine the optimal topology and the number of nodes to be used to obtain maximum performance for a given divide-and-conquer application. In Figure 6.6, we plotted throughput as a function In Figure 6.5, we have plotted Chapter 6. Divide-and-Conquer: Design and Modeling 100 Comm limiti 800 600 Throughput 400 200 20 40 60 80 100 120 Number of processors, N Figure 6.5: Plot of throughput curves for Binary Divide-and-Conquer Tasks on Binary Tree the three throughput equations (6.12), (6.17) and (6.18) for a binary tree topology with a set of typical parameter values. The optimal number of processors is given by the intersection of the equations (6.12) and (6.17) or (6.18) which ever leads to the minimum throughput. Beyond this optimal value, there will be no increase in the performance with an increase in N. In Figure 6.6, we plotted throughput as a function of N for balanced binary and ternary tree topologies executing 4-ary divide-and-conquer tasks. As mentioned in the Section 6.1, TrEK is topology independent and hence can be used to run any k-ary divide-and-conquer computation on any topology. As expected, for any particular N, a ternary tree topology achieves better throughput than the binary tree case. In general, it is always better to use a g-ary tree topology, where g is the maximum number of links available on each node. If the number of nodes available is less than the optimal for a given application, and it does not lead to a complete g-ary tree, it is better to use a topology in which all the levels, except the last one, are complete g-ary tree and the remaining nodes are balanced in the last level. Chapter 6. Divide-and-Conquer: Design and Modeling 101 Comm limiti 80 Ternary Tree 60 Throughput 4 Binary Tree Comm Iimit2 2 Number of processors, N Figure 6.6: Comparison of divide-and-conquer throughput on binary tree and ternary tree topologies As in Pfarm case, if the number of nodes available is larger than the optimal number of nodes to be used for a given application, one case use multiple g-ary trees to increase the overall performance. 6.3.2 Problem Scaling As in the case of Pfarm, the most effective way to scale the problem is by increasing the granularity of the tasks. In Figure 6.7, we have plotted the steady-state speedup for a binary tree topology for different values of Te(D). As shown in Figure 6.7, the steady-state speedup increases with increasing values of Te(D) as long as the system does not reach any of the two communication bounds. 6.4 Chapter Summary In this chapter, we have described the design of TrEK, a runtime kernel for executing divide-and-conquer applications. We described how the Pfarm design was modified and Chapter 6. Divide-and-Conquer: Design and Modeling Te =5 ms 100 80 Speedup 102 60 Te = 2 ms Te = 1 ms 40 20 Number of processors, N Figure 6.7: Measured speedup for divide-and-conquer on binary tree extended for TrEK. We developed a general analytical framework that can be used to analyze performance of divide-and-conquer applications using TrEK. This framework was used to derive performance models for fixed degree divide-and-conquer problem on balanced tree topologies. In the next chapter we describe our experimental results. Chapter 7 Divide-and- Conquer: Experiments The performance models derived in Chapter 6 for divide-and-conquer applications were experimentally validated using TrEK implemented in C on Logical Systems environment. The application program used in the validation experiments consists of a set of divideand-conquer tasks with a synthetic workload. As in the Pfarm case, the application program executes empty loops corresponding to split, join and compute functions. The number of iterations of the empty loops determine the values of T , Tj and Te used in a 8 particular experiment. In this chapter, each divide-and-conquer task is represented by the following parameters: degree (k), number of levels (1), base case computation time (i)) and join time (Tj(i)). The base case computation time represents 5 (Te), split time (T the time needed for solving a leaf subtask of a divide-and-conquer task. The experiments for determining the system overhead parameters are described in Section 7.1. In Section 7.2, we validate the performance models for arbitrary tree topolo gies. Performance models for balanced tree topologies are validated in Section 7.3. It is often possible to execute a divide-and-conquer application using processor farm paradigm. In Section 7.4, we compare the performance of Pfarm and TrEK. 103 Chapter 7. Divide-and-Conquer: Experiments 7.1 104 Determining System Overheads For experimental validation purposes, it is necessary to determine the values of the system overhead parameters, / e and / 3 f• As explained in Section 6.2, /3 is the processor 3 overhead to execute a task locally and j’ 3 / is the processor overhead for every task that is split and forwarded to the children, its value depends on the degree of the divideand-conquer tasks. As defined in Chapter 6, / f 3 = , where k is the degree of 12 f 1 + k/3 3 / divide-and-conquer tasks. The overhead parameters, e, f 2 are constants as 3 .f 1 and / 3 / they correspond to the software costs to execute particular parts of the TrEK program. Also, these values do not depend on the topology being used. This is due to the fact that subtasks are put into a single output queue. As a result, the overhead to forward a task is a function of the cost of adding and deleting from a queue, both constant time operations. In addition, values of these overheads do not depend on the characteristics of the application program. However, they are dependent on the implementation of TrEK, and the underlying processor characteristics. Thus, values of these parameters have to be determined only once for a particular implementation of TrEK. Once these values have been determined, it is possible to predict the performance of an application that fits the model. The overheads e, f i and 1 3 / f 2 are determined by conducting several experiments on 3 simple configurations shown in Figure 7.1. The value of / e is determined by solving for / 3 e in the expression 3 Ttotai = M(Te+/3e), (7.1) where Ttotai is the execution time for the configuration shown in Figure 7.1(a). Exper iments were run on configuration 7.1(a) with Te M = = 5, 10, 20 and 40 ms and a large 10000. By using M, Te and measured Ttotai, one can solve for / e. As expected, 3 for different Te’5, / e remained constant, varying by less than 3 s. Changing M had no 3 effect on the value of e 3 / The average value of e 3 / was 560 its. To determine the values of / f 1 and 3 3 / 2, experiments were conducted on the config f urations shown in Figure 7.1(b) and (c). For these configurations, total execution time can be expressed in terms of the number of tasks processed (M ) and forwarded (M 1 ) by 2 Chapter 7. Divide-and- Conquer: Experiments 105 Manager Worker (a) (c) (b) Figure 7.1: Configurations for determining /3e and / f 3 Workerl. Workerl spends Te+/3e for every task locally processed, and T+Tj+/3fl+k/3f 2 for every task that is split and forwarded. On the configurations in Figure 7.1(b) and (c), experiments were run with divide-and-conquer tasks of degree (k) 2 and 3, respec tively, with the number of levels (ci) equal to 2. By choosing a sufficiently large M, we can neglect the effect of start-up and wind-down. For sufficiently large M, the total execution time for configuration in Figure 7.1(b) is given by Ttotai (T€ + i3) + 8 1 M (T + Tj + 2 M f1 + f2), 3 / 2 (7.2) f2) ! 3 (7.3) and for configuration in Figure 7.1(c), it is given by Ttotai Mi(Te + 3) + M (T + Tj + 3 2 !1 + f For each configuration, we ran several experiments with base case Te equal to 5, 10, 20 and 40 ms. In all experiments, the split and join time was held constant at 1 ms, and M 10000. For each experiment, we measured the Ttotai, and M 1 and M . These 2 values were then used in equations (7.2) and (7.3) to obtain the values of /3f and / f2 3 The variation observed between the values obtained for / f 1 and / 3 f 2 for different values 3 of Te was within 5 ps. The average values obtained for /3f 1 and /f 2 were 520 s and 420 is, respectively. Chapter 7. Divide-and- Conquer: Experiments 7.2 106 Arbitrary Topologies In this section, we validate the general analytical framework described in Section 6.2.1 and show, experimentally, that on a fixed topology, a breadth-first spanning tree (BFST) with maximum number of leaves outperforms other BFSTs. As described in Chapter 6, by splitting a task, we can increase the amount of parallelism and make better use of the available parallelism in the hardware, but every split increases the total amount of work because of its associated overhead. In the case of a flow of divide-and-conquer tasks, ignoring start-up, it is better to reduce the number of splits since the application consists of a number of divide-and-conquer tasks. Thus, on a fixed topology, a BFST that does the minimum number of splits obtains the best performance since the overall overhead in this case is smaller compared to other BFSTs. Since splits have to occur at internal nodes, a BFST with a minimum number of internal nodes or maximum number of leaves does minimum number of splits. Experiments were conducted on three different breadth-first spanning trees (shown in Figure 7.2) of an 8 x 3 mesh topology. Te 0.001 0.002 0.003 0.004 0.005 Measured Total Execution Time BFST1 BFST2 BFST3 (3 leaves) (16 leaves) (9 leaves) 101.675 130.038 188.199 208.821 229.898 70.922 92.097 113.763 135.700 157.455 90.889 120.634 159.948 186.049 211.225 Table 7.1: Performance Comparison of three different BFSTs of the 8 x 3 mesh. Table 7.1 shows the measured execution time with the percentage error from the corresponding predicted execution time for three breadth-first spanning trees of the 8 x 3 mesh. In the table, times are given in seconds. For each case, a total of 1000 binary divide-and-conquer tasks with 10 levels were used. The value of T shown in the table is the base case computation time. A value of 1 ms was used for split and join costs at each level. Tasks of 10 levels were chosen for these experiments because the number of levels of tasks has to equal or exceed the number of levels of the topology, Chapter 7. Divide-and- Conquer: Experiments BFST1 BFST2 107 BFST3 Figure 7.2: Three breadth-first spanning trees of the 8 x 3 mesh. 9 in this example. In order to achieve steady state, the computation time of a subtask at any node has to be greater than the sum of split and join times at the parent node plus the associated overhead / f. Therefore, for the values of split and join times chosen 3 in these experiments, base case Te must be at least 1 ms. As Table 7.1 shows, the measured execution time for BFST2 is small compared to the other two BFSTs for all cases. Experimentally, this supports our claim that on a fixed topology, a BFST with maximum number of leaves provides better performance compared to other BFSTs. 7.3 Balanced Tree Topologies In this section, we experimentally validate the performance models for TrEK, derived in Section 6.2.2 for balanced tree topologies. The experiments were conducted to test the models for steady-state, start-up and wind-down for k-ary divide-and-conquer tasks on g-ary balanced tree topology, variable split and join costs, and communication bound. The parameter values were chosen to satisfy the following conditions: Chapter 7. Divide-and-Conquer: Experiments 108 1. the number of levels of tasks has to be equal to or greater than the number of levels of the topology. 2. the computation time of a subtask at any node has to be greater than the sum of split and join time at the parent node plus the associated overhead /3. 7.3.1 Steady-State Table 7.2 gives the range of experiments conducted to validate the steady-state model. Steady state performance models were validated by experiments with a sufficiently large Model Topology N M Steady-state Binary tree 1 to 63 10000 Ternary tree 1 to 40 10000 — — ki 2 6 7 3 4 5 Tasks Te TTj 5,10,20 1 1 1,2,5 1 1 5,10,20 1 1 1,2,5 1 1 Table 7.2: Range of Divide-and-Conquer steady-state Experiments M(10000) so that start-up and wind-down time can be ignored. For all experiments, a value of 1 ms was used for both T and Tj at each level. Experiments with variable split and join costs are described separately. A value of 1 ms was chosen for both split and join costs as larger values either limit the overall throughput (also discussed later) or require very large divide-and-conquer tasks to be in steady-state. Tables 7.3 and 7.4 show the percentage difference in predicted and measured exe cution time for binary and ternary tree cases respectively. In these experiments, divide and-collquer tasks of 7 and 4 levels were used on binary and ternary tree respectively. As can be observed from the tables, the errors are within 7%. 7.3.2 Start-up and Wind-down In the case of k-ary divide-and-conquer tasks running on k-ary balanced tree topologies, there will be 5D — 1 tasks in the system at any given time, where D is the number of levels of the hardware topology. Start-up and wind-down models were validated with CR CO CR C CD C CD C Cl) -1 -4 00 CO 00 C) CO I -4 cc -4 C) II Cl) C CDCO CD - ._] > ECD CD CD >< CD C-) C C CD C C) Cp C C)) C C CR CJ) CD C-) CD cc -, C-) CD CR cc Cl) II Cl) CD CD CO CD CD CD CD CR -4 CR CR CD C-) CD I CO CD Cl) CD bi CR I C) I © I I I COcc—3 00 00 -cc izccc -COi c-z 00 CR 00 CR —ccc.1 CR 00 O CR 00 CR ccCRC-c CO CO cc I C) C) : - I CO -4 O I © I cc ccccc CR I C) I- © I- -CO00CR ccC)CR 00 C) c 00 C)CR -CO 00CRC)1 CO 00 .D -4 CO CO CO I I cc cp —4 CO CO--4CO CCOI— CO I’ CR CD CD CD C) C - CD C ECD CD Chapter 7. Divide-and-Conquer: Experiments M 5D — 110 1 so that the system never reaches steady-state and the total execution time consists only of start-up and wind-down. Tables 7.5 and 7.6 show the percentage error between the predicted and measured total execution time for these experiments. In these experiments, the values used for the base case computation (Te) were 10 and 20 ms. The errors observed are large, especially for larger topologies because the predicted wind-down costs are upper bounds. We have calculated the wind-down cost based on the number of tasks that are predicted to have been executed on the root processor. For larger topologies, the model uses an upper bound of two tasks on root processor. If in the actual execution, the root processor executes only one task, then the error can be as large as 40%, because there are only a few large tasks and each task contributes considerably to the total execution time. N 3 7 15 31 63 T10ms Upper Bound Measured Exec Time Exec Time 1.546 1.171 0.795 0.803 0.812 1.338 0.954 0.573 0.461 0.477 % Error 13.45 18.53 27.92 42.59 41.25 Te=2Oms Upper Bound Measured Exec Time Exec Time 2.826 2.132 1.435 1.444 1.452 2.459 1.755 1.053 0.781 0.781 %Error 12.99 17.68 26.62 46.20 46.21 Table 7.5: Start-up and Wind-down Performance Comparison for Divide-and-Conquer running on Binary Tree. N 4 13 40 TelOms Upper Bound Measured Exec Time Exec Time 1.201 0.913 0.634 0.793 0.635 0.365 % Error 33.97 30.45 42.43 Te=r20ms Upper Bound Measured Exec Time Exec Time 1.717 1.163 1.174 1.513 0.695 0.635 %Error 11.88 40.24 45.91 Table 7.6: Start-up and Wind-down Performance Comparison for Divide-and-Conquer running on Ternary Tree. As the divide-and-conquer tasks are generally large, it is important to validate the models for the cases in which the total number of tasks is not very large. Thus, we Chapter 7. Divide-and-Conquer: Experiments conducted several experiments with M 111 1000. Tables 7.7 and 7.8 show the percentage error between the predicted and measured total execution time for these experiments on binary and ternary tree topologies. As Tables 7.7 and 7.8 show, the errors are within 7%. N Predicted Exec Time 5 ms Measured Exec Time 1 3 7 15 31 63 222.674 74.728 32.576 15.641 8.165 4.708 222.694 74.788 32.436 15.581 7.911 4.693 Te Predicted Exec Time 10 ms Measured Exec Time %Error 382.754 128.013 55.236 26.406 13.487 7.407 382.786 128.325 55.358 26.193 13.086 7.372 0.000 -0.144 -0.221 0.806 2.973 0.472 Te = % Error 0.000 -0.080 0.429 0.384 3.110 0.319 = Table 7.7: Comparison of Predicted and Measured Total Execution Time for Divideand-Conquer running on Binary Tree with M 1000. N Predicted Exec Time Te=lms Measured Exec Time % Error Predicted Exec Time TeSms Measured Exec Time %Error 1 4 13 40 161.640 41.053 13.294 4.996 161.662 41.003 13.308 4.990 -0.013 0.122 -0.105 0.120 296.708 74.883 23.779 8.365 296.740 74.758 24.811 8.613 -0.010 0.167 -4.340 -2.965 Table 7.8: Comparison of Predicted and Measured Total Execution Time for Divide and-Conquer running on Ternary Tree with M =1000. 7.3.3 k-ary Tasks on g-ary Balanced Topologies The experiments discussed in the previous sections tested binary and ternary divide and-conquer tasks which exactly match the underlying topologies. In order to validate the models for cases in which the task structures does not match the underlying topolo gies, we conducted experiments in which binary divide-and-conquer tasks were run on ternary tree topologies. Table 7.9 shows the percentage error between the predicted and Chapter 7. Divide-and-Conquer: Experiments 112 measured total execution time for these experiments. The errors are within 7%. N Predicted Exec Time TelOms Measured Exec Time % Error 4 13 40 96.351 30.165 10.458 96.292 31.422 10.084 0.061 -4.167 3.576 Te=2Oms Predicted Measured Exec Time Exec Time 176.630 55.086 18.945 176.436 56.077 19.678 %Error 0.110 -1.780 -3.869 Table 7.9: Comparison of Predicted and Measured Total Execution Time for Binary Divide-and-Conquer tasks running on Ternary Tree We claim in Chapter 6 that the models can also be used for single divide-and-conquer problem with large degree and depth. To substantiate this claim, several experiments were run, each with a single large divide-and-conquer task. In the first example, a 5-ary 9-level divide-and-conquer task with base case com putation time of 5 ms was run on a 40-node ternary tree. For prediction purposes, we evaluated the throughput excluding the root worker because in TrEK, with a single task, the root worker does not process any subtasks as it splits the task and forwards all the subtasks to its children. The model predicted a total execution time of 275.581 seconds whereas the TrEK execution took 316.478 seconds. As a second example, a 6-ary 8-level divide-and-conquer task with base case computation time of 5 ms was run on a 40-node ternary tree. For this case, the model predicted a total execution time of 232.683 ms, and the TrEK execution took 252.117 ms. The errors are larger than 7% because the processors closer to the manager (especially the root) can idle as the number of subtasks arriving at these processors is small. 7.3.4 Variable Split and Join Costs In all the previous experiments, split and join costs were kept constant at all levels of the computation. We conducted several experiments to verify the validity of the models for the cases in which the split and join costs are different at different levels of computation. Results of two sets of experiments are tabulated in Tables 7.10. These experiments were conducted with 1000 binary divide-and-conquer tasks of 6 levels with base case Chapter 7. Divide-and- Conquer: Experiments 113 computation of 10 ms on a binary tree. In the first set of experiments, a variable split cost was used where as the join cost was kept constant at all levels. The split cost was varied from 1 to 5 ms, and a value of 1 ms was used for the join cost. The second set of experiments were conducted with variable join costs (varying from 0.5 to 2.5 ms) with a constant value of 1 ms for split cost. Both the split and join costs were varied in the third set of experiments. Once again, the errors observed in these experiments are within 79bo. N 1 3 7 15 31 63 Variable Split & Constant Join Predicted Measured % Error Exec Time Exec Time 408.764 136.688 58.962 28.185 13.989 7.852 408.923 136.980 59.107 28.011 14.020 8.250 N 1 3 7 15 31 63 -0.038 -0.217 -0.246 0.617 -0.222 -5.069 Constant Split & Variable Join Predicted Measured %Error Exec Time Exec Time 369.745 123.679 53.381 25.550 12.708 6.830 369.889 124.019 53.587 25.483 12.844 7.151 -0.038 -0.275 -0.386 0.262 -1.070 -4.700 Variable Split & Join Predicted Measured %Error Exec Time Exec Time 434.777 145.363 62.688 30.014 14.939 11.363 434.854 145.736 62.851 29.774 15.129 11.797 -0.018 -0.257 -0.260 0.800 -1.272 -3.819 Table 7.10: Comparison of Predicted and Measured Total Execution Time for Divide and-Conquer Tasks with Variable Split & Join Costs When split and join costs are large, the system performance is bound by the rate at which the root processor can split the tasks and join the results and throughput is given by equation (6.18). Several experiments were conducted to test the performance of the system for this case. Table 7.11 shows the predicted and measured execution time for a set of experiments in which 1000 binary divide-and-conquer tasks of 6 levels were run Chapter 7. Divide-and- Conquer: Experiments 114 on binary trees. The split and join costs were varied from 1 to 5 ms, with base case computation being 10 ms. N 1 3 7 15 31 63 Variable Split & Join Predicted Measured % Error Exec Time Exec Time 548.834 183.388 79.045 37.815 21.370 21.370 548.930 183.769 79.213 37.434 24.534 27.656 -0.017 -0.208 -0.216 1.008 -14.806 -29.415 Table 7.11: Comparison of Predicted and Measured Total Execution Time for Binary Divide-and-Conquer Under Split and Join Bound Table 7.11 shows that the system performance reaches the limit for a 31 node bi nary tree topology. For larger topologies, system performance degrades as the dynamic demand-driven scheduling strategy keeps forwarding tasks towards the leaf nodes caus ing the nodes closer to the manager to idle. This phenomenon is similar to that observed in the processor farm case and shows the importance of determining the right number of nodes and topology to be used for a given application. 7.3.5 Robustness In all the experiments discussed above, it is assumed that the tasks are split into subtasks that take equal amount of computational time. In general, tasks may not always be divided into equal sized tasks. To test the robustness of using the average value of Te, we experimented with tasks which were split into k randomly unequal subtasks. Table 7.12 show the percentage error between the predicted and measured total execution time for experiments in which binary divide-and-conquer tasks were run on binary tree topologies. Two sets of experiments were conducted with tasks in which the sum of all the base case computations per task were 256 and 512 ms. It was assumed that the split and join costs are proportional to the computational requirements of the task. The percentage Chapter 7. Divide-and- Conquer: Experiments 115 errors in this case are slightly larger, but are still around 10%. The measured execution time is always larger than the predicted execution time because some of the subtasks are smaller than the overhead required to forward them, causing additional work. N Predicted Exec Time Te256ms Measured Exec Time % Error Predicted Exec Time T=512ms Measured Exec Time %Error 1 3 7 15 31 63 297.668 99.658 43.091 20.713 10.391 5.655 297.696 103.279 45.878 22.156 11.162 6.217 -0.009 -3.634 -6.465 -6.966 -7.420 -9.938 605.037 202.149 87.061 41.488 20.420 10.747 605.123 209.926 92.855 45.067 22.458 11.834 -0.014 -3.847 -6.655 -8.626 -9.980 -10.114 Table 7.12: Comparison of Predicted and Measured Total Execution Time for Binary Divide-and-Conquer Tasks with Subtasks of Unequal Size 7.4 Comparison of Divide-and-Conquer with Processor Farm As both divide-and-conquer and processor farm are task-oriented paradigms, it is pos sible to execute divide-and-conquer applications using the processor farm paradigm. Performance of divide-and-conquer applications executed with Pfarm compared to that using TrEK depends on the values of the application parameters such as the execution time per task and the total number of tasks. In the following paragraphs, we compare the performance of TrEK and Pfarm for various values of the parameters. In Table 7.13, we have tabulated the measured total execution time for binary divide-and-conquer ap plications executed with TrEK and Pfarm on a binary tree topology. Divide-and-Conquer strategy performs better compared to processor farm for appli cations with larger computation time per task. Computation time per task includes all the split and join times in addition to the time required for processing all the subtasks. As shown in Table 7.13, the total execution time taken by TrEK is smaller than that taken by Pfarm when the computation time per task is greater than 0.766 seconds. This cut-off value depends on the values of the various parameters of the system. This can Chapter 7. Divide-and-Conquer: Experiments Te (D) 7.169 3.072 1.535 0.766 0.190 0.126 Mz== 1000 TrEK Pfarm 117.980 49.792 25.175 13.251 4.541 4.769 136.315 58.425 29.204 14.585 3.634 2.418 116 M100 TrEK Pfarm 14.670 6.475 3.403 1.863 0.632 0.470 21.531 9.233 4.620 2.311 0.582 0.390 Table 7.13: Comparison of Total Execution Time for Binary Divide-and-Conquer Ap plications with TrEK and Pfarm be obtained by using the models to calculate and compare the performance of the ap plication using Pfarm and TrEK. The percentage difference in the total execution time is large when the M is small (see Table 7.13, M 100). Processor farm takes longer for these cases because the wind-down phase is longer in Pfarm compared to that of TrEK. In Pfarm, the wind-down phase begins when there are 4N tasks left in the sys tem compared to only 5D — 1 in TrEK, where N is the total number of processors and D is the number of levels of the hardware topology. As the affect of wind-down becomes considerable for applications with fewer tasks, the difference in the total execution time between Pfarm and TrEK increases. From the table, it is evident that Pfarm works better for applications in which the computation time per task is smaller (less than 0.190 seconds). This is because the overheads involved in splitting and joining in TrEK (overheads not present in Pfarm) become considerable compared to the execution time per task leading to a larger total execution time compared to that of Pfarm. Also, TrEK can not be used for applications with small computation time per task (e.g., smaller than 0.126 seconds for a 64-node binary tree topology). This is because the number of levels of the tasks should be greater than the number of levels of the topology, and the base case computation should be greater than the overhead 1 f to make use of all the processors effectively. 3 The only way to use Pfarm for executing a single divide-and-conquer problem is for the manager to split the task to obtain a sufficiently large number of independent Chapter 7. Divide-and-Conquer: Experiments 117 subtasks. However, since the manager must do a large number of splits and joins, it can quite easily become the bottleneck. In contrast, since in TrEK, the internal nodes do some of the splitting and joilling, it is not necessary for the manager to perform as many splits. Note that the manager still must do some splits since otherwise the nodes near the root may idle. 7.5 Chapter Summary Performance models for divide-and-conquer applications derived in Chapter 6 have been experimentally validated using TrEK implementation on a 75-node transputer-based multicomputer. We have described the experiments conducted to determine the values of the system overhead parameters, e, 3 / f 1 and 3 / f2. 3 / For a fixed topology, we have exper imentally shown that it is better to use a breadth-first spanning tree with a maximum number of leaves. We have validated the models for balanced tree topologies with a large number of experiments varying the values of all the parameters that affect overall performance. As processor farm strategy can be used for some divide-and-conquer ap plications, we have discussed and compared the performance of using TrEK and Pfarm for various cases. Chapter 8 System Integration and Applications In order to make programming and performance tuning easier, users have to be provided with an integrated environment that includes tools that support all phases of program development and execution, in addition to runtime systems such as Pfarm and TrEK. In Section 8.1, we briefly describe Parsec, an integrated programming environment that provides Pfarm and TrEK with supporting tools such as a graphical interface, mapper, loader and debugger on our transputer-based system. We discuss how this integrated en vironment supports reusability, reconfigurability and performance tuning. In section 8.2, we discuss the techniques that can be used to obtain the values of application dependent parameters in order to use the models for performance tuning. Finally, we describe two applications that have been developed using Pfarm and TrEK. 8.1 Parsec: An Integrated Programming Environment Pfarm and TrEK provide programming templates to efficiently execute applications that fit into processor farm and divide-and-conquer paradigms, respectively. In order to make it easier for application programmers to use these templates on a multicomputer system, it is important to provide a programming environment that supports all phases of program development and execution. Such a programming environment should not only have a variety of tools that help programmers in developing, executing, debugging 118 Chapter 8. System Integration and Applications 119 and tuning a parallel program, but must support their cooperative functioning through close integration. The following facilities should be present in an integrated programming environment to effectively support Pfarm and TrEK on a multicomputer: • an interface that hides both the system hardware and software complexities, • support for reusability, • support for easy reconfigurability of the system, • support for loading and executing of programs, • performance monitoring and tuning facilities based on the performance models, and • program debugging tools. Initial work in developing an integrated programming environment that addresses the above requirements on a large transputer-based system has been reported in [CGJ91, FSWC92]. Farsec is an on-going project at UBC to support creating tem plates or applications, and includes tools for building, mapping and loading the program onto the system. Ffarm and TrEK have influenced the design of Parsec. In order to make performance tuning easier for applications using Ffarm or TrEK, Parsec supports parameterized process graphs. A parameterized process graph is a family of intercon nection networks with one or more parameters that control structural properties. In addition, Parsec allows users to change these parameters in an easier way. Thus, a user can easily run an application on its optimal N and topology, once they are determined from the models. In Parsec, a “template implementor” describes a template in terms of a parameterized process structure which is then turned into a system module. Users of a template do not have to understand the details of its implementation. They simply instantiate a copy of the template and provide any necessary parameters and code. Parsec creates all the files (makefiles, configuration files, and load scripts) necessary for running the application. Chapter 8. System Integration and Applications 120 Pfarm and TrEK templates have been incorporated into Parsec and they make use of the parameterized graph structure to simplify scaling and restructuring of the system. Within this programming environment, currently oniy Trollius is available to the pro grammers. The following discussion focuses on how the various tools in the environment support programming of applications that use the Pfarm template under Parsec. In Parsec, programmers are provided with an easy-to-use graphical interface to Pfarm and TrEK, and to the system in general. The interface, developed by Feldcamp [FW93] is an X windows application utilizing the OpenLook GUI. Figure 8.1 shows the graph ical interface provided to application programmers when Pfarm template is selected. Programmers can easily modify the template by including the files that contain the application dependent code. Parsec supports system reconfigurability in an easier way through the graphical interface. The user can change the parameters (such as degree and depth) that define the topology to be used for executing an application with Pfarm. Then, the programmer can build the object files needed to execute the application by us ing the makefiles generated by Parsec. These makefiles remove the concerns of choosing the right compiler and libraries from the user. Users can easily include any additional libraries, if necessary. To execute a Pfarm application, two different object codes have to be built, one for the manager and another for all the workers. After choosing the topology to be used, the user must map this topology onto the 75node transputer based hardware system. The mapping tool [Mu193] inputs a description of the hardware, processors and crossbars, and outputs a crossbar setting, a process to processor assignment, and a configuration file. The mapping tool uses a greedy algorithm to do the mapping. In the case of Pfarm and TrEK, a different notion of mapping is needed. Pfarm and TrEK need one-to-one mapping of workers on to the processors, without any dilation. Also, on a fixed interconnection network, the mapper should be able to map the workers onto a breadth-first spanning tree. Parsec includes a loader tool {Mu193] that builds the network configuration file based on the mapping obtained by the mapping tool. In addition, the loader generates a script that is used to execute the application program. This script includes the Trollius com mands to boot the network and to load the appropriate programs onto the transputer Chapter 8. System Integration and Applications Interface 121 [: Parameters r Figure 8.1: Graphical Interface to Pfarm in Parsec Dc re e Chapter 8. System Integration and Applications 122 nodes. Users execute the application program by running this load script. The loader al lows users to choose either the network or physical level of communication. The network level is slower compared to the physical level, but unlike the physical level, users are able to print from any node and to monitor the state of the processes on each node. This allows users to choose network level during the program development and debugging phases, and then use physical level to obtain better performance. 8.2 Performance Tuning In this section, we describe how the programmer can use models for performance pre diction and tuning. 8.2.1 Parameter Measurements In order to use the models for performance prediction and tuning, one has to determine the input parameters to the model. The user must supply the values for all the param eters other than the system overhead parameters (/3e and /‘ f in case of Pfarm, and e, 3 f 1 and / 3 / f 2 in case of TrEK). The values of these overhead parameters are obtained 3 once using the techniques described in Sections 5.1 and 7.1 for Pfarm and TrEK respec tively. Here, we explain the techniques that are useful in determining the values of the application dependent parameters. Processor Farm In the processor farm case, the application dependent parameters that affect the overall performance are: the average execution time per task(Te), the total number of tasks (M), the data size (d) and result size (r) per task. An application programmer generally knows the values of M, d and r for an implementation, otherwise these values can be easily obtained. Obtaining the value of Te is not so straightforward as it depends on the nature of the application program in addition to the implementation. Here, we briefly describe the different techniques that can be used to obtain Te. Chapter 8. System Integration and Applications 123 1. In the case of application programs that consist of tasks, each of which require the same amount of computation, the easiest way to measure Te is to scale down the program to a single task or a small number and execute it with Pfa’rm on a single worker node. Then, Te can be obtained from the expression Ttotai = M(Te + ISe) using the measured total execution time and the value of ,6. This technique can also be used for application programs that consist of tasks with varied computation requirements. In this case, one can determine average Te by finding 72 for a representative set of tasks. If it is difficult to choose right set of tasks, then, the third technique can be used. If the application program consists of multiple phases, where all the tasks in a phase belong to the same type, then as explained in the robustness section 5.4, it is necessary to calculate a separate Te for each phase. 2. If there is a direct relationship between the input data size of a task and its computation requirement, then, by determining Te for a certain data size, one can estimate T for a new data size based on the relation. 3. If Te cannot be obtained by either of the previous techniques, then it can be determined by executing Pfarm on a smaller number of processors and using the model (with known N and T) to calculate If the performance obtained is not equal to that of the communication bounds, then, models can be used to obtain the average value of T by plugging in the total number of tasks and the measured total execution time. Divide-and- Conquer In the case of TrEK, the application dependent parameters that affect the performance are: the execution, split and join times (T(i), T (i), andTj(i)) at each level of the divide 3 and-conquer task, the total number of tasks (M), the data size (d), and the result size (r) per task. As in the Pfarm case, an application programmer generally knows the values of M, d and r for an implementation. Obtaining the values of the execution, split and join times is not so straightforward. Chapter 8. System Integration and Applications 124 As explained in Section 6.2, the computational requirement of a fixed degree divideand-conquer task (or subtask) with an input data size of n can be expressed as W(n) split(ri) + join(n) + kW(n/k), (8.1) where split(m) is the splitting cost for a task with size n, join(n) is the joining cost to produce a result of size n and k is the degree of the divide-and-conquer tasks to be processed. Te(i),Ts(i) and Tj(i) are given by W(n), split(m) and join(n), respectively, depending on the data size (n) of the tasks at the ith level. Thus, in order to use the performance models, one has to determine W(n), split(ri) and join(n) for the given application program. In general, W(n) can be expressed by the time complexity of the algorithm such as (9(nlogm) or O(m 2 logn). Thus, we have to find the constants underlying these time complexities to get the value of W(m), the time needed to execute the task on a processor. This can be determined by executing a scaled down version of the problem as in the first technique outlined for measuring Te in the processor farm case. This experiment can be repeated with few different input sizes to verify the values of the constants. Similar measurements can be used to obtain the values for the constants to be used for split and join times. 8.2.2 Performance Analysis Library To make it easier to use the models for prediction and tuning, application programmers are provided with a set of performance analysis library functions for both Pfarm and TrEK. These functions accept the values of application dependent parameters and output the predicted performance metrics such as throughput, speedup and total execution time. 8.3 User Interface Pfarm and TrEK were designed to hide the underlying complexities of the multicomputer system from the user. All the system dependent code is in the execution kernels and the user has to concentrate oniy on the application dependent code. The kernel can be used for application programs that fit the corresponding parallel programming paradigms. Chapter 8. System Integration and Applications 125 Both Pfarm and TrEK are run time kernels where the user code is linked with the system code to produce a single executable object for each processor node. In the manager, the user invokes the system routines to submit tasks and receive back results. In the workers, program control lies within the system code and the system invokes the user code at the appropriate times. 8.3.1 Pfarm In the case of Pfarm, there are two different executables, one for the manager node and the other for the worker nodes. The user part of the manager code consists of the following functions: 1. master_mit 0 - The system code calls this function at the beginning of the ex ecution. This function consists of the initialization part of the user code, such as reading data from an input file, etc.. Also, if there is any global data to be broad cast to all the worker nodes, the user code can initiate the system call, bc_send() to do the broadcast. 2. data_generator() - This function consists of the part of the user code that gener ates the tasks. In real-time applications, it could receive data from some device and generate the corresponding tasks. The tasks are passed to Pfarm for processing by the system call do_task 0. 3. result_receiver() - This function consists of the part of the user code that collects the results. The system call get_result 0 returns the next available result. This function could also include any processing of the results. Both data_generator() and resultreceiver() are called by low priority system processes. These functions are called in the beginning of the execution after creating all the processes and are run concurrently until they finish their respective jobs. In the case of applications in which later tasks depend on the results of the initial tasks, the user program could consist of only one function, data_generator 0. This function performs both the system calls, do_task 0 and get_result 0. Chapter 8. System Integration and Applications 126 •The user part of the worker code consists of the following two functions: • s].ave_init() - This function consists of any initialization part of the user code needed on each worker node. Also, if there is any broadcast of the global data, this function can do the system call, bc_receive() to receive the broadcast data. • comp_fn() This function contains the user code to process a task. It is called by - the worker process and takes a pointer to the task data and returns a pointer to the result and the size of the result. Ffarm provides the following system calls: 1. do_task(task_type, task_size, task_ptr) 2. result_ptr get_resu].t() 3. bc_send(bc_dat a_size, bc_data_ptr) 4. bc_data_ptr bc_receive() 8.3.2 TrEK TrEK provides the same system calls to the user as in the Pfarm case. The user part of the manager code to be run in the TrEK case is similar to that in the Ffarm case. The user part of the worker code includes the following two functions in addition to the comp_fn() described in the Pfarm case. • split_fn() - This function consists of the user code that splits a task and is called by the split process in the TrEK. It takes a pointer to the task as input and returns the pointers to the subtasks. • join_fn() - This function consists of the user code that combines the results of subtasks and is called by the join process in the TrEK. It takes the pointers to the subresults as input and returns a pointer to the combined result. As an example, of the user code for an FFT implementation that uses TrEK has been included in the following section. Chapter 8. System Integration and Applications 8.4 127 Applications Several applications have been developed using Pfarm and TrEK on our transputer-based system by other graduate students and myself. In this section, we discuss two interesting cases where the models were used to understand the performance of applications. Results reported here are for Logical systems versions. 8.4.1 Cepstral filtering Pfarm was used to parallelize a vision application that performs Cepstral filtering for motion analysis [BL93]. It takes two images of the same subject at different instances of time and determines the motion of the subject. The user code on the manager consists of two parts, a data-generator and a result-receiver. The data generator function partitions the images into small blocks and puts two corresponding blocks, one from each of the two images, into a single task. The result-receiver collects the results of the partial motion analysis and assembles them. In the following paragraph, we discuss how the models were used for performance tuning. Initially, the program was tested using two smaller images of 64 x 64 bits. The task execution time (Te) for images divided into blocks of 16 bits was measured by executing the program on a single worker node. For this case, the value of T was 178.048 ms. We were interested in using the program with larger images of 512 x 512 bits. The performance model was used to determine the best topology and the number of nodes to run this application for larger images. The model predicted that a 63-node balanced binary tree gives maximum performance with the total execution time of 3.141 seconds. We executed the application program on a 63 node binary tree and found that it took 4.492 seconds, which was considerably larger than the predicted value. To investigate the reasons for the large error, we ran the program on a linear chain and a ternary tree. The model predicted that on a 64 node linear chain, it would take 4.743 seconds and on a 40 node ternary tree 4.813 seconds. For these cases, the prediction was accurate since the measured total execution times were 4.517 and 4.756 seconds for chain and ternary tree cases respectively. Then, the program was executed on a 32 node binary tree and Chapter 8. System Integration and Applications 128 the prediction was found to be accurate. To investigate the reasons for the large error in the prediction for the 63-node binary tree case, we recorded the number of tasks executed on each node. We found that the leaf nodes executed a fewer tasks than the intermediate nodes. This occurs only when the system is communication bound, but according to the models, the system should not have reached the communication bound (based on the data and result sizes of the tasks). The only explanation for this scenario is that the system violated one of the assumptions in the model. On examination, we hypothesized that the data generator may not be generating the tasks at the rate at which the system can receive and process them. Therefore, we ran several experiments to measure the rate at which the data gen erator was generating the tasks, and found that the rate was 248.32 tasks/sec. However, according to the model, a 63-node binary tree for this application program can process tasks at the rate of 349.06 tasks/sec. This confirmed our suspicion that the large error occurred because the data generator was unable to keep up a continuous flow of tasks into the farm. For chain, ternary tree and smaller binary tree topologies the predictions were accurate because the task processing rate of these topologies were smaller than the rate at which tasks were generated. 8.4.2 Fast Fourier Transform (FFT) Divide-and-conquer strategy has been used to design efficient sequential FFT algorithms. In this example, we have used TrEK to parallelize a sequential algorithm that uses FFT for multipoint evaluation of a polynomial over a field [Sed83]. For this implementation, execution time starts increasing for binary trees with more than 4 levels as the through put reaches the communication bound given by equation (6.18) between levels 3 and 4. The sequential FFT algorithm {Sed83] is reproduced below: Algorithm FFT(N, a(x), w, A) if N = 1 then ; 0 a else /* split */ n := N/2; Chapter 8. System Integration and Applications 129 ajxZ; b(x) Zril c(x) : Z’ /* recursive calls */ ,B); 2 FFT(n,b(x),w , C); 2 FFT(m, c(x), w /* combine */ for Ic := 0 to ri 1 do Ak := Bk + wkCk; Ak+n := Bk — — endfor endif In order to show the interaction between TrEK and the user code, the user code that implements this FFT algorithm has been included. data_generator 0 { for (i=1; i<TotalJobs; i++) { /* prepare the data for a task */ /* send the task to TrEK by calling do_task */ do_task(task_type, user_data_size, ptr); }} result_receiver() { user_result *ptr; for (i1; i<=TotalJobs; i++) { 1* get the next available result */ ptr = (userresult *) get_resultO; /* process the result */ }} comp_fn(user_data *ptr, { mt *ur_size, char **ur_ptr) /* call the fft function */ fftl(N, fptr, A); /* set the argument values to be returned*/ *ur_sjze = user_result_size; *ur_ptr = (char*) p; } Chapter 8. System Integration and Applications 130 split.in(ptr, split_datasize, ptrs) user_data *ptr; mt *split_datasize; char *ptrs [1; { /* split the task */ /* set the argument values to be returned*/ ptrs[0] (char *) ptrl; ptrs[1] = (char *) ptr2; *split_datasize = user_data_size; } join_fn(jb_userres, jres_size, ptr) char *jb_userres [TASK_DEGI; mt *jres_size; char **ptr; { /* join the results */ /* set the argument values to be returned*/ *ptr = (char *) p; *jres_size = user_result_size; } After parallelizing this algorithm using TrEK, we ran the program on a single node with smaller N (32 and 64), and used the technique described in Section 8.2 to determine the values of the application dependent parameters. As the sequential algorithm is an 0(NlogN) algorithm, we set Te(N) aN + bNlogN. Because the split and join costs in this algorithm are 0(N), we set (N)+Tj(N) 3 T c+dIV. We calculated the values of the constants (a, b, c and d) for this implementation using the measured execution times for smaller N. The values of these constants are: a b 0.000091, c = 0.001695 and d = = 0.000330, 0.000082. The models were used to predict the performance of this implementation for larger N (128, 256, 512 and 1024). From the models, we found that the throughput for any N Chapter 8. System Integration and Applications 131 Problem Size N Lower bound Time Measured Time Measured Speedup 128 256 512 102 13.55 24.05 45.06 87.06 16.841 28.620 52.864 102.820 7.32 9.46 11.13 12.42 Table 8.1: Experimental results for FFT on a 16-node binary tree reaches the communication bound given by equation (6.18) for binary trees of 4 levels and ternary trees of 3 levels. The system reaches this bound because of the split and join costs at the root processor. If a larger tree is used, the root processor would be unable to split and forward the tasks at the rate in which the rest of the processors can process the tasks. The only way to improve the performance in this case is to optimize the code for split and join functions. We verified these predictions by experimenting with larger N (see Table 8.1). As predicted, the measured total execution time did not decrease when we increased the size of a binary topology from 4 levels to 5 levels. Actually, the measured execution time increased because of the reason described in Section 7.3.4. Chapter 9 Conclusions This dissertation has explored a parallel programming approach that addresses the need of providing a programming environment that is easy to use, efficient and supports per formance tuning on multicomputers. In this approach, users are provided with program ming support based on parallel programming paradigms. We have studied two commonly used parallel programming paradigms: processor farm and divide-and-conquer. Runtime system support for these two paradigms are designed such that they are easy-to-use and can maximize the performance for applications that fit these paradigms. Performance models are derived for these systems taking into account the computation and com munication characteristics of the applications that fit the paradigm in addition to the characteristics of the hardware and software system. The models determine the parame ters that affect the performance and can be used for performance prediction and tuning. This work has contributed to our understanding of these systems and their limitations. In designing reusable and efficient runtime systems, many trade-offs have to be con sidered. In Chapter 4 and 6, we have described the trade-offs involved in the design of Pfarm and TrEK respectively. Hiding the complexities of the underlying hardware and software system is the major consideration in the design of these runtime systems. These systems include all the necessary code for the system dependent issues such as communication, synchronization, task scheduling, and load balancing. Thus, users can concentrate on the application dependent compute intensive code. In order to be effi cient, runtime systems are designed to make use of all the available parallelism in the 132 Chapter 9. Conclusions 133 hardware system such as the ability to simultaneously communicate on all the links. The system overheads that limit the overall performance of the applications are kept to a minimum. Both systems implement distributed dynamic task scheduling strategies so that they can work well even for applications that can be decomposed into tasks with varying computational requirements. The systems are designed such that they are topology independent, i.e., they can scale and run on any processor topology. It is difficult to obtain a single performance model that can be used for all the applications on a parallel system. However, it is possible to derive good performance models for each of the virtual machines as every paradigm is a restricted model of parallel computation. Performance models for processor farm and divide-and-conquer virtual machines have been derived in Chapter 4 and 6 respectively. These models take into account the computation and communication characteristics of the applications that fit the paradigm in addition to the characteristics of the hardware and software system. General analytical frameworks that can be used to predict the performance on any tree topology have been presented for both of these paradigms. As both of these task-oriented systems behave like a pipeline, it is important to analyze start-up and wind-down. For the processor farm case, we have shown that, on a fixed topology, a breadthfirst spanning tree provides maximum performance and steady-state performance of all breadth-first spanning trees are equal. As balanced tree topologies provide maximum performance in the case of reconfigurable systems, we have derived performance models for these topologies using the general analytical framework. TrEK can execute divide-and-conquer computations of any degree and depth on any arbitrary tree topology. Unlike idealized parallel implementations of divide-and-conquer algorithms on tree processors [HZ83, Co189], TrEK allows intermediate processors to do subtask processing to make use of all the available parallelism in the hardware system. The analytical framework assumes a flow of divide-and-conquer tasks. As explained in Section 3.4, this framework works well even for applications that consist of a single divide-and-conquer computation with large degree and depth compared to the underlying hardware topology. Experimentally, we have found that, on a fixed topology, a breadth first spanning tree with maximum number of leaves obtains maximum performance. As Chapter 9. Conclusions 134 balanced tree topologies provide maximum performance in the case of reconfigurable systems, we have derived models that can predict performance of any fixed k-ary divideand-computations on any g-ary balanced tree topology. Pfarm and TrEK have been implemented on a 75 node transputer-based multicom puter. They are implemented using C on two different software environments: Logical Systems and Trollius. As Pfarm and TrEK provide standard interfaces to the user code irrespective of the environment they are implemented on, the user code is portable from one system to the other. Performance models have been experimentally validated using Pfarm and TrEK. The models are found to be accurate as reported in Chapters 5 and 7. In order to use the models, it should be possible to determine the values of the param eters in an easy way. We have explained the techniques that can be used to determine the values of the system dependent and application dependent parameters. We have discussed how the models can be used in predicting and tuning the performance. It is possible to use the processor farm strategy to parallelize some divide-and-conquer applications. Divide-and-conquer strategy performs well compared to the processor farm strategy for applications with larger tasks and for those that consist of a smaller number of tasks. Performance models can be used to determine the strategy to be used for a given application. In order to make it easier for application programmers to use runtime systems on a multicomputer, they must be provided with a programming environment that supports all phases of program development and execution. In Chapter 8, we have described such an integrated programming environment, Parsec, developed on our transputer-based multicomputer. In addition to providing Pfarm and TrEK templates to application programmers, Parsec supports tools such as a graphical interface, mapper, loader and debugger We have discussed how Pars cc supports reusability, reconfigurability and per formance tuning. Chapter 9. Conclusions 9.1 135 Future Directions This research can be continned in several directions to further explore the usefulness of this approach for parallel programming. We have mentioned that the same design can be used for implementing Ffarm and TrEK on any distributed memory parallel computer that has the characteristics detailed in Chapter 3. Also, the performance models derived here can be used for any system that satisfies the assumptions used in the models. By implementing Pfarm and TrEK on two different software environments, Logical Systems and Trollius, we have shown that the user programs are portable and the same models can be used for both implementations using appropriate parameter values. The claims of being able to use the same design and modeling on any multicomputer system in addition to user programs being portable can be further strengthened by implementing the rnntime systems on other hardware platforms, such as C40 based machines. Developing more application programs with these runtime systems in several application areas could further support the usefulness of these runtime systems. It is interesting to research the expressiveness of the task-oriented paradigms, pro cessor farm and divide-and-conquer, i.e., whether these paradigms and the associated implementations can be modified or enhanced to make use of them for applications that may not exactly fit the underlying computational models. Pfarm system can be used for applications with differing characteristics as discussed in Chapter 8. It is possible to modify the same design to develop a runtime system that can be used for the Task Queue paradigm. The applications that fit the Task Queue paradigm consist of an initial set of tasks, which could be allocated to various processors in the system. These tasks may generate new tasks that have to be processed. Unlike in the case of divide-and-conquer, the results of these new tasks need not be joined by the parent. In this case, the task scheduling and load balancing has to be different from that of Pfarm. Each processor can keep a local task queue to which all the new tasks are added. It can exchange the load information with its neighbors and transfer the tasks to the lightly loaded neighbors, if necessary. Deriving performance models for such systems Chapter 9. Conclusions 136 may need probabilistic models that reflect the task generation. It is possible to expand this approach to develop virtual machines that support applications that fit other parallel programming paradigms such as Compute-Aggregate Broadcast, Systolic and Dynamic Programming. Also, in practice, some applications may consist of several phases each of which may need different programming paradigms. This work can be expanded by developing programming environment that supports such applications. There are many issues to be considered here such as how the different systems exchange the data and results, whether they can be co-existent on the system or they should be interleaved. Bibliography [ACG91] S. Ahmed, N. Carriero, and D. Gelernter. The Linda Program Builder. In A. Nicolau, D. Gelernter, T. Gross, and D. Padua, editors, Advances in Languages and Compilers for Parallel Processing, Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Massachusetts, 1991. {AG89] D. Atapattu and D. Gannon. Building Analytical Models into an Interactive Performance Prediction Tool. Proc. Supercomputing ‘89, pages 521—530, November 1989. Indiana U. [AHU74j A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974. {A1v90] G. A. Alverson. Abstractions for Effectively Portable Shared Memory Par allel Programs. Technical Report 90-10-09, Dept. of Computer Science and Engineering, University of Washington, October 1990. [Amd67] G. M. Amdahl. Valildity of the Single-processor Approach to Achieving Large Scale Computing Capabilities. In AFIPS Conference Proceedings. AFIPS Press, Reston, Va., 1967. [Ar188] R. Arlauskas. iPSC/2 System: A Second Generation Hypercube. In Pro ceedings of the Third Conference on Hypercube Concurrent Computers and Applications, pages 38—50. ACM Press, January 1988. [A588] W. C. Athas and C. L. Seitz. Multicomputers: Message-passing concurrent computers. IEEE Computer, August 1988. {BBFB89] Moshe Braner, Gregory D. Burns, David L. Fielding, and James R. Beers. Trollius OS for Transputers. In D. Stiles, editor, Transputer Research and Applications (NA TUG 1), 1989. [Ben8O] J.L. Bentley. Multidimensional Divide-and-Conquer. 23:214—229, 1980. [BFKK91] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer. A Static Perfor mance Estimator to Guide Data Partitioning Decisions. In Proc. 8th ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), pages 213—223, April 1991. 137 Commun. A CM, Chapter 9. Conclusions 138 [BL93] E. Bandari and J. J. Little. Multi-evidential Correlation & Visual Echo Analysis. Technical Report TR 93-1, Department of Computer Science, University of British Columbia, Vancouver, Canada, January 93. [BTU88] R. D. Beton, S. P. Turner, and C. Upstill. A State-of-the-Art Radar Pulse Deinterleaver A Commercial Application of Occam and the Transputer. In C. Askew, editor, Occam and the Transputer Research and Applications, pages 145—152. lOS Press, 1988. - - [CCT93] A.G. Chalmers, N.W. Campbell, and B.T. Thomas. Computational Models for Real-Time Tracking of Aircraft Engine Components. In S. Atkins and A. Wagner, editors, Transputer Research and Applications 6 (NA TUG 6). lOS Press, May 1993. [CG89] N. Carriero and D. Gelernter. How to Write Parallel Programs: A Guide to the Perplexed. ACM Computing Surveys, pages 323—357, September 1989. [CGJ91j S. Chanson, N. Goldstein, J. Jiang, H. Larsen, H. Sreekantaswamy, and A. Wagner. TIPS: Transputer-based Interactive Parallelizing System. In Transputing ‘91 Conference Proceedings, Sunnyvale, Calf. lOS Press, April 1991. [CHvdV88] N. Carmichael, D. Hewson, and J. van der Vorst. A Prototype Simulator Output Movie System based on Parallel Processing Technology. In C. Askew, editor, Occam and the Transputer Research and Applications, pages 169— 175. lOS Press, 1988. - [CK88] D. Callahan and K. Kennedy. Compiling Programs for Distributed-memory Multiprocessors. Journal of Supercomputing, 2:151—169, October 1988. [CK91] K. M. Chandy and C. Kesselman. Parallel programming in 2001. IEEE Software, pages 11—20, November 1991. [Cok9l] Ronald S. Cok. Parallel Programs for the Transputer. Prentice-Hall, 1991. [Co189] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Com putation. MIT Press, Cambridge, Massachusetts, 1989. [CU9O] I. Cramb and C. Upstill. Using Transputers to Simulate Optoelectronic Computers. In S.J. Turner, editor, Tools and Techniques for Transputer Applications (OUG 12). lOS Press, April 1990. [Dav85] R. E. Davis. Logic Programming and Prolog. IEEE Software, pages 53—62, September 1985. [Den8O] J. B. Dennis. Data Flow Supercomputers. Computer, pages 48—56, Novem ber 1980. {Dun9OJ R. Duncan. A Survey of Parallel Computer Architectures. IEEE Computer, pages 5—16, February 1990. Chapter 9. Conclusions 139 [EZL89] D.L. Eager, J. Zahorjan, and E.D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Transactions on Computers, March 1989. [F90] D.L. Fielding et al. The Trollius Programming Environment for Multicomputers. In A. Wagner, editor, Transputer Research and Applications (NATUG 3). lOS Press, April 1990. [FJL88] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors (vol. 1). Prentice Hall, New Jersey, 1988. [FK89] H. P. Flatt and K. Kennedy. Performance of parallel processors. Parallel Computing, 12:1—20, 1989. [F1y72J M. J. Flynn. Some computer organizations and their effectiveness. IEEE Trans. Computers, C-21 (9) :948—960, September 1972. {FO9O] I. Foster and R. Overbeek. Bilingual Parallel Programming. In A. Nicolau, D. Glelernter, T. Gross, and D. Padua, editors, Advances in Languages and Compiler for Parallel Processing, Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Massachusetts, 1990. [FSWC92] D. Feldcamp, H. V. Sreekantaswamy, A. Wagner, and S. Chanson. Towards a Skeleton-based Parallel Programming Environment. In A. Veronis, editor, Transputer Research and Applications NA TUG 5. lOS Press, April 1992. [FT9O] I. Foster and S. Taylor. Strand: New Concepts in Parallel Programming. Prentice Hall, New Jersey, 1990. [FW93] D. Feldcamp and A. Wagner. Parsec: A Software Development Environ ment for Performance Oriented Parallel Programming. In S. Atkins and A. Wagner, editors, Transputer Research and Applications 6 (NA TUG 6). lOS Press, May 1993. [Gab9O] E. Gabber. VMPP: A Practical Tool for the Development of Portable and Efficient Programs for Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(3):304—317, July 1990. [Ga190] J. Galletly. Occam2. Pitman Publishing, 1990. [GC92J D. Gelernter and N. Carriero. Coordination Languages and their Signifi cance. Commun. ACM, 35(2):97—107, February 1992. [GK92] A. Gupta and V. Kumar. Analyzing Performance of Large Scale Parallel Systems. Technical Report TR 92-32, Dept. of Computer Science, University of Minnesota, October 1992. {GKP89] R. L. Graham, D. E. Knuth, and 0. Patashnik, Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, Mass., 1989. Chapter 9. Conclusions 140 [GM85] D. H. Grit and J. R. McGraw. Programming Divide and Conquer for a MIMD Machine. Software-Practice and Experience, 15(1):41—53, January 1985. [GR88] A. Gibbons and W. Rytter. Efficient Parallel Algorithms. Cambridge Uni versity Press, Cambridge, 1988. [Gus88] J. L. Gustafson. Reevaluating Amdahl’s Law. Commun. ACM, 31(5):532— 533, May 1988. [Hey9O] A. J. G. Hey. Experiments in MIMD Parallelism. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE: Parallel Architectures and Languages Europe, pages 28—42. Springer-Verlag, New York, 1990. Lecture Notes in Computer Science 366. [HKT91I S. Hiranandani, K. Kennedy, and C. Tseng. Compiler Support for Machineindependent Parallel Programming in Fortran D. In J. Saltz and P. Mehro tra, editors, Compilers and Runtime Software for Scalable Multiprocessors. 1991. [Hom93j D. Homeister. An Adaptive Granularity Scheduler for Multiprocessors. In S. Atkins and A. Wagner, editors, Transputer Research and Applications 6 (NA TUG 6). lOS Press, May 1993. [HU79] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Lan guages, and Computation. Addison-Wesley, Reading, MA, 1979. [HZ83] E. Horowitz and A. Zorat. Divide-and-Conquer for Parallel Processing. IEEE Transactions on Computers, 32(6):582—585, June 1983. [Inc9l] Texas Instruments Incorporated. TMS 32OC4X User’s Guide. 1991. [1nc92] Texas Instruments Incorporated. TMS 2OCX Parallel Runtime Support Library: User’s Guide. 1992. [K87j H.T. Kung et al. The Warp Computer: Architecture, Implementation, and Performance. IEEE Transactions on Computers, C-36(12):1523—1538, De cember 1987. [K88J H.T. Kung et al. iWARP: An integrated solution to high-speed parallel com puting. In Proceedings of Supercomputing ‘88, Orlando, FL. IEEE, Com puter Society Press, 1988. [K90] H.T. Kung et al. Supporting systolic and memory communication in iWarp. In Proceedings of the 17th Intl. Symposium on Computer Architecture. IEEE, Computer Society Press, 1990. [Kar87] A. H. Karp. Programming for Parallelism. IEEE Computer, pages 43—57, May 1987. Chapter 9. Conclusions 141 [KF9O] A. H. Karp and H.P. Flatt. Measuring Parallel Processor Performance. Commun. ACM, 33(5):539—543, May 1990. {Kog85] P. M. Kogge. Function-based Computing and Parallelism: A Review. Par allel Computing, 2:243—253, 1985. [KR87] V. Kumar and V. N. Rao. Parallel Depth-first Search, Part II: Analysis. International Journal of Parallel Programming, 16(6):501—519, 1987. [KT88] M. Kallstorm and S. S. Thakkar. Programming thre Parallel Computers. IEEE Software, 5(1):11—22, Jan 1988. [Kun89] H. T. Kung. Computational Models of Parallel Computers. In R. J. El liot and C.A.R. Hoare, editors, Scientific applications of multiprocessors. Prentice Hall, 1989. [Lim84J INMOS Limited. Occam Programming Manual. Prentice Hall International, New Jersey, 1984. [1im87] Meiko limited. Computing Surface Technical Specifications. Bristol, UK, 1987. [Lim88] INMOS Limited. Transputer Development System. Prentice Hall, New Jer sey, 1988. [Moc88] Jeffrey Mock. Process, channels and semaphores (version 2). Logical Sys tems C Programmers Manual, Logical Systems, 1988. [MS88] D. L. McBurney and M. R. Sleep. Transputer-based Experiments with the ZAPP Architecture. In J. W. de Bakker et al, editor, Lecture notes in Computer Science. 1988. [Mu193j S. Mulye. Communication Optimization in Parsec: A Programming Envi ronment for Multicomputers. Master’s thesis, Dept. of Computer Science, University of British Columbia, 1993. in preparation. [Ne187] P. A. Nelson. Parallel Programming Paradigms. Ph.D. thesis, Department of Computer Science , University of Washington, Seattle, WA, 1987. [NRFF93] D.F. Garcia Nocetti, M.G. Ruano, P.J. Fish, and P.J. Fleming. Parallel Implementation of a Parametric Spectral Estimator for a Real-time Doppler Flow Detector. In S. Atkins and A. Wagner, editors, Transputer Research and Applications 6 (NA TUG 6). lOS Press, May 1993. {Pri87] D. Pritchard. Mathematical Models of Distributed Computation. In Pro ceedings of the 7th Occam User Group Technical Meeting. lOS, 1987. [Pri9O] D. J. Pritchard. Performance Analysis and Measurement on Transputer Arrays. In Aad van der Steen, editor, Evaluating Supercomputers. Chapman and Hall, 1990. Meiko Ltd., 142 [PW86] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Su percomputers. Commun. ACM, 29(12):1184—1201, December 1986. [RSV9O] M. Rao, Z. Segall, and D. Vrsalovic. Implementation Machine Paradigm for Parallel Programming. In Proceedings of Supercomputing ‘90, pages 594— 603. IEEE, Computer Society Press, 1990. [5CW92j H. V. Sreekantaswamy, S. Chanson, and A. Wagner. Performance Prediction Modeling of Multicomputers. In Proceedings of the Tweith International Conference on Distributed Computing Systems (lCD CS), June 1992. [Sed83] R. Sedgewick. Algorithms. Addison-Wesley, Reading, MA, 1983. [Sei85] C.L. Seitz. The cosmic cube. Commun. ACM, 28(1):22—33, Jan 1985. [Son93] Y. Song. The Implementation of Sequential and Parallel Algorithms for Solving Almost Block Bidiagonal Systems. Master’s thesis, Dept. of Com puter Science, University of British Columbia, March 1993. [SR85] Z. Segall and L. Ruldolph. PIE: A Programming and Instrumentation En vironment for Parallel Processing. IEEE Software, 2(6):22—37, November 1985. [SS91] S. S. Sturrock and I. Salmon. Application of Occam to Biological Sequence Comparisons. In J. Edwards, editor, Occam and the Transputer Current Developments, pages 181—190. lOS Press, 1991. - [Sto87] F. Stout. Supporting Divide-and-Conquer Algorithms for Image Pro cessing. Journal of Parallel and Distributed Computing, 4:95—115, February 1987. [Sto88] H. Stone. High Performance Computers. Addison-Wesley, 1988. [Sys9O] Logical Systems. C Programmers Manual. Corvallis, OR, 1990. [TD9O] R.W.S. Tregidgo and A.C. Downton. Processor Farm Analysis and Simu lation for Embedded Parallel Processing Systems. In S.J. Turner, editor, Tools and Techniques for Transputer Applications (OUG 1). lOS Press, April 1990. [U1l84] J.D. Ullman. Computational Aspects of VLSI. Computer Science Press, Rockville, MD, 1984. [WSC93I A. Wagner, H. Sreekantaswamy, and S. Chanson. Performance Issues in the Design of a Processor Farm Template. In Proceedings of the World Transputer Conference. lOS Press, 1993. To appear. Q. 143 Glossary Average processor overhead for a locally processed task. Average processor overhead for a forwarded task. r Communication rate of the links. The fraction of the total number of tasks executed by a processor i. M Total number of tasks in an application program. N The total number of processors in the system. SD Throughput of a D-level tree. SPD Speedup of a D-level tree. Td Average communication time required to transfer a task from a processor to its neighbor. Td Average communication time required to transfer a result from a processor to its neighbor. T Average processing time per task in the processor farm case. Te(j) Average processing time per task or subtask at the ith level of the hardware topology in the divide-and-conquer case. Tj (i) Average result joining time at the ith level of the hardware topology in the divide-and-conquer case. (i) 8 T Average task splitting time at the ith level of the hardware topology in the divide-and-conquer case. 88 T Steady-state execution time. 3 T Start-up time. Ttotai Total execution time. Td Wind-down time. Number of tasks or subtasks that visit processor i.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- An integrated approach to programming and performance...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
An integrated approach to programming and performance modeling of multicomputers Sreekantaswamy, Halsur V. 1994
pdf
Page Metadata
Item Metadata
Title | An integrated approach to programming and performance modeling of multicomputers |
Creator |
Sreekantaswamy, Halsur V. |
Date Issued | 1994 |
Description | The relative ease with which it is possible to build inexpensive, high-performance multicomputers using regular microprocessors has made them very popular in the last decade. The major problem with multicomputers is the difficulty in effectively programming them. Programmers are often faced with the choice of using high level programming tools that are easy to use but provide poor performance or low level tools that take advantage of specific hardware characteristics to obtain better performance but are difficult to use. In general, existing parallel programming environments do not provide any guarantee of performance and they provide little support for performance evaluation and tuning. This dissertation explores an approach in which users are provided with programming support based on parallel programming paradigms. We have studied two commonly used parallel programming paradigms: Processor Farm and Divide-and-Conquer. Two runtime systems, Pfarm and TrEK, were designed and developed for applications that fit these paradigms. These systems hide the underlying complexities of multicomputers from the users, and are easy-to-use and topology independent. Performance models are derived for these systems, taking into account the computation and communication characteristics of the applications in addition to the characteristics of the hardware and software system. The models were experimentally validated on a large transputer-based system. The models are accurate and proved useful for performance prediction and tuning. Pfarm and TrEK were integrated into Parsec, a programming environment that supports program development and execution tools such as a graphical interface, mapper, loader and debugger. They have also been used to develop several image processing and numerical analysis applications. |
Extent | 2177906 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-04-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051473 |
URI | http://hdl.handle.net/2429/6951 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1994-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1994-893923.pdf [ 2.08MB ]
- Metadata
- JSON: 831-1.0051473.json
- JSON-LD: 831-1.0051473-ld.json
- RDF/XML (Pretty): 831-1.0051473-rdf.xml
- RDF/JSON: 831-1.0051473-rdf.json
- Turtle: 831-1.0051473-turtle.txt
- N-Triples: 831-1.0051473-rdf-ntriples.txt
- Original Record: 831-1.0051473-source.json
- Full Text
- 831-1.0051473-fulltext.txt
- Citation
- 831-1.0051473.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051473/manifest