An Integrated Approach to Programming andPerformance Modeling of MulticomputersbyHALSUR V. SREEKANTASWAMYB.Engg., University of Mysore, India, 1981M.Tech., Indian Institute of Technology, Kanpur, India, 1983A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYIN THE FACULTY OF GRADUATE STUDIESDEPARTMENT OF COMPUTER SCIENCEWe accept this thesis as conforming to the required standardTHE UNIVERSITY OF BRITISH COLUMBIAOctober, 1993© Halsur V. Sreekantaswamy, 1993In presenting this thesis in partial fulfillment of the requirements for an advanced degree at theUniversity of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarlypurposes may be granted by the head of my department or by his or her representatives. Itis understood that copying or publication of this thesis for financial gain shall not he allowedwithout my written permission.Computer ScienceThe University of British Columbia2075 Wesbrook PlaceVancouver, CanadaV6T 1Z1Date: IlOc€ T, 1?3AbstractThe relative ease with which it is possible to build inexpensive, high-performancemulticomputers using regular microprocessors has made them very popular in the lastdecade. The major problem with multicomputers is the difficulty in effectively programming them. Programmers are often faced with the choice of using high level programming tools that are easy to use but provide poor performance or low level tools thattake advantage of specific hardware characteristics to obtain better performance but aredifficult to use. In general, existing parallel programming environments do not provideany guarantee of performance and they provide little support for performance evaluationand tuning.This dissertation explores an approach in which users are provided with programmingsupport based on parallel programming paradigms. We have studied two commonlyused parallel programming paradigms: Processor Farm and Divide-and-Conquer. Tworuntime systems, Pfarm and TrEK, were designed and developed for applications thatfit these paradigms. These systems hide the underlying complexities of multicomputersfrom the users, and are easy-to-use and topology independent. Performance modelsare derived for these systems, taking into account the computation and communicationcharacteristics of the applications in addition to the characteristics of the hardware andsoftware system. The models were experimentally validated on a large transputer-basedsystem. The models are accurate and proved useful for performance prediction andtuning.Pfarm and TrEK were integrated into Parsec, a programming environment thatsupports program development and execution tools such as a graphical interface, mapper,loader and debugger. They have also been used to develop several image processing andnumerical analysis applications.UContentsAbstract iiTable of Contents iiiList of Tables viiiList of Figures xAcknowledgements xii1 Introduction 11.1 Motivation 21.2 Methodology 31.3 Synopsis of the Dissertation 52 Background and Related Work 82.1 Parallel Programming Approaches 82.1.1 High Level Approaches 92.1.2 Low level Approaches 102.1.3 Other Approaches 102.1.4 Parallel Programming Paradigms 111112.2 Performance Modeling 142.2.1 Performance Measures and Models 152.2.2 Integration 183 Methodology 213.1 An Integrated Approach 223.2 System Model 233.3 Experimental Testbed 243.3.1 Hardware System 243.3.2 Software Environments 253.4 Task-oriented Paradigms 263.4.1 Processor Farm 263.4.2 Divide-and-Conquer 293.5 Chapter Summary 324 Processor Farm: Design and Modeling 334.1 Pfarm: Design and Implementation 344.1.1 Process Structure and Scheduling 354.1.2 Task Scheduling 364.1.3 Buffering 374.1.4 Topology Independence 384.2 Performailce Modeling 384.2.1 General Analytical Framework 394.2.2 Balanced Tree Topologies 494.2.3 Communication Bound 57iv4.3 Discussion4.3.1 Optimal N and Topology4.3.2 Problem Scaling4.3.3 Granularity4.4 Chapter Summary5 Processor Farm: Experiments5.1 Determining System Overheads5.2 Arbitrary Topologies5.3 Balanced Tree Topologies5.3.1 Steady-state Validation5.3.2 Start-up and Wind-down Validation5.4 Robustness5.5 Chapter Summary6 Divide-and-Conquer: Design and Modeling6.1 TrEK: Design and Implementation6.2 Performance Modeling6.2.1 Arbitrary Tree Topologies6.2.2 Balanced Tree Topologies6.2.3 Communication Bounds6.3 Discussion6.3.1 Optimal N and Topology6.3.2 Problem Scaling6.4 Chapter Summary808085869398999910110158586162636565677071767779V7 Divide-and-Conquer: Experiments7.1 Determining System Overheads7.2 Arbitrary Topologies7.3 Balanced Tree Topologies7.3.1 Steady-State7.3.2 Start-up and Wind-down7.3.3 k-ary Tasks on g-ary Balanced Topologies7.3.4 Variable Split and Join Costs7.3.5 Robustness7.4 Comparison of Divide-and-Conquer with Processor7.5 Chapter Summary8 System Integration and Applications8.1 Parsec: An Integrated Programming Environment8.2 Performance Tuning8.2.1 Parameter Measurements8.2.2 Performance Analysis Library8.3 User Interface8.3.1 Pfarm8.3.2 TrEK8.4 Applications8.4.1 Cepstral filtering8.4.2 Fast Fourier Transform (FFT)9 Conclusions 132103104106107108108111112114Farm 115117118118122122124124125126127127128vi9.1 Future Directions.135Bibliography 137Glossary 143viiList of Tables5.1 Comparison of predicted and measured results for 8 x 3 mesh 695.2 Comparison of predicted and measured results for 8 x 8 mesh 695.3 Range of processor farm experiments 705.4 Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Linear Chain 725.5 Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Binary Tree 735.6 Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Ternary Tree 745.7 Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Linear Chain under Communication Bound . . 755.8 Comparison of Predicted and Measured Total Execution Time for Processor Farm running on Binary Tree under Communication Bound . . . 755.9 Comparison of Predicted and Measured Total Execution Time ( Start-up& Wind-down) for Processor Farm running on Linear Chain, Binary Treeand Ternary Tree 765.10 Comparison of Predicted and Measured Total Execution Time for uniformtask distribution for Processor Farm running on Linear Chain 777.1 Performance Comparison of three different BFSTs of the 8 x 3 mesh. . . . 1067.2 Range of Divide-and-Conquer steady-state Experiments 108vi”Comparison forComparison forfor Divide-and-for Divide-and-Time for Divide-Time for Divide-Time for BinaryTime for Divide-Time for Binary7.3 Steady-state Performanceon Binary Tree7.4 Steady-state PerformanceDivide-and-Conquer running109Divide-and-Conquer running109on Ternary Tree.7.5 Start-up and Wind-down Performance ComparisonConquer running on Binary Tree7.6 Start-up and Wind-down Performance ComparisonConquer running on Ternary Tree7.7 Comparison of Predicted and Measured Total Executionand-Conquer running on Binary Tree with M = 1000.7.8 Comparison of Predicted and Measured Total Executionand-Conquer running on Ternary Tree with M =1000.7.9 Comparison of Predicted and Measured Total ExecutionDivide-and-Conquer tasks running on Ternary Tree7.10 Comparison of Predicted and Measured Total Executionand-Conquer Tasks with Variable Split & Join Costs7.11 Comparison of Predicted and Measured Total ExecutionDivide-and-Conquer Under Split and Join Bound . .1101101111111121131147.12 Comparison of Predicted and Measured Total Execution Time for BinaryDivide-and-Conquer Tasks with Subtasks of Unequal Size 1157.13 Comparison of Total Execution Time for Binary Divide-and-Conquer Applications with TrEK and Pfarm 1168.1 Experimental results for FFT on a 16-node binary tree 131ixList of Figures3.1 Ideal Manager-Workers Architecture 274.1 Process Structure on a Worker Processor in Pfarm 364.2 An example of the steady-state analysis 414.3 (a) node graph (b) process graph (c) subtree decomposition 434.4 Binary and Ternary Trees with D 4 and 3 514.5 The affect of /3 on efficiency 554.6 Plot of throughput curves for a linear chain (with Te = lOms, /3e = 482ts,!3e = 453is) 594.7 Comparison of processor farm throughput on linear chain, binary tree andternary tree configurations 604.8 Measured speedup for processor farm on linear chain 624.9 The affect of granularity on speedup 635.1 Configurations for determining 13e and 13j 665.2 Three breadth-first spanning trees of the 8 x 3 mesh 685.3 Error graph for processor farm on linear chain with tasks of bimodaldistribution 786.1 Divide-and-Conquer Task Structure 816.2 TrEK Process Graph on an Intermediate Worker Processor 82x6.3 An example of the steady-state analysis 886.4 (a) node graph (b) process graph (c) subtree decomposition 896.5 Plot of throughput curves for Binary Divide-and-Conquer Tasks on BinaryTree 1006.6 Comparison of divide-and-conquer throughput on binary tree and ternarytree topologies 1016.7 Measured speedup for divide-and-conquer on binary tree 1027.1 Configurations for determining /3e and 13f 1057.2 Three breadth-first spanning trees of the 8 x 3 mesh 1078.1 Graphical Interface to Pfarm in Parsec 121xiAcknowledgementsFirst and foremost, I would like to thank both my supervisors Dr. Samuel Chansonand Dr. Alan Wagner for their guidance, support and encouragement throughout mythesis research. I was very fortunate to have Dr. Wagner as my supervisor, whosedirection and continuous involvement made it possible for my thesis research to takethis final shape. I thank the members of my supervisory committee Dr. Mabo Ito andDr. James Little for their valuable input to my research.I thank Mandeep Dhami, David Feldcamp, Norm Goldstein, Kunhua Lin, SameerMulye and other past members of the parallel computation group for their cooperationduring this research. I appreciate all the help provided by the system and administrativestaff during my stay in the department. In addition, I would like to thank Chris Healeyfor proofreading parts of this thesis.My thanks go to many fellow graduate students (Don Acton, Ian Cavers, Parag Jam,Sree Rajan, Vishwa Ranjan to name a few) for their friendship, help and support. Manythanks are also extended to numerous friends in UBC and outside who made our stayin Vancouver enjoyable.I would like to acknowledge the financial support provided by the Canadian Commonwealth Scholarship and Fellowship Administration. I also thank the Government ofIndia for sponsoring me for the fellowship.Finally, I thank my wife Latha for her love, endless support and patience during thelast five years.xiiChapter 1IntroductionParallel processing is becoming popular with the advent of inexpensive, powerful microprocessors made possible by advances in VLSI technology. Several kinds of parallelcomputer architectures [Dun9Oj have been proposed and built. These parallel computershave been used successfully to achieve remarkable performance for applications in severalareas including scientific computing, and signal and image processing. The domain ofparallel computer architectures includes Single Instruction Multiple Data (SIMD) machines, and Shared memory and Distributed memory Multiple Instruction Multiple Data(MIMD) machines {F1y72].Distributed memory MIMD machines, generally known as multicomputers [AS88],consist of a number of processors each with their own local memory, connected bya message-passing network. Several research and commercial multicomputers suchas Hypercubes[Sei85], Transputer-based systems{Lim88] and iWARP[K+90j have beenavailable since the mid 1980s. Multicomputer architectures have several advantages.These machines are able to take advantage of the latest and fastest microprocessor technology making them cost-effective in comparison to other parallel architectures. Theyare easily scalable compared to other architectures. In the case of reconfigurable machines, it is possible to take advantage of the communication patterns of the problem toimprove performance. With these advantages, multicomputers are gaining importanceas general purpose parallel machines useful for applications in a wide range of areas[FJL+88]. The focus of this research is on multicomputers and their effective use.1Chapter 1. Introduction 21.1 MotivationThe major stumbling block to the widespread use of multicomputers is the tremendousdifficulty in effectively programming them. Software development for multicomputershas not kept pace with the advances in hardware technology [Kar87, KT88, CK91].Message passing gives finer control over resources, but at the cost of added complexity.Users must address difficult problems such as partitioning, mapping, communication,synchronization, data distribution, and load balancing.There are a few high level languages based on functional and logic programmingmodels [Kog85, Dav85, FT9O]. Those that are available provide high level abstractionswith universal interfaces to all applications, but their overall performance is generallypoor because of the difficulties in taking full advantage of the underlying structure ofthe application and the architecture. Most of the recent work on parallelizing compilers [PW86, CK88, HKT91] has focussed on extracting loop level and lower levels ofparallelism. They are restricted to exploiting parallelism in certain loop structures andthus can improve performance only for certain problems such as SPMD (single program,multiple data) type programs that are data-parallel. Most of the commercial multicomputers provide low level machine-dependent environments [Lim88, 1im87, Inc9l] thatcan be used to achieve high performance. These environments provide very low levelprogramming abstractions that makes program design a complex process. This leadsto higher software development costs and programs that are not easily ported to othermachines.Difficulties in parallel programming do not end with the design and development ofa working parallel program. The primary motivation for using parallel computers is toobtain higher performance for application programs. In general, existing programmingenvironments do not provide any guarantee of performance, moreover they provide littlesupport for performance evaluation and tuning. In the case of parallel systems, performance depends on the computation and communication characteristics of a parallelprogram in addition to the characteristics of the hardware and software system. Usersgenerally have very little knowledge about the performance of their programs until theyChapter 1. Introduction 3are implemented and run. Even though one may think that using more processors willimprove performance, this is not always the case. Simple models [5to88] have shownthat using more than a certain number of processors for a given application may notimprove performance. In practice, it may actually degrade performance.1.2 MethodologyProviding abstractions that are efficient and easy-to-use for programming multicomputers is a difficult problem. One recent approach to reconciling ease of use and reuse withperformance is the construction of software components (libraries, modules, templates,skeletons) based on the parallel programming paradigms that have appeared in the literature [K+87, Nel87, Co189, Pri9O]. These paradigms, taken together, represent the stateof the art of parallel programming. Software components based on these paradigms canhide the complex distributed system code needed to implement the paradigm, therebyallowing the application programmer to concentrate on the computationally intensivecode rather than parallelization and the coordination of the processors. Several projectssuch as Chameleon [A1v90], PIE [RSV9O] and VMPP [Gab9O] have looked at providingprogramming support for some of these paradigms on shared-memory machines.It is difficult to obtain a single performance model that can be used for all applications on a parallel system. Performance depends on the computation and communication characteristics of the algorithm, in addition to the characteristics of thehardware and software system. There are some performance metrics, such as Amdahl’sserial fraction [Amd67], experimentally determined serial fraction [KF9O] and averageparallelism [EZL89], which are based on simple characterizations of parallel systems.Although, they can be used to obtain rough bounds on performance, it is not as easyto use them for performance prediction and tuning. One approach to obtaining moreaccurate models that could be used for prediction and tuning is to model simpler andmore restricted systems. Parallel programming paradigms are more restricted and sufficiently general to be of more general use. Models based on paradigms can take intoaccount the computation and communication characteristics of the applications and alsothe characteristics of the hardware and software system.Chapter 1. Introduction 4Abstraction and added functionality that diminishes performance leads to constantre-design and re-implementation of the software component. Therefore, it is necessaryto formalize these paradigms to better understand their expressiveness, their limitations,and most importantly their performance characteristics. It is important to understandthe effect of scaling the component to execute on a larger number of processors. It mustbe possible to easily modify the behavior and performance characteristics of the component in order to take advantage of application specific optimizations (e.g., fixed versusvariable data packets). By understanding the behavior and performance characteristicsof the paradigm, it may be possible to guarantee performance and provide guidance tothe use and design of these paradigms on different topologies or systems with differentprimitives. The challenge is to construct a system based on paradigms that is reusableand achieves close to optimal performance.In this dissertation, we consider two task-oriented parallel programming paradigms,processor farm and divide-and-conquer. The processor farm paradigm is widely used inparallelizing applications in several areas [CHvdV88, BTU88, CU9O, CCT93j. Divide-and-conquer is a well-known problem solving strategy in both sequential and parallelprogramming [AHU74, HZ83, GR88, Sto87J.The principal contributions of this dissertation research are:Development of performance models for two commonly used parallel programmingparadigms: processor farm and divide-and-conquer.We have developed models that accurately describe the behavior and performancecharacteristics of processor farm and divide-and-conquer applications on multicomputers with the characteristics described in Section 3.2. These are realistic modelsthat can help in understanding the capabilities and limitations of these paradigms.These models have been experimentally validated on a transputer-based system.They provide guidance for system design, and can be used for performance prediction and tuning.• Design and development of execution kernels for processor farm and divide-andconquer applications.Chapter 1. Introduction 5Execution kernels for both processor farm and divide-and-conquer have been designed and implemented on a transputer-based machine. The systems are topologyindependent, i.e., they can be used on machines of any size and topology. Theyhave been integrated into a programming environment that includes supportingtools such as a graphical interface, mapper, loader and debugger. Several applications have been developed using these kernels.1.3 Synopsis of the DissertationChapter 2 provides an overview of the related literature that puts this dissertation workin context. It includes a discussion on various existing parallel programming approaches,highlighting their advantages and disadvantages, and comparing and contrasting ourapproach to these approaches. Existing performance measures and models for parallelsystems are reviewed, emphasizing their applicability and limitations.Chapter 3 describes the integrated approach we have taken to address the programming and performance modeling problems in multicomputers. This approach providesprogramming support based on parallel programming paradigms to the application programmers. We discuss the characteristics of processor farm and divide-and-conquerparadigms that are studied in this thesis. We describe how these paradigms can be usedto parallelize several different applications.In Chapter 4, we describe the design of Pfarm, a topology independent processor farmruntime kernel, detailing the trade-offs involved to make it efficient. Pfarm implementsa distributed dynamic task scheduling strategy. The affect of process structure, scheduling, and buffering on performance has been investigated. We have developed a generalanalytical framework that can be used to derive performance models for processor farmson an arbitrary tree topology. For a fixed topology, we have shown that a breadthfirst spanning tree provides maximum performance, and the steady-state performanceof all breadth-first spanning trees are equal. Since a processor farm system behaveslike a pipeline, we have also analyzed start-up and wind-down. The ideal architecturefor Pfarm is a balanced k-ary tree, where k is the number of links on each processor.Chapter 1. Introduction 6Performance models for this case have been derived from the general framework. Wealso describe how the models can be used in performance tuning and restructuring ofapplication programs.The Pfarm system has been implemented on a 75 node transputer-based machine.The performance models were experimentally validated as reported in Chapter 5. Themodels are sufficiently accurate that they can be used to predict performance of thisdesign on any tree topology. The robustness of the model under our assumption ofaverage task size was tested for uniform and bimodal distributions. The model wasaccurate for the uniform distribution. In the case of the bimodal distribution, we foundthat the model remained accurate as long as the arrival pattern of the two task typeswas mixed.In Chapter 6, we extend the design of Pfarm to provide runtime system support fordivide-and-conquer applications. This system, called TrEK (Tree Execution Kernel), canexecute divide-and-conquer computations of any degree and depth on an arbitrary treetopology. TrEK is designed to make use of intermediate processors for subtask processingin order to increase the overall performance. We expanded the general analytical framework given for Pfarm to derive performance models for fixed degree divide-and-conquerapplications on an arbitrary tree topology. Experimentally, we found that performancedepends on the depth and number of leaves in the tree topology. Thus, on a fixed topology, a breadth-first spanning tree with a maximum number of leaves achieves maximumperformance. With a reconfigurable network, a g-ary balanced tree, where g is the number of links on each node, provides maximum performance. We derived models thatcan predict the performance of any fixed k-ary divide-and-conquer computation on anyg-ary balanced tree topology.Chapter 7 describes the experiments conducted to validate the performance modelsfor divide-and-conquer. The experiments show that our framework performs well evenfor applications that consist of a single large divide-and-conquer task in addition tothose with a flow of tasks. In some cases, it is possible to use the processor farmstrategy for divide-and-conquer applications. We found that TrEK outperformed Pfarmfor applications with larger tasks and for those applications that consist of a smallerChapter 1. Introduction 7number of tasks.In order to make it easier for application programmers to use these paradigms on amulticomputer system, a programming environment that supports all phases of programdevelopment and execution is needed. In Chapter 8, we describe Parsec, an on-goingproject at the University of British Columbia in developing an integrated programmingenvironment for the support of paradigms. Parsec provides Pfarm and TrEK withsupporting tools such as a graphical interface, mapper, loader, monitor and debugger.We have also discussed applications that have been developed using Pfarm and TrEK.Chapter 9 provides a summary of the dissertation with a discussion of topics for futureresearch.Chapter 2Background and Related WorkThe relative ease with which it is possible to build inexpensive, high-performance multicomputers using commodity microprocessors has made multicomputers very popular.The major problem with multicomputers is the difficulty in effectively programmingthem. Programmers must either use a high level programming tool that is easy to usebut provides poor performance or a machine dependent low level tool that can providehigh performance but is difficult to use.To be successful, a parallel programming environment should address both the basicissues, namely programming and performance. First, programmers should be providedwith easy-to-use programming abstractions that hide the complexities of the system.Second, the environment should be able to assist programmers in obtaining the maximumperformance on a given parallel architecture for their applications.In this chapter, we provide a overview of the related literature. Section 2.1 describesvarious parallel programming approaches, emphasizing their advantages and disadvantages. In Section 2.2, a review of the literature on existing performance measures andmodels for parallel computing is provided.2.1 Parallel Programming ApproachesIn this section, various parallel programming approaches are reviewed, highlighting theiradvantages and disadvantages, and comparing and contrasting them with our approach.8Chapter 2. Background and Related Work 9We also review the existing research on identifying parallel programming paradigms.2.1.1 High Level ApproachesParallelizing CompilersParallelizing compilers are aimed at extracting loop level and lower levels of parallelismin a sequential program. Considerable research work is being done in developing compilers that automatically parallelize FORTRAN DO loops [PW86, CK88, HKT91J. Theprogrammers write sequential programs in standard FORTRAN, and the compiler analyzes data dependencies and uses parallelizing and vectorizing constructs to optimizethe program for a given parallel hardware.With automatic parallelizing compilers, users need not be concerned with writingexplicitly parallel code. In some cases, users can provide compiler directives for programpartitioning and mapping. Compilers generally perform local optimizations which maynot always lead to an overall improvement in performance. They have to use conservativevalues for data unknown at compile time. Parallelizing compilers have been successfullyused on multicomputers for certain classes of problems such as SPMD (Single ProgramMultiple Data) programs.In comparison, we have studied task-oriented paradigms on multicompnters. Ourapproach concentrates on global optimizations that lead to overall improvement in performance of application programs. This is done by considering classes of applicationsseparately, and identifying their characteristics to decide on the necessary global optimizations to efficiently run them on a particular hardware system.High-Level LanguagesThere is a group of researchers that advocate the use of high-level languages based onfunctional, logical and data-flow models of computation [Kog85, Dav85, Den8O]. Usingthese languages, the programmer needs only to write a high-level declarative descriptionof the algorithm, which is free of concurrency. The compiler and the runtime systemproduce code suitable for a specific parallel system.Chapter 2. Background and Related Work 10The advantage of being able to write programs in a very high level is generally outweighed by the resnlting poor performance. It is very unlikely that the standard implementation decisions used by the compiler will be optimal for all situations. Programmerswho understand the specific structure of their algorithms can always do better optimizations than the generalized transformations included in a compiler. In our approach,different classes of applications are considered separately, and good optimizations foreach of them are obtained by understanding the structure of the underlying algorithms.2.1.2 Low level ApproachesIn multicomputers such as the transputer-based [1im87] and C40-based [Inc9l, 1nc92jmachines, the user is totally responsible for implementing parallelism. The programming environments provided in these cases consist of languages such as occam [Lim84]or extended C that provide process creation and inter-process message passing. The programmer is responsible for partitioning the work into processes and mapping them toexploit the parallelism provided by the hardware. The programmer also has to managecommunication between processes.It is possible to extract maximum performance out of the system if the programmerhas good knowledge of the underlying hardware architecture and how well it can beused for a given application program. Even though high performance is achievable,it is difficult since the programming environments provide minimal support for thesemachines.In our approach, the programming system provided to the users efficiently implements parallelism on a given machine and manages communication among processors.The user has to concentrate only on the application dependent sequential code.2.1.3 Other ApproachesThe Linda project advocates the use of a coordination language such as Linda in conjunction with computational languages like C or FORTRAN to make parallel programmingeasier [GC92]. A coordination language provides operations to create multiple executionthreads (or processes) and supports communication among them. Linda[CG89, ACG91]Chapter 2. Background and Related Work 11consists of a few simple tuple-space operations. Adding these tuple-space operationsto a computational language produces a parallel programming dialect such as C-Linda.In this case, the programmers are responsible for creating and coordinating multipleexecution threads. Linda provides a shared memory paradigm independent of the underlying parallel architecture. The processes can communicate and synchronize usingthe tuple-space, which is in fact a shared memory.Implementation of the Linda tuple-space on a distributed memory machine is generally difficult since the tuple space has to be distributed and replicated, which can leadto poor performance. With our approach, the system takes responsibility for processcreation and coordination, rather than the user.Foster and Overbeek[F090j propose an approach called bilingual programming. Inthis approach, the key idea is to construct the upper levels of an application in a high-level language, while coding the selected low-level components in low-level languages.They argue that this approach permits the advantages of the high-level notation (expressiveness, elegance, conciseness) to be obtained without the usual cost in performance.They introduce a particular bilingual approach in which the concurrent programminglanguage Strand [FT9O] is used as the high-level language and C or Fortran is used tocode low-level routines. Strand provides a high level notation for expressing concurrentcomputations.With this approach, overall performance is determined by the decisions on howto partition concurrent processes into tasks and map them onto various nodes. Theuser is responsible for partitioning and mapping, although there are some tools whichcan provide guidance. Our runtime systems take care of creatin5g concurrent processesand communicating among them. Each system efficiently implements partitioning andscheduling for a particular class of applications. The performance models can be usedfor restructuring application programs to obtain better performance.2.1.4 Parallel Programming ParadigmsThere are several well-known programming paradigms such as divide-and-conquer,branch-and-bound and dynamic programming techniques that are commonly used inChapter 2. Background and Related Work 12designing sequential algorithms. These paradigms are not exactly algorithms, but theyare problem solving techniques or high level methodologies that are common to manyefficient algorithms.We can find similar problem solving techniques that are commonly being used indesigning parallel algorithms. Identifying and analyzing useful parallel programmingparadigms will help the programmer in understanding parallel computation and in thedifficult process of designing and developing efficient parallel algorithms.In general, programming paradigms encapsulate data reference patterns. In thecase of parallel programming paradigms, they encapsulate underlying communicationpatterns. Since they identify useful communication patterns, they can help in designingarchitectures that can effectively support commonly used communication patterns. Theanalysis of these paradigms can provide guidelines for designing programming tools thatcan assist application programmers in obtaining better performance on a given parallelmachine.The following paragraphs summarize the related research on identifying and understanding useful parallel programming paradigms.In 1989, Kung et. al. [Kun89j identified several computational models based ontheir experiences in parallel algorithm design and parallel architecture development.These models characterize the interprocessor commnnication and correspond to differentways in which cells in 1D processor arrays exchange their intermediate results duringcomputation. The models are:1. Local computation 6. Recursive computation2. Domain partition 7. Divide-and-conquer3. Pipeline 8. Query processing4. Multi-function pipeline 9. Task Queue5. RingFox [FJL88] and Karp [Kar87J have discussed SPMD paradigm for programmingshared and distributed memory multicomputers. In the SPMD model, the same programis executed on all the processors. Processors communicate their intermediate results toChapter 2. Background and Related Work 13their neighbors and synchronize at a barrier point. Fox and others[FJL+88] have successfully used the SPMD model to solve a number of large applications on multicomputers.Nelson [Ne187] has studied compute-aggregate-broadcast, divide-and-conquer,pipelining and reduction paradigms for distributed memory parallel computers. Hehas discussed how these paradigms can be used to develop algorithms for solving manynumerical and non-numerical applications. He has also studied the contraction problem,the problem arising when an algorithm requires more processors than are available on amachine, for algorithms based on these paradigms.Cole{Co189] advocates an approach in which the users are presented with a selectionof “Algorithmic Skeletons” instead of an universal programming interface. Each skeletoncaptures the essential structure of some particular problem solving style or technique.To solve a particular problem, the user is required to select a skeleton which describesa suitable problem solving method. The procedures and data structures are added tothe skeleton to customize it to the specific problem. Since each instance of these procedures will be executed sequentially, they can be specified in any sequential programminglanguage. He has discussed four different skeletons - Divide-and-conquer, Task Queue,Iterative Combination and Cluster. He proposes to embed suitable topologies for various skeletons on a grid architecture. In terms of performance, he has focussed on theasymptotic efficiency with which a large grid of processors can implement a system withrespect to the performance of a single processor.Pritchard and Hey [Pri9O, Hey9O] discuss three useful paradigms for programmingtransputer arrays: Processor Farm, Geometric Array and Algorithmic Pipes. ProcessorFarm uses a manager-workers setup to solve an application that consists of a largenumber of independent tasks. Geometric Array is same as the SPMD model mentionedearlier and Algorithmic Pipes is similar to the pipeline approach.PIE project [RSV9O] uses parallel programming paradigms as an intermediate layerof abstraction, called implementation machine (IM) level, between the application leveland the physical machine level for uniform memory access multiprocessors. Each TM hastwo representations: an analytical representation and a pragmatic representation. Theanalytical representation helps in predicting the performance of a class of applicationsChapter 2. Background and Related Work 14using the TM. The model predicts the upper bound and lower bound on performance ofan application that uses this TM. A pragmatic representation of TMs is made availablein the form of modifiable templates. All necessary communication and synchronizationfor the TM are correctly and efficiently implemented in the template. All the user needsto do is to insert the application dependent code. With the help of the TM layer, theuser can write performance efficient parallel programs with relative ease. The aualyticalmodels help the user to select the most appropriate or efficient TM for a given applicationand parallel machine. Two TMs, master-slave and pipeline, have been implemented on aEncore Multirnax, a bus-based shared memory multiprocessor.In this thesis, we explore an approach in which users are provided with programming support based on parallel programming paradigms for multicomputers. This approach is similar to Cole’s proposal of Algorithmic Skeletons. In contrast to Cole’stheoretical study of how various skeletons can be implemented on a grid architecture,we have implemented runtime systems for two widely used paradigms, Processor Farmand Divide-and-Conquer, that are topology independent. Performance models derivedin this thesis are analytical models, unlike Cole’s asymptotic models, and hence can beused in performance tuning. Our approach is similar to that followed by PIE. It differsin the underlying parallel architectures being considered and the apparent fact that wecan obtain accurate models on multicomputers.2.2 Performance ModelingIn the case of sequential computation, performance can be adequately characterized bythe instruction rate of the single processor and the execution time requirement of thesoftware on a processor of unit rate. Predicting the performance of a parallel algorithmon a parallel architecture is more complex. Performance depends on the computationand communication characteristics of the algorithm, in addition to the characteristics ofthe hardware and software system.In order to use parallel systems effectively, it is important to understand the performance of parallel algorithms on parallel architectures. This can help in determiningChapter 2. Background and Related Work 15the most suitable architecture for a given algorithm. It can also help in predicting themaximum performance gain which can be achieved. In this section, we summarize therelevant research in understanding the performance behavior of parallel systems andhighlight the applicability and limitations of each.2.2.1 Performance Measures and ModelsIt is a well known fact that the speedup for a fixed-size problem on a given parallelarchitecture does not continue to increase with an increasing number of processors, buttends to saturate or peak at a certain value.In 1967, Amdahl [Amd67j argued that if s is the serial fraction in an algorithm, thenthe speedup obtainable is bounded by 1/s even when an infinite number of processorsare used. For an N processor system, speedup is given by1s + (1 — s)/NThis observation, which is generally known as Amdahl’s Law, has been used to argueagainst the viability of massively parallel systems.In the recent years, researchers have realized that it is possible to obtain near-linearspeedup by executing large problems. In 1988, Gustafson and others at Sandia NationalLab [Gus88] were able to obtain near-linear speedup on a 1024-processor system byscaling up the problem size. Gustafson argues that in practice, users increase the problemsize when a powerful processor is made available; hence, it may be more realistic toassume runtime as constant instead of problem size. He introduced a new measurecalled scaled speedup, defined as the speedup that can be achieved when the problemsize is increased linearly with the number of processors. For an N processor system,scaled speedup is given by N + (1 — N) x s.Karp and Flatt [KF9O] have used experimentally determined serial fraction as a metricin tuning performance. The experimentally determined serial fraction, f is defined as1/S — 1/N1—1/Nwhere S is the speedup obtained on an N-processor system. If the loss in speedup isonly due to the serial component, that is, there are no other overheads, the value of f isChapter 2. Background and Related Work 16exactly equal to the serial fraction s used in Amdahl’s law. With the help of experimentalresults, they argue that this measure provides more information about the performanceof a parallel system. If f increases with N, then it is considered an indicator of risingcommunication and synchronization overheads. An irregular change in f as N increaseswould indicate load balancing problems.Eager, Zahorajan and Lazowska[EZL89J use a simple measure called average parallelism to characterize the behavior of a parallel software system. The software system isrepresented by an acyclic directed graph of subtasks with precedence constraints amongthem. Average parallelism is defined as the average number of processors that are busyduring the execution time of the software system, given an unbounded number of available processors. Once the average parallelism A is determined, either analytically orexperimentally, the lower bounds on speedup and efficiency are given byNA A(N+A-1) and (N+A-1)respectively. This measure can be used only if the parallel system does not incur anycommunication overheads or whenever these overheads can be easily included as part ofthe tasks.Kumar and Rao[KR87] have developed a scalability measure called the isoefficiencyfunction, which relates the problem size to the number of processors necessary for anincrease in speedup proportional to the number of processors used. When a parallelsystem is used to solve a fixed-size problem, the efficiency starts decreasing with an increase in the number of processors as the overheads increase. For many parallel systems,for a fixed number of processors, if the problem size is increased then the efficiency increases because the overhead grows more slowly than the problem size. For these parallelsystems, it is possible to maintain efficiency at a desired value (between 0 and 1) foran increasing number of processors, provided the problem size is also increased. Thesesystems are considered to be scalable parallel systems. For a given parallel algorithm,for different parallel architectures, the problem size may have to increase at differentrates with respect to the number of processors in order to maintain a fixed efficiency.The rate at which the problem size is required to grow with respect to the number ofprocessors to keep the efficiency fixed essentially determines the degree of scalability ofChapter 2. Background and Related Work 17the parallel algorithm for a specific architecture. If the problem size needs to grow asfz(P) to maintain an efficiency E, then fE(p) is defined as the isoefficiency function forefficiency E.If the problem size is required to grow exponentially with respect to the number ofprocessors, then the algorithm-architecture combination is poorly scalable since it needsenormously large problems to obtain good speedups for a larger number of processors.On the other hand, if the problem size needs to grow only linearly with respect to thenumber of processors, then the algorithm-architecture combination is highly scalable.Isoefficiency analysis has been used in characterizing the scalability of a variety of parallel algorithm-architecture combinations [GK92]. Using isoefficiency analysis, one canpredict the performance of a parallel program on a larger number of processors aftertesting the performance on a smaller number of processors.Stone[Sto88] has used a simple model to determine how granularity affects thespeedup on a multiprocessor. The model considers an application program that consistsof M tasks and obtains the maximum speed with which this program can be executedon an N processor system. It assumes that each task executes in T units of time.Each task communicates with every other task at an overhead cost of T units of timewhen the communicating tasks are not on the same processor, and at no cost whenthe communicating tasks are on the same processor. The results of this model indicatethat the speedup is proportional to N up to a certain point. After this, as N increases,the speedup approaches a constant asymptote which can be expressed as a function ofthe task granularity. This model gives a general picture of how granularity and overhead affect the performance of a multiprocessor system. It also gives some indicationof the importance of minimizing overhead and selecting a suitable granularity. Stone’sstudies indicate that there is some maximum number of processors that is cost-effective,and this number depends largely on the architecture of the machine, on the underlyingcommunication technology, and on the characteristics of each specific application.Flatt and Kennedy[FK89] have derived some upper bounds on the performance ofparallel systems taking into account the effect of synchronization and communicationoverheads. They show that if the overhead function satisfies certain assumptions, thenChapter 2. Background and Related Work 18there exists a unique value N0 of the number of processors for which the total executiontime for a given problem size is minimum. However, for this value, the efficiency ofthe system is poor. Hence they recommend that N should be chosen to maximize theproduct of speedup and efficiency and analytically compute the optimal values of N. Amajor assumption in their analysis is that the per-processor overhead grows faster than0(N), which limits the applicability of the analysis.Performance metrics such as serial fraction (s and f) and average parallelism (A) aresimple measures that can be used to obtain rough bounds on performance. These cannotbe easily used for performance prediction and tuning, especially for multicomputers inpart because they neglect communication overheads. Also, the values of these parametersoften cannot be obtained easily. We have concentrated on considering the characteristicsof both the system and the applications in order to obtain accurate performance modelsthat can be used for performance prediction and tuning. Our models take into account allthe communication overheads involved in implementing different classes of applicationson multicomputers. The values of the parameters used in our models can be determinedin a relatively easy manner, and we discuss some of the techniques for obtaining them.2.2.2 IntegratiotiThere has been very little work done in integrating performance tuning into programming environments to provide performance-efficient parallel programming. To makebest use of the underlying parallel architecture for an application program, in addition to programming support, users must be provided with performance models thatcan help in predicting how well their programs are going to perform. The environmentshould be able to assist the programmers in restructuring their applications to improveperformance.PIE[SR85] addresses the issues of integrating performance tuning into the programming environment through the support of specific implementation paradigms coupledwith a performance prediction model. The model provides performance trade-off information for parallel process decomposition, communication, and data partitioning in thecontext of a specific implementation paradigm and a specific parallel architecture.Chapter 2. Background and Related Work 19Gannon[AG89] describes an interactive performance prediction tool that can be usedby the user to predict execution times for different sections of a program. This performance predictor analyzes FORTRAN programs parallelized by an automatic parallelizing and vectorizing compiler targeted for the Alliant FX/8. Programmers can use thepredictor to estimate the execution time for a segment of the code produced by thecompiler. The predictor uses a database to estimate the total number of CPU cyclesneeded for the segment. They also incorporate a simple model of memory contentioninto the predictor to include the effects of caching. This predictor can be used only topredict the execution time of a segment of the program; it does not give any specificinformation about the overall performance of a parallel program.Kennedy and Fox[BFKK91] have worked on an experimental performance estimatorfor statically evaluating the relative efficiency of different data partitioning schemes forany given program on any given distributed memory multiprocessor. The performanceestimator is aimed at predicting the performance of a program with given communicationcalls under a given data partitioning scheme. This system is not based on a performancemodel. Instead, it employs the notion of a training set of kernel routines that testvarious primitive computational operations and communication patterns on the targetmachine. The results are used to train the performance estimator for that machine. Thistraining set procedure needs to be done only once for each target machine, during theenvironment or compiler installation time. Although the use of a training set simplifiesthe task of performance estimation significantly, its complexity lies in the design of thetraining set program, which must be able to generate a variety of computation anddata movement patterns to extract the effect of the hardware/software characteristicsof the target machine on the performance. The authors argue that real applicationsrarely show random data movement patterns and there is often an inherent regularityin their behavior. They believe that their training set program will probably give fairlyaccurate estimates for a large number of real applications, even though it tests only asmall (regular) subset of all the possible communication patterns.We have integrated the programming and performance tuning support into Parsec(described in Chapter 8). Parsec is an on-going project at the University of BritishChapter 2. Background and Related Work 20Columbia which is aimed at developing an integrated programming environment to support several parallel programming paradigms. It includes supporting tools such as agraphical interface, mapper, loader and debugger. The programmers can make use ofperformance models to predict the performance of their applications, and also can obtain optimal values for system parameters such as the number of nodes and topologythat can lead to maximum performance. The environment allows programmers to easilychange the parameters and accordingly does the necessary mapping and loading. Someof the techniques that can be used to determine the values of the application dependentparameters are described in Chapter 8.Chapter 3MethodologyThe diversity of parallel computing architectures and their underlying computation models makes it particularly difficult to find universal techniques for developing efficientparallel programs. Choosing an appropriate parallel machine for a given applicationis a difficult process. Furthermore, in most of the existing programming environmentsavailable on multicomputers, the user is responsible for managing both parallelism andcommunication. As explained in Chapter 2, identifying and analyzing useful parallelprogramming paradigms may help programmers in the difficult process of developingefficient parallel algorithms. In this chapter, we present an approach based on parallelprogramming paradigms for developing efficient programs for multicomputers.Section 3.1 describes an integrated approach we have taken to address the programming and performance modeling problems on multicomputers. In Section 3.2, we explain the multicomputer system model used in this dissertation. Section 3.3 describesthe transputer-based multicomputer system that is used as an experimental testbed inthis research. This approach has been used to develop programming support and performance models for two commonly used parallel programming paradigms, processorfarm and divide-and-conquer. In Section 3.4, we discuss the characteristics of these twotask-oriented programming paradigms and how these paradigms can be used for variouskinds of applications.21Chapter 3. Methodology 223.1 An Integrated ApproachThis approach provides application programmers with abstractions based on commonlyused parallel programming paradigms. The application programmers are provided witha set of Virtual Machines (VMs), where each virtual machine corresponds to a parallelprogramming paradigm. Each virtual machine consists of an analytical performancemodel, and an efficient runtime system that can be used to run applications that fit intothe corresponding paradigm. The user has to choose one of the virtual machines thatcorresponds to the paradigm that can be used to solve his application problem.With this approach, the users are not responsible for implementing parallelism andcommunication. Each runtime system implements parallelism and all the necessarycommunication and synchronization needed for running the corresponding class of applications in an efficient manner. It also implements other system dependent aspectssuch as task scheduling and load balancing. Such a runtime system can be efficientlyimplemented by a systems programmer who understands the complexities of the underlying hardware and software system. The runtime system provides a simple interfaceto the user. The user has to write only the application dependent code and executeit with the runtime system. This approach eliminates the difficulties in programmingmulticomputers and reduces the software development cost. Also, as the user code doesnot contain any system dependent parts, it is portable across different machines andsystems on which the virtual machine implementations are available.The analytical performance model helps in predicting the performance of applicationprograms that use the particular virtual machine. Some of the parameters of the modelare application program dependent and the others are dependent on the characteristicsof the underlying physical machine and software system. The values of the systemdependent parameters can be estimated once the rnntime systems are implemented.The application dependent parameters are either estimated or measured (either from aserial program or from a scaled down parallel program). Once these parameter valuesare known, the model can be used to predict the actual performance of an applicationprogram on a given parallel system. It can also be used in performance tuning, either toChapter 3. Methodology 23choose the optimal number of nodes to be used for a given application or to restructurethe application to maximize its performance.Two virtual machines corresponding to commonly used parallel programmingparadigms, processor farm and divide-and-conquer, were developed. In Section 3.4,we describe the characteristics of these two task-oriented paradigms.3.2 System ModelIn this thesis, we consider distributed memory parallel computers (multicomputers).The system consists of processor nodes connected by an interconnection network suchas a chain, tree, mesh, hypercube etc. with the underlying support for point-to-pointcommunication. Each processor node consists of a CPU with its own local memory andhardware support for communication links.Execution kernel designs are based on the following assumptions on the characteristics of the underlying system. The kernels assume a reliable point-to-point communication mechanism. Furthermore, it is assumed that the data can be transferred simultaneously on all the links and that data transfer can be overlapped with the computation.However, for each message, there is a CPU start-up cost that cannot be overlapped withthe computation. This time may include hardware set-up costs, context switch times,and other system software overheads. Message start-up is an important overhead thatsignificantly affects performance. In addition, it is assumed that the system supportsconcurrent processes on each processor with process scheduling and levels of priorities.This support could be available either in hardware (as in nhansputers) or in software (asin TI C40s).Performance models are developed assuming homogeneous processors and links. Alinear cost communication model is assumed (i.e., for every message, there is a CPU costfor starting the communication, and a transfer cost proportional to the message size).The time for a processor node to send a message of length n-i to its neighbor is given by+ rm, where j3 is the CPU start-up cost described in the previous paragraph and r isthe transfer rate of the communication links.Chapter 3. Methodology 243.3 Experimental TestbedIn this section, we describe the transputer-based multicomputer in the Department ofComputer Science at the University of British Columbia that is used as an experimentaltestbed in this thesis. Performance models derived in Chapter 4 and 6 for processor farmand divide-and-conquer are applicable for any multicomputer system that satisfies thesystem model described in Section 3.2. In addition, the corresponding runtime systemdesigns can be implemented on any similar multicomputer system.3.3.1 Hardware SystemThe transputer-based multicomputer consists of 75 T800 transputer nodes and 10 crossbar switches. The system is hosted by a Sun-4 workstation via VME bus, with 4 portsthat connect the system to the host. There are 64 nodes with 1 MB of external memoryand 10 nodes with 2 MB. There is a special node that has 16 MB, and is used as themanager node for both Pfarm and TrEK.The INMOS T800 transputer is a 32 bit microprocessor with a 64 bit floating pointunit on chip. A T800 running at 20MHz has a sustained processing rate of 10 MIPSand 1.5 Mflops. It has 4 KB of on-chip RAM and four bit-serial communication links.These communication links allow networks of transputer nodes to be constructed bydirect point to point connections with no external logic. Each link runs at an operatingspeed of 20 Mbits/sec and can transfer data bidirectionally at up to 2.35 MB/sec.The T800 processor has a microcoded scheduler that enables any number of concurrent processes to be executed together, sharing the processor time. It supports two levelsof priority for the processes: high and low. A high priority process, once selected, runsuntil it has to wait for a communication, a timer input or until completion. If no processwith a high priority is ready to proceed, then one of the ready low priority processes isselected. Low priority processes are time sliced every millisecond. A low priority processis only permitted to run for a maximum of two time slices before the processor deschedules it at the next descheduling point. Process context switch times are less than 1 s,as little state needs to be saved. The transputer has two 32 bit timer clocks. One timerChapter 3. Methodology 25is accessible only to high priority processes and is incremented every microsecond, andthe other is accessible only to low priority processes and is incremented every 64 /ts.Each C004 crossbar switch has 32 data links and a control link. These are programmable through the control link and each can have 16 bidirectional connections.Thus, the system is reconfigurable and an appropriate interconnection network for anapplication can be chosen. As the system is not fully connected, there are some restrictions on the possible configurations. For Pfarm and TrEK, we statically configure thesystem into an appropriate interconnection network.3.3.2 Software EnvironmentsPfarm and TrEK have been implemented using C on two different software environments that are available on our multicomputer: Logical Systems and Trollius. TheLogical Systems environment [Moc88, Sys9O] includes a library of process creation andcommunication functions. The environment has an utility called id-net that runs on thehost and downloads the executable programs onto a network of transputers.In the Trollius environment [BBFB89, F+90], the programs are run on top of a kernelon each transputer node. The Trollius kernel manages and synchronizes any number oflocal processes. There are two levels of message passing in Trollius. The kernel levelallows communication between processes on the same node. The network level allowscommunication between processes on different nodes, as well as between processes onthe same node. There are four sub-levels of message passing within the Trollius networklevel, representing different functionality/overhead compromises. They are, in order ofincreasing functionality and overhead, the physical, datalink, network and transportsub-levels. Trollius provides both blocking and non-blocking communication functions.It also includes a set of utility programs that run on the host, which can be used to load,monitor and control the programs on transputer networks.Chapter 3. Methodology 263.4 Task-oriented ParadigmsIn this section, we describe the characteristics of two task-oriented parallel programmingparadigms, processor farm and divide-and-conquer, that are considered in this thesis.3.4.1 Processor FarmProcessor farm is one of the most widely used strategies for parallelizing applications.There are a large number of scientific and engineering applications that consist of repeated execution of the same code, with different initial data. In addition, there islittle or no dependency among the different executions of this code. These different executions can be considered as a set of tasks and can be executed in parallel.When we use multiple processors to execute these tasks in parallel, even though allthe processors are executing the same code, they may not be executing the same instruction at any given time as they execute different parts of the code depending onthe initial data. Therefore, it is not possible to use SIMD (Single Instruction Multiple Data) parallel machines to run these applications. These kind of applications canbe run efficiently on MIMD (Multiple Instruction Multiple Data) parallel computers.Even though shared memory MIMD machines can be used for this class of applications {A1v90], distributed memory MIMD machines are being widely used because theyare scalable and cost-effective. Processor farms are described in many transputer programming books [Lim88, Ga190, Cok9l]. This strategy has been used in parallelizingapplications in several areas [CHvdV88, BTU88, CU9O, SS91, CCT93, NRFF93].The typical setup to run these applications is to use a “farm” of worker processorsthat receive tasks from, and return results to, a manager processor. Each worker processor runs the same program and depending on the data received for individual tasks,it may execute the same or different parts of the program. No communication is required among the worker processors to execute the tasks. The manager processor hasto communicate with the worker processors to distribute the tasks and to receive theresults.The ideal architecture for this “manager-workers” setup is one in which each workerChapter 3. Methodology 27Figure 3.1: Ideal Manager-Workers Architectureprocessor is directly connected to the manager processor as shown in Figure 3.1. Inpractice, this can not be realized as the processor nodes in distributed memory machines generally have a fixed degree of connectivity (e.g., in the transputer case, eachprocessor has four links and a network of transputers can be configured using crossbarlinks). It is possible to use crossbar switches to dynamically reconfigure the connectionbetween the manager and the worker processors such that every worker will be connectedto the manager directly for a certain period of time [Hom93]. The need for special hardware in addition to not being scalable makes this configuration less useful. Also, manycommercial machines use a fixed interconnection network such as mesh or hypercubefor interconnecting their processor nodes. Thus, one has to use an interconnection network in implementing processor farm applications, and the topology has to be chosenwith a view of minimizing the communication costs in distributing tasks and collectingresults. Since tree architectures that have a minimum possible number of hops to eachworker from the manager incur minimum communication costs, they are best suited forimplementing processor farm applications.There are several variants of processor farm that increase in applicability. In thegeneric description of a processor farm, the application program consists of a numberof independent tasks that have to be executed on a network of worker processors. Allthe tasks may be of the same type in which case the workers execute the same codeChapter 3. Methodology 28or different parts of the code depending on the initial data. The computation timerequirements might vary from task to task depending on their data. The manager mighthave all the tasks readily available to start with, or there could be some computationinvolved in producing these tasks. Also, in the case of real-time applications, the datafor individual tasks arrive in real time.Processor farms can also be used to execute applications that consist of multipletypes of tasks. In this case, the worker processors have to be loaded with the codeneeded to execute each type of task with the appropriate code chosen, according to itstype at runtime. Cramb and Upstill [CU9OJ discuss a processor farm implementation ofsuch an application.It is also possible to use processor farms to execute application programs that consistof multiple phases of computation [Son93]. In these applications, the tasks might havesome dependency. After a phase of computation, a new set of tasks for the next phaseare generated based on the results of the current phase. Also, some applications mightconsist of a number of distinct phases of computations out of which some phases could beexecuted efficiently using a processor farm [BTU88]. In this case, the processor farm actsas a computational server that is called at different points in the application program toexecute a set of tasks.Most of the processor farm designs discussed in the literature are overly simplisticand hide the issues that affect their performance and reuse. Poor reuse of code isa major problem in parallel programming and this remains the case in processor farmapplications. There is very little work done in understanding how the various parameterssuch as the hardware topology, computation and communication requirements of thetasks affect overall performance.In addition to making it reusable and topology independent, the Pfarm system described in Chapter 4 was designed considering all the factors that affect the performance.Performance models were derived from these realistic implementations of Ffarm. Themodels have been experimentally validated and are found to be accurate. Thus, theycan be used in performance tuning to maximize performance.Chapter 3. Methodology 293.4.2 Divide-and-ConquerDivide-and-conquer is a well-known problem solving strategy used in deriving efficientalgorithms for solving a wide variety of problems on sequential machines. Efficientdivide-and-conquer algorithms have been used for solving problems in several areas suchas graph theory, sorting, searching, computational geometry, Fourier transforms andmatrix computations [AHU74, HU79, Sed83, Ben8O]. It is also a useful strategy inhardware design as mentioned by Uliman in [U1184].The divide-and-conquer strategy can be briefly stated as follows: A large instanceof a problem is divided into two or more smaller instances of the same problem. Theresults of smaller instances, called sub-problems, are combined to obtain the final resultfor the original problem. Sub-problems are recursively divided until they are indivisibleand can be solved by a non-recursive method.On uniprocessor systems, after dividing the original problem, sub-problems are solvedsequentially. Sequential divide-and-conquer results in a tree structure of sub-problems.Several researchers have discussed the usefulness of the divide-and-conquer strategy inparallel processing [HZ83, GM85, GR88, Sto87]. On parallel systems, sub-problems canbe solved concurrently provided that the system has sufficient parallelism. Problemsplitting and combining of results can also make use of the available parallelism. Theseoperations require interprocessor communication for distributing the data and for receiving the corresponding results. As the sub-problems are independent, there is no needfor communication among the processors working on different sub-problems.In its most general setting, a divide-and-conquer algorithm can be described as adynamically growing tree structured computation, where initially there is a single problem and sub-problems are created as the problem is recursively divided. The number ofsubproblems and the depth of the tree may depend on the data and thus known only atruntime. For example, evaluation trees of functional and logic programs have the abovecharacteristics. In the case of applications such as matrix multiplication and FFT, thedegree of division and the depth of the computation tree is fixed. In some applicationssuch as Quicksort, the problems are not equally divided.Chapter 3. Methodology 30There are real-time applications in areas such as vision and image processing, in whichthere is a continuous stream of real-time data and each data set has to be processed witha divide-and-conquer algorithm. Also, there are some non real-time applications in areassuch as numerical analysis that consist of a set of problems, each of which can be solvedusing a divide-and-conquer algorithm.In chapter 6, we describe the design of a runtime kernel called TrEK(Tree ExecutionKernel) that provides runtime system support for divide-and-conquer applications onmulticomputers. TrEK is designed such that it can execute divide-and-conquer computations of any degree and depth on an arbitrary tree topology. To improve the overallperformance, TrEK makes use of the intermediate processors to process subproblemsin addition to splitting and joining. The task-oriented framework chosen here assumesthat there is a flow of divide-and-conquer tasks. This framework is well suited for applications that consist of a number of divide-and-conquer tasks. It can also be used forapplications in which the problem consists of a single divide-and-conquer computationtree with large degree and depth. In this case, when we run such an application on aprocessor tree of smaller degree, the flow of tasks approximate the computation enteringthe subtrees. This framework allows us to derive performance models that accuratelydescribe the behavior and performance characteristics of TrEK.In the following paragraphs, we discuss how this work contrasts with the relatedwork in using divide-and-conquer for parallel processing.Horowitz and Zorat [HZ83] have outlined how appropriately designed multiprocessors whose logical structure reflect the tree structure of divide-and-conquer can be usedto efficiently execute divide-and-conquer algorithms. They discuss the data movementproblem in hardware configurations that have local memory, global memory via commonbus and global memory augmented by local caches. Their proposal of a local memoryarchitecture consists of processors with local memory connected as a full k-ary tree forexecuting k-ary divide-and-conquer problems. In contrast with their model, TrEK makesuse of intermediate processors in addition to leaf processors for processing of subproblemsto improve overall performance. Also, TrEK can execute divide-and-conquer problemsof any degree and depth on distributed memory machines with any given topology.Chapter 3. Methodology 31Stout [5to87] has discussed the usefulness of the divide-and-conquer strategy ou distributed memory parallel computers for solving image processing problems. He haspresented a divide-and-conquer algorithm for the connected components problem. Hehas analyzed some of the requirements of this problem, and outlined some of the implications for machine architectures and software. Nelson [Nel87] has studied how the divide-and-conquer paradigm can be used in designing parallel algorithms. He has discussedand presented parallel algorithms based on sequential divide-and-conquer algorithms forBatcher’s Bitonic Sort, Matrix Multiplication, and Fast Fourier Transform(FFT) problems. Contraction of these algorithms on a binary n-cube show that different algorithmsrequired different contractions to obtain good results. Our work focuses on developingefficient programming support for divide-and-conquer applications on multicomputersto hide the underlying complexities of the parallel machine from the programmer.McBurney and Sleep [M588] have done experimental work on implementing divide-and-conquer algorithms on a transputer-based system. Their ZAPP architecture is avirtual tree machine that is capable of dynamically mapping a process tree onto anyfixed, strongly connected network of processors. Each processor runs a ZAPP kernel thatimplements the divide-and-conquer function. Each processor performs a sequential depthfirst traversal of the process tree, constructing sub-problems. Parallelism is introduced byallowing immediate neighbor processors to steal sub-problems. It is difficult to obtainany performance model for this framework. The task distribution strategy in TrEKdiffers from that used in ZAPP. Unlike ZAPP, TrEK does not allow a node to grabsubtasks from its output queue for processing. This restriction allows us to model thesystem and does not degrade performance.Cole, in his algorithmic skeletons [Co189], has studied how well a divide-and-conquerskeleton can be implemented on a grid architecture. He proposes to use an H-tree layoutto map tree processors to mesh processors, but has not implemented the system. Interms of performance, he has focussed on the asymptotic efficiency with which a largegrid of processors can implement the system with respect to the performance of a singleprocessor. In contrast, the TrEK design can work on multicomputers with any topologyand has been implemented on a transputer-based multicomputer. Performance modelsChapter 3. Methodology 32derived in this thesis are analytical models, unlike Cole’s asymptotic models, and hencecan be used for performance tuning.3.5 Chapter SummaryIn this chapter, we have described an integrated approach for addressing programmingand performance modeling problems on multicomputers. We have discussed the characteristics of two task-oriented paradigms, processor farm and divide-and-conquer, that areconsidered in this thesis. The following chapters describe the design and implementationof runtime systems, development of performance models and their experimental validation on a 75-node transputer based multicomputer. An integrated programming environment that includes programming tools such as graphical interface, mapper, loader,monitor and debugger in addition to virtual machines would make programming thesemachines much easier. Such an environment is described in Chapter 8.Chapter 4Processor Farm: Design andModelingIn this chapter, we describe a processor farm and detail the trade-offs involved in itsdesign. We derive models that accurately describe the behavior and performance characteristics of the system. We give the limitations and assumptions on which the modelsare based and describe how the models were used in the design process. The models aresufficiently general that they can be used to predict performance of our design on anytopology. Providing independence from both size and topology while maintaining theability to tune the performance strongly supports reuse.In Section 4.1, we classify different types of processor farms and present our processor farm system, Pfarm. We compare and contrast our design with that of othersat appropriate places within this section. In Section 4.2, we derive general analyticalmodels that describe the start-up, steady-state, and wind-down phases of the executionon any tree topology. As balanced tree topologies provide maximum performance amongall topologies of the same size, we apply this modeling technique to derive expressionsfor balanced tree topologies. We close by discussing how our models can be used inperformance tuning and restructuring of application programs.33Chapter 4. Processor Farm: Design and Modeling 344.1 Pfarm: Design and ImplementationWe begin by listing some of the important design issues and goals that must be addressedwhile designing and developing a system to execute applications efficiently using theprocessor farm strategy.1. The system should fully exploit all the available parallelism in the hardware, suchas the ability to simultaneously communicate on all links.2. System overheads should be minimized.3. The system should be topology independent, that is, it should scale and run onany processor topology.4. The system should support reuse and provide an easy-to-use interface to the application programmer.In a processor farm, control may either be centralized or distributed. In the caseof centralized control, requests for work are sent to a central manager processor thatassigns the tasks. Usually, as in Hey [Hey9Oj, these requests are routed through thenetwork back to the manager. However, it is also possible to dynamically reconfigurethe links, with the use of crossbar switches, so that the manager can load and draintasks directly from each worker [Hom93]. In the case of distributed control, there is amanager on each processor. In this case, the tasks flow into the system and the localmanager either schedules an incoming task to the local worker process or forwards itto a child processor. Distributed control processor farms are common and have beendescribed by many authors (for example see Cok [Cok9lJ).Processor farms may be control driven or demand driven. A control driven schemeis useful when the work can be statically partitioned and assigned to the workers. Ademand driven scheme, however, has the advantage that it can dynamically adjust todifferent sized tasks. Also, in a distributed processor farm, only neighbor to neighborcommunication is necessary.Pfarm implements a distributed demand-driven processor farm.Chapter 4. Processor Farm: Design and Modeling 354.1.1 Process Structure and SchedulingWhen implementing a distributed demand-driveu processor farm, each worker processorconsists of at least two processes: a task mauager process and a worker process. Sinceintermediate processors in the network distribute tasks and collect results in addition toprocessing, from the performance point of view, it is important to overlap communicationwith computation. It affects the rate at which tasks can be forwarded, which, as shownin Section 4.2, limits the performance of the system. Although the processor farmimplementations that have appeared in the literature [Cok9l] mention the importanceof overlapping communication with computation, most do not fully implement it.In order to overlap communication with computation, it is necessary to have multipleprocesses on each worker processor. As a result, Pfarm has one InLink and one OutLinkprocess for each hardware link in the processor, in case of transputers, there will in totalbe 4 InLink and 4 OutLink processes. The process structure of a worker with threechildren is shown in Figure 4.1. This figure depicts a single worker in the system. Theentire system consists of a collection of these workers, organized in a tree structure,which logically corresponds to the ideal processor farm structure given in Figure 3.1.The result manager process in Figure 4.1 coordinates the collection and forwarding ofresults to the manager processor (or root) of the system. Pfarm takes advantage of thetransputer’s ability to use the links and the CPU simultaneously, and makes it possibleto overlap computation with the transfer of tasks and results. Note that there is still anon-overlapped message start-up time associated with each communication.Another important design consideration is the scheduling of local processes. In orderto keep the worker processors busy processing tasks, it is important to forward the tasksas quickly as possible. Therefore, all the processes that distribute tasks should be runat high priority. Pfarm uses the hardware scheduler provided on the transputer chip.In Pfarm, all the task communicating link processes and the task manager process arerun at high priority, whereas the worker process is run at low priority. In addition, linkprocesses that forward results and the result manager are also run at high priority. Asdescribed in Section 4.2.3, this is important whenever system throughput is bound bythe rate at which the manager receives the results. Also, this returns the results to theChapter 4. Processor Farm: Design and Modeling 36Figure 4.1: Process Structure on a Worker Processor in Pfarmmanager as quickly as possible which is important when there are dependencies amongthe tasks.4.1.2 Task SchedulingSince Pfarm distributes the control, there is a task manager process on each workerprocessor. The local manager gives priority to the local worker process while allocatingan incoming task. If the local worker process is busy, the manager attempts to forwardthe task to one of its children using a round robin strategy among the free OutLinkprocesses. The order in which the tasks are assigned to the OutLink processes dependson the underlying topology and affects the start-up time of the system. An analysisof the affect of round robin scheduling of tasks to OutLink processes during start-up isChapter 4. Processor Farm: Design and Modeling 37given in Section 4.2.In Pfarm, we initially flood-fill the system with tasks. However, once fnll, the systemis demand-driven with new tasks entering the system as tasks are completed. Thisscheme works well for start-np and steady-state bnt is not as effective for the wind-downphase. Dnring wind-down, there are no longer any incoming tasks and workers closer tothe root may idle since remaining tasks are still being forwarded towards the workersfarthest from the root.4.1.3 BufferingThe nnmber of task bnffers to be allocated to each processor is an important designissne. In Pfarm, each process in the task distribntion path shown in Fignre 4.1 can haveonly one task. This provides snfficient buffering so that a process never idles waitingfor a task. For example, when the worker process finishes, there will be another taskavailable at the manager process. If the manager process was not there and the workerreceived the task directly from the InLink, then it would have to wait whenever theInLink process was in the midst of a communication. We call the task bnffer in themanager process, an additional bnffer since it is there for synchronization purpose only.It is important to minimize the number of buffers because the wind-down time increasesprop ortiojially with the number of buffers. This occurs because more tasks end up on theworkers at the leaves of the tree, resulting in workers closer to the root idling (a completeanalysis of wind-down is given in Section 4.2.1). The number of buffers adversely affectsstart-up as well, as we show in Section 4.2.1. However, the number of buffers does notaffect steady-state performance.At any given time, each processor can be viewed to have four tasks assigned to it,one each to the task manager process, local worker process, local InLink process andOutLink process of the parent processor. Thus, in total there are 4N active tasks in anyN processor system, irrespective of the topology.We dO not restrict the number of result buffers as this does not affect the overallperformance. But, at any given time, the number of active result buffers on any processoris small as the result forwarding communication processes and the result manager areChapter 4. Processor Farm: Design and Modeling 38run at high priority.In contrast, the amount of buffering required in a centralized scheme depends on thetask size and message latency for receiving a new task. Unlike Pfarm, as N increases,message latency also increases and therefore the number of buffers per processor hasto increase. In the centralized scheme, there is also the extra overhead of sending andreceiving task request messages.4.1.4 Topology IndependencePfarm system is designed to be topology independent. For processors and links thatsatisfy the assumptions described in Section 3.2, the observations about the Pfarm designremain true, independent of the topology. Besides transputers, there are other machinessuch as iPSC hypercubes [Arl88] and TI C40 [Inc9l] with similar behavior. Moreover,as we show in the next section, accurate performance models can be derived for Pfarmon any tree topology. For a given fixed interconnection, by taking a spanning tree, it ispossible to use Pfarm and derive a model to predict its performance.4.2 Performance ModelingIn summary, as a consequence of the design, Pfarm has the following characteristics:1. The hardware system is a distributed memory message passing architecture withlinear message cost model.2. There is a continuous flow of tasks into the farm.3. All tasks originate from a single source and results are returned to the source.4. Tasks are dynamically distributed to the workers.Our objective is to find a distribution of the load to all the worker processors soas to minimize the total execution time, where load consists of both the computationalrequirements of the tasks and the associated overheads for forwarding and executing thetasks.Chapter 4. Processor Farm: Design and Modeling 39The system can be either computation bound or communication bound. When itis computation bound, the system acts as a pipeline, with three phases to be analyzed:start-up, steady-state and wind-down. The start-up phase begins when the first taskenters the system and ends when all the worker processors have received at least onetask. After start-up, the system is in steady-state where it is assumed that the processorsdo not idle. Finally, the wind-down phase begins when the last task enters the systemand ends when all the results have reached the source. The total execution time is givenby,Ttotai = Tsu+T8s+Twd (4.1)where is the start-up time, TSS is the steady-state time and Td is the wind-downtime. For a sufficiently large number of tasks, steady-state time dominates the remainingtwo phases. However, to better understand the limitations of processor farms with asmaller number of tasks, it is important to analyze start-up and wind-down. When thesystem is communication bound, the total execution time is determined by the rate atwhich either the tasks can be distributed to the worker processors or the results can bereceived from them.In Sections 4.2.1 and 4.2.2, we derive performance models for the case in which thesystem is computation bound. In Section 4.2.1, we present a general analytical framework to analyze the steady-state performance of processor farms on any tree topology.We argue that, under reasonable assumptions, Pfarm obtains optimal performance onany topology. Later in this section, upper bounds for start-up and wind-down time arealso derived. In Section 4.2.2, we derive steady state, start-up and wind-down modelsfor balanced complete trees using the general analytical framework. Balanced completetrees are interesting because they provide maximum performance among all topologiesof the same size. In Section 4.2.3, we analyze the performance of processor farms whenthey are communication bound.4.2.1 General Analytical FrameworkLet T be a tree architecture with processors p1,... ,PN. Let C(i) ={j pj is a child of p} denote the children of p in T. Let a = Te + lie be the processingChapter 4. Processor Farm: Design and Modeling 40time plus associated overhead to execute a task locally, and let /3 be the processor overhead for every task forwarded to a child processor. [3r includes all the CPU overheadsinvolved in receiving a task, forwarding it to a child processor, receiving the corresponding result and forwarding it to the parent processor. Let d and r be the average dataand result size per task, respectively, and let T be the communication rate of the links.Steady-state AnalysisThe steady-state phase begins once all the processors have a task to execute and endswhen the last task enters the system. It is assumed that no processor idles duringsteady-state; a processor is either processing a task or busy forwarding tasks and results.Suppose that M tasks are executed during this phase and let V denote the number oftasks that visit p. Then, the following condition holds for all the processors in T,T88— c(Vj— /)+/3f (4.2)jEC(i) jC(i)= + (/3 - (4.3)jEC(i)That is, the steady-state execution time (T5) equals the processing time with the associated overhead (cr) for all the tasks executed locally plus the overhead (/3) for all thetasks that were forwarded.For a fixed T8, c, and 13f, these N conditions form a system of linear equations on Nunknowns; Vi, V2, . . . VN. By ordering the equations so that the parent of a processor inT appears before its children, it is easy to see that the system is in an upper triangularform. Thus, the N equations are linearly independent and there is a unique solution.At the root of T, V = M. Furthermore it is easily verified, by back substitution,that V1 is a linear function of T8. Thus T88 can be expressed in terms of M, a and /3f,which implies that given M, a and,we can solve for T55 and V1 to VN. We can alsosolve for fj, the fraction of the M tasks executed by the ith processor,).jEC(z)Chapter 4. Processor Farm: Design and Modeling 41In addition, we can obtain the steady-state throughput, the rate at which tasks can beprocessed by the system. An example of this analysis for an arbitrary architecture isgiven in Figure 4.2.a /3f—a /3f—a V1a V2 T8a /3f—a /3f—a V3 T88a V4 T88a V5 T33a3MT88— 5a2— 6af +Figure 4.2: An example of the steady-state analysisThis analysis gives the execution time of the steady-state phase in terms of parameters that can be determined prior to execution. The total number of tasks, M, is usuallyknown. a can be estimated or measured by using the techniques that are described inChapter 8. The Pfarm system is designed such that the two overheads 13e and /3j’ areapplication and topology independent. Therefore, they need only be determined oncefor a particular implementation of Pfarm. In Section 5.1, we describe how the values ofthese overhead parameters can be determined.As explained above, we can determine T88 for a given arbitrary tree, T. However,for an arbitrary topology there remains the question of which spanning tree to use forPfarm. We show that all shortest-path, demand-driven distribution schemes with thesame overheads (/3e and /3) are equivalent. Let T58(S) be the execution time of ashortest-path, demand-driven distribution scheme S.Chapter 4. Processor Farm: Design and Modeling 42Theorem 1 For any topology G, T55(S) equals T(Pfarm(T)) where Pfarm(T) isPfarm executing on T, a breadth-first spanning tree of G rooted at the source of thetasks.Proof:Let L(i) of G denote the set of processors that are at distance i from the source ofthe tasks. Since S and Pfarm(T) are both shortest-path distribution schemes, tasksexecuted at a processor in L(i) must have been forwarded from processors in L(i—1). Letsi(S) and s(T) denote the combined throughput of all the processors in L(i) for schemeS and Pfarm(T), respectively. We claim that for all i, si(S) = s(T). It is initially truefor n, the last level, since in both schemes the processors in L(n) do not idle and canonly execute tasks. In general, by induction on the level, the fact that processors do notidle and si(S) = s(T) implies that both schemes must execute the same number of taskson processors in L(i — 1). Thus s_1(S) = s_1(T) and, in particular, se(S) so(T).Therefore for a fixed M, since the throughput for both the schemes are the same,T58(S) =T85(Pfarm(T)).0It follows from Theorem 1 that for a fixed M, there is only one value of T88 thatensures that processors do not idle. This does not exclude the possibility that thereexists some non-shortest path scheme or a scheme that introduces idle time that wouldperform better. However, since this either increases the overhead to forward a task toa worker or reduces the computational power of a processor, it would be surprising if itoutperformed the work efficient scheme we have analyzed.Start-up and Wind-down AnalysisStart-up and Wind-down costs depend on the task distribution strategy and the hardware topology. For analyzing start-up and wind-down costs, we consider the structure ofthe underlying process graph. Given a tree architecture, the process graph of the systemcan be constructed by replacing each processor by the process structure given in Fig-Chapter LProcessor Farm: Design and Modeling 43ure 4.1. We remove the processes that gather the results and only consider the processesthat are involved in task distribution, that is, the worker process, the manager processand the link processes. The process graph of the architecture given in Figure 4.3(a) isshown in Figure 4.3(b). Notice that there are two types of edges in Figure 4.3(b): edgesthat represent the inter-processor communication and the intra-processor communication.Figure 4.3: (a) node graph (b) process graph (c) subtree decompositionLet T be the process graph of an N node architecture. We will add, as part of T, aninitial OutLink process which we take as the root of T. In total, T has 4N processes oralternatively T can be viewed as consisting of N subtrees (or nodes) of the type depictedin Figure 4.3(c).Start-upStart-up begins when the first task enters the system and ends when all the processorsOutUnk0 InUnkO ManagerO Workere split(a) (b) (c)Chapter 4. Processor Farm: Design and Modeling 44have received at least one task. The duration of the start-up phase depends on how thetasks are distributed to the processors. As mentioned in Section 4.1, the manager process gives priority to the worker process over the OutLink processes while allocating thetasks. Thus, when a worker process becomes free, it will receive the next available task.The OutLink processes, when free, receive tasks from the manager process; if more thanone OutLink process is free, the manager allocates the tasks in a round robin ordering ofthe OutLink processes. Thus, as the tasks start entering the system, a processor keepsthe first task it receives and then, as long as the worker process has not completed thecurrent task, distributes the incoming tasks to its children.In order to obtain an upper bound on start-up time, we discretize the start-up intoa sequence of steps. On each step, a task is transferred from an OutLink process of oneprocessor to either the worker process, or an OutLink process, or, when they all havetasks, to the last process along the path without a task. Furthermore, it is assumed thatduring the start-up phase, no worker process finishes its first task. This is a reasonableassumption since the task forwarding link processes run at high priority while the workerprocess runs at low priority. Assuming that there is a continuous flow of tasks into thesystem, we seek to bound the number of steps required before every worker process hasa task.For analysis purposes, we use a collapsed process graph like the one shown in Figure 4.3(c). This graph is identical to the original architecture graph except that eachnode consists of the InLink, the manager, and the worker process of a processor plusthe corresponding OutLink process of its parent processor. In this tree, all the tasksare initially at the root and on each step every node with more than one task forwards,in round robin order, the last task it received to a child. In order to analyze the worstcase, we assume that the tasks can be buffered at the manager process and thus eachsubtree has an infinite capacity to absorb tasks. We first obtain an upper bound on thenumber of steps to distribute at least one task to every node. This is equivalent to theprocedure described above, except that it takes one additional step at the end to ensurethat a leaf node forwards its task to the worker process.Given a rooted oriented tree T, with child nodes numbered from left to right startingChapter 4. Processor Farm: Design and Modeling 45at one, let c(v) be the child number of node v with respect to the parent of v, p(v). Letp(v) denote the uth ancestor of node v in T, and let deg(v) be the down-degree of nodeV.Definition 1 Let d(v) equal fl0 deg(p(v)).Lemma 1 For any rooted oriented tree T, the first task received by node v is then—21 + c(p1(v)) +j=otask at the node p(v), the root of T.Proof:Let s(v, i) be the number of the ith task received by node v. Using the fact that thechild c(v) receives every deg(p(v)) task arriving at p(v), except for the first which is keptby p(v), we obtain the following recurrences(v, i) = s(p(v), 1 + c(v) + (i — 1)deg(p(v))).In general,s(v, i) s(p’(v), 1 + c(p°(v)) + (i — 1)do(p(v)))= s(p(p(v)), 1+ c(p(v)) + [1+ c(v) + (i - 1)do(p(v)) - 1]do(p(p(v)))= s(p2(v), 1 + c(p’(v)) + c(v)do(p2( )) + (i — 1)do(p’(v))do(p2(v) )= s(p2(v),1 + c(p(v)) + (i — 1)d(p(v)))s(v,i) = s(p(v)),1 +c(p(v)) + (i — 1)d_i(p(v)))At the root, s(v,i) = i. Thus, when p(v) is the root,s(v, 1) = 1 + c(p’(v)) (4.4)DChapter 4. Processor Farm: Design and Modeling 46Example:Let us derive the task number of the first task that arrives at node 5 in the graph shownin Figure 4.3(a) using equation (4.4). For node 5, n = 2. Thus,s(5, 1) = 1 + c(p1(v)) + c(p°(v))do(p2(v )= 1+2+2x2=7.The time step at which the worker process on node v receives task s(v, 1) is n + s(v, 1).This follows from the fact that task s(v, 1) leaves the root after s(v, 1) — 1 steps, andit takes n + 1 more steps for this task to reach the worker process on node v as thistask gets forwarded in every step because s(p”(v), 1) < s(v, 1). The next theorem followsfrom these remarks.Theorem 2 For any tree structured process graph, aftermax {n+s(v,1)}v a leafsteps, every worker process has a task to execute.The time required for each step is determined by the communication cost to transfera task from a processor to its child and the associated CPU overhead. The averagecommunication time needed to transfer a task from one processor to its child is givenby Td = d/T. The associated overhead is /3f/2 since /3 includes the overheads for bothtransferring a task to a child node and returning the corresponding result to the parent.Therefore, the time required for each step is given by Td + 3j/2.From Theorem 2 it follows thatT8 = (Td+/3f/2) max {m+s(v,1)} (4.5)v a leafFor the tree shown in Figure 4.3(a), the start-up time is determined by the timerequired for node 5 to receive its first task. As derived earlier, s(5, 1) = 7 and m = 2.Chapter 4. Processor Farm: Design and Modeling 47Thus, the start-up time for this example is given by= 9(Td +/3f/2) (4.6)The upper bound on the number of steps required for every worker process to havea task to execute can be exponential in N. Consider for example a tree with a longpath where each node on the path has large degree but all nodes off of the path areleaves. The sum of products in s(v, 1) grows exponentially with respect to N. This is aresult of our assumption that subtrees can always accept tasks. In practice, the upperbound cannot exceed the number of buffers in the tree, 4N. But it is possible to comearbitrarily close to 4N. As the upper bound is proportional to the number of buffers, forstart-up, ideally the number of buffers should be minimized. This is one of the reasonswe chose to have only one additional buffer on every worker processor.Although the degree and depth of the tree is fixed, it is possible to change c(v).Theorem 2 provides a means for determining an orientation of the tree that minimizesstart-up time.Example:Earlier, for the tree shown in Figure 4.3(a), we found that s(5, 1) 7. If we changethe orientation of the tree by interchanging nodes 2 and 3, the start-up time is stilldetermined by the time required for node 5 to receive its first task. Now, howevers(5, 1) = 6.In summary, to minimize start-up, the tree should be oriented so that the longest pathappears on the left (that is, on start-up, tasks are forwarded first along the longer paths).Wind-downWind-down begins when the last task enters the system and ends when the last resultleaves the system. The Wind-down phase can be broken into two parts: the time tocomplete the remaining tasks in the system and the time to return results that are inthe system after all the tasks have been executed.Consider the state of the process graph when the last task enters the system. Asgiven in Section 4.1, there are 4N tasks. In order to derive an upper bound, let usChapter 4. Processor Farm: Design and Modeling 48assume that all the remaining tasks have just started to execute. We will bound themaximum number of tasks executed by a single processor during the wind-down phase.This gives an upper bound on the time taken to complete the 4N tasks.If all of the tasks have just begun execution, then after time c, each processor hasexecuted one task. Since priority is given to distributing the tasks, some of the tasksbuffered in each processor will be forwarded to the child processors. In the worst case,when c greatly exceeds,the tasks will be forwarded as far as possible towards theleaves.A worker process at a leaf can only execute those tasks available at its ancestorprocesses in the process graph. Therefore, we should consider the longest path fromthe root to a leaf in the tree (this is at most 3N + 1, for example see Figure 4.3(c)) toderive an upper bound. Let m be the number of ancestor processes of the leaf processin the longest path, each of which initially contains a task. Starting at the root, everythird process along the path is adjacent to a worker process. Therefore, after time o,when each of the worker processes have finished executing a task, each worker processadjacent to the path will receive a task from a manager along the path. The remainingtasks on the path shift down towards the leaves filling as many buffers as possible. Thisresults in the following recurrence for p(m), the length of the path,p(O) =p(m) p(n-l)-.Solving this recurrence for p(m) 1 shows that after log312 ml + 1 steps, all tasks havebeen processed. Thus, the time taken for the first part of the wind-down phase is givenby c(log312m1 + 1).The second part of the wind-down cost is determined by the time required to forwardthe last result from the leaf process to the root. If r is the average result size per task,the communication time needed to transfer a result from one processor to its parent isgiven by Tcr = r/-r. The associated overhead per transfer is /32 since f3 includes theoverheads for both transferring a task to a child node and returning the correspondingresult to the parent node. Therefore, the second part of the wind-down cost is given byChapter 4. Processor Farm: Design and Modeling 49m/3 x (Tc,. +/3f/2) since there are m/3 processors in the path. If m is the length of thelongest path, the wind-down cost is given byTd = (log312rn1 +l)+(Tcr+f/2). (4.7)In the tree shown in Figure 43(a)), there are two longest paths, the path from theroot to node 4 and the path from the root to node 5. In this example m = 9 and thewind-down time is given byTd 7a+3(Tcr+/3f/2). (4.8)This analysis depends only on the depth of the tree and therefore gives the samebound for all breadth-first spanning trees of the topology. Note, however, that theanalysis is overly pessimistic since subtrees along a path from the root to a leaf alsoreceive tasks from managers on the path. The actual wind-down time also depends onthe number of nodes along the path with down degree greater than one. The fewer thenumber of nodes of degree one, the smaller is Td. Because of our round robin schedulingpolicy, any node of degree greater than one can only forward at most one task towardsthe leaf of the chosen path out of every three tasks that arrive at the node. Therefore,the number of tasks on the path decreases more quickly and if every processor on thelongest path has at least two children, then the number of tasks executed by a leaf isbounded by max{ log3 ml + 1, 4}. Leaves must execute at least 4 tasks, namely thosein their local buffers.For the example topology shown in Figure 4.2, the total execution time for processingM tasks is given bya3(M — 20)Ttotai 9(Tcd +/3f/2)+ 52—6c3 + 2i3+ 7c + 3(Tcr +/3f/2). (4.9)4.2.2 Balanced Tree TopologiesIn this section, we analyze the performance of processor farms on balanced tree topologiesusing the framework given in the previous section. Balanced tree topologies are of interestbecause (a) a k-ary balanced tree topology, where k is the number of links on each node,Chapter 4. Processor Farm: Design and Modeling 50provides optimal performance, and (b) for balanced trees, it is possible to obtain closedform solutions for system throughput and speedup.For processor farms, a k-ary balanced tree topology provides optimal performanceamong all the topologies of the same size for the following reasons:1. From Theorem 1, we know that the best topology for processor farms is a breadthfirst spanning tree. As a k-ary balanced tree is a spanning tree with minimumpossible depth among all degree k graphs, it provides a steady state performancethat is as good as any possible spanning tree.2. As explained in Section 4.2.1, start-up cost is proportional to the length of thelongest path in the graph. It also depends on the ordering of the children, and canincrease when the graph is unbalanced. The length of the longest path in a k-arybalanced tree topology is minimum among all the graphs with the same number ofnodes. The symmetry in balanced tree implies that the orientation of the tree doesnot affect the start-up cost. Thus, a balanced k-ary tree topology has minimumpossible start-up time among all trees with the same number of nodes.3. As shown in Section 4.2.1, wind-down cost is proportional to the length of thelongest path in the graph. The wind-down cost also decreases as the number ofchildren at processors in the longest path increases since these children steal tasksfrom the path. As a result, this decreases the number of tasks forwarded to the leafprocessor in the longest path. As mentioned earlier, the length of the longest pathin a k-ary balanced tree is minimum among all trees of the same size. In addition,all the non-leaf nodes in a balanced k-ary tree have the maximum possible numberof children and thus reduce the number of tasks forwarded to any particular leaf.Thus, a complete k-ary tree has minimum possible wind-down time among all thetrees with the same number of nodes.As defined in Section 4.2.1, let o be the average processing time plus the overheadto execute a task locally and let /3f be the overhead for every task forwarded to a childprocessor. Consider a D level balanced k-ary tree with k processors on level i. Figure 4.4shows binary and ternary tree topologies with D = 4 and 3 respectively.Chapter 4. Processor Farm: Design and Modeling 51Figure 4.4: Binary and Ternary Trees with D = 4 and 3We begin by analyzing the steady-state phase, followed by an analysis of start-upand wind-down.Steady-state AnalysisLet M be the total number of tasks processed during the steady-state phase. Assumingthat no processor idles during the steady-state, all the processors at a particular level ofthe tree execute the same number of tasks. We can then express the steady-state time(T85) for each processor in terms of the number of tasks processed and forwarded andtheir associated costs and overheads. From the general analysis in Section 4.2.1, we haveT8 = ci(Vj — ) + /3 (4.10)jEC(i) jEC(i)Since all the processors on a level execute same number of tasks, by summing equation (4.10) over all the processors on level i, we have=)+3i i jC(i) i jEC(i)=(4.11)i+1LetL, = Vj(a) Binary Tree (b) Ternary TreeChapter 4. Processor Farm: Design and Modeling 52be the number of tasks that visit a processor on level i.Therefore,T58 = oL+(/3f—o)kL+l.By rearranging the above, we havek(c — = oL—T55 (4.12)and L0 M.The above recurrence is of the following form described in Chapter 2 of Knuth’sConcrete Mathematics [GKP89].an = k(c—/3f)== —T8SnSolving the recurrence, we obtaini 1—L M [k f)] [ 1— k(-f)Leta=ThenL = Ma—(i — 1/ac 1—1/a=MazT/1. (4.13)SinceL=Ofori=D,mIDMaD — —- (1-1/a)T — MQ(1 — 1/a)SS— (1_i/aD)Chapter 4. Processor Farm: Design and Modeling 53By substituting for a, we obtain— M[a—k(o—31)]T5—— (k(f))D (4.14)DiscussionFrom the steady-state execution time given by equation (4.14), we can derive expressionsfor throughput and speedup. The steady-state throughput of a D level balanced k-arytree is given byMSD =I1 1(k(_f) (415)a—k(ci—/3f) )The steady-state speedup of a D level balanced k-ary tree is given bySPD=I=— k(a—D(4 16)k(3)Here, speedup* is defined as the ratio of the execution time of the parallel algorithm ona single processor (execution time includes the associated overhead in addition to taskprocessing time) to the total execution time of the algorithm on the parallel system.We can also determine the fraction of the total number of tasks that are executedon each processor. For a processor on level i, it is given by— L—kL1417MBy substituting equation (4.14) for T5 in equation (4.13),Maz— M(a — 1) [a 1]’L aD= a —(a —1)aD_iaD— a=*This is different from the usual theoretical measure of speedup, the time of the most effective serialalgorithm divided by the time of the parallel algorithm.Chapter 4. Processor Farm: Design and Modeling 54By substituting the above in equation (4.17),aD— a a’ —= a3 — 1— ka’ — 1— aD(l_k)+ai+l_ai418— aD_iNow consider the affect of 13 on the overall performance. Let g be equal to theratio /3f/a. This is the inverse of granularity which is generally defined as the ratio ofcomputation to communication overhead. The speedup given in equation (4.16) can nowbe expressed in terms of g as follows:a_____SPD = 1ja—k(a—/3f) a-_______— 1—k(1—g)=[k(1 -When g —* 0, we have1 — kDSPD= 1-k=N,the total number of processors (i.e., when overhead I3j is negligible compared to a,speedup is proportional to the number of processors in the system).When g —* 1 (i.e., when overhead /3 is almost equal to a), SPD = 1. For the case0 < 1— g < 1, (1 — g) is the factor by which the processing capacity of a processor isreduced due to the overheads involved in dynamically distributing the work. Figure 4.5shows how overhead /3f affects the efficiency.In case of a linear chain of N nodes, speedup is given byN 1 1_NN= (.)1—(1—g) g gFor large N,SPN=9as the second term in equation (4.19) becomes negligible. Thus l/g gives an upperbound on speedup on a linear chain.Chapter 4. Processor Farm: Design and Modeling 550.0.Efficiency0.0.Figure 4.5: The affect of /3 on efficiencyThe analytical results derived here for k-ary balanced trees are similar to those obtained in [Pri87, Pri9O, TD9Oj. The difference is in the way the overhead parameters areincluded in the model. With proper substitutions, it is possible to obtain their throughput expressions from our model. Pritchard has analyzed processor farm on a linearchain and his model [Pri87, Pri9O] is an abstract one which uses the characteristics ofthe machine as the overhead parameters to the model. As a result, it does not take intoaccount the scheduling strategy and the associated software overheads. Tregidgo andDownton [TD9O] have extended Pritchard’s analysis for balanced binary and ternarytrees. Again, these models only considered the hardware characteristics as the overheadparameters. Tregidgo and Downton have validated their model using a simulator. Contrary to statements in [TD9O], the model they derived also holds for distributed farmsas well as small centralized farms.In comparison, we have provided a general framework to derive the performancemodels for processor farms on any topology. These models assume a realistic dynamicscheduling strategy, and account for all the associated software overheads. Also, we haveanalyzed start-up and wind-down phases, which are significant for applications consisting0.2 0.4 0.6 0.8granularityChapter 4. Processor Farm: Design and Modeling 56of a smaller number of tasks. The models have been experimentally validated and areaccurate as discussed in Chapter 5.Start-up AnalysisThe start-up analysis presented in Section 4.2.1 for arbitrary tree topologies can be usedto obtain start-up costs for balanced tree topologies. The start-up cost for an arbitrarytopology is given by equation (4.5), which is reproduced below,T8 = (Td+13f/2) max {m+s(v,1)}. (4.20)v a leafIn a balanced tree, the rightmost leaf node will be the last among all the nodes to receiveits first task. Thus, the start-up time is determined by the number of steps required forthe rightmost leaf node to receive its first task. For a D level, k-ary tree, n. = D — 1. Forthe rightmost leaf node, s(v, 1) can be obtained using equation (4.4), and is given bykD— 1k—ithe total number of nodes.Thus, for a balanced k-ary tree of D levels, the start-up cost is given byT8 = (N + D— 1)(Tcd + /3/2). (4.21)Wind-down AnalysisIn this section, we use the wind-down analysis presented in Section 4.2.1 for arbitrarytree topologies to derive the wind-down time for balanced trees. The wind-down time isgiven by equation (4.7) which is reproduced below:Td a(rlog312 ml + 1) + +/2). (4.22)where m is the length of the longest path in the process graph. For a N node linearchain topology, the wind-down analysis presented in Section 4.2.1 holds with m = 3N.The wind-down time is given byTd = (log312 3N1 + 1) + N(TCr +/3f/2). (4.23)Chapter 4. Processor Farm: Design and Modeling 57As explained in Section 4.2.1, in a balanced tree with degree greater than one, thenumber of tasks in the longest path decreases more quickly. For this case, the numberof tasks executed by the leaf node on the longest path is given by max{1og3ml + 1, 4}.For a D level k-ary balanced tree, m = 3D, and wind-down cost is given byTd = (rlog3D1 +1)+D(Tcr+/3f/2). (4.24)The total execution time for processing M tasks on a D level k-ary balanced tree isgiven byTtotai = + T88 + Td= (N + D- 1)(Tcd ++ (M - 4N)[ - k(- )]1 — (k(a_/3f))+€(rlog33D1 + 1) + D(Tcr + /3/2).4.2.3 Communication BoundThe performance models derived in Sections 4.2.1 and 4.2.2 are applicable only whenthe system is computation bound. If the system is communication bound, it may neverreach steady-state and processors may idle. In this section, we analyze the performanceof processor farms, when the system is communication bound.There are two cases in which the performance of a processor farm system might becommunication bound. The first corresponds to the actual transfer portion of the communication; the second corresponds to the CPU overhead required for communication.Here, we assume that the links are homogeneous.Case (1)In the first case, throughput is limited either by the the rate at which the farm canreceive tasks or the rate at which it can transfer results to the manager, whichever issmaller. LetT = max{Tcd,Tcr}.The system throughput corresponding to this limit isScorni= T±’(4.25)Chapter 4. Processor Farm: Design and Modeling 58where [3 is the processor overhead required to receive a task from a parent or to send aresult to the parent. [3c includes the overheads required to make the newly arrived taskavailable for processing (or forwarding), to allocate a new buffer for the Inlink processto receive the next task and to initiate the communication. It also represents the corresponding time required to initiate the transfer of a new result after the communicationof a previous result to the parent node is completed. In addition, overhead /3 can beestimated by /3 /3/4 as /3f is the total processor overhead for receiving a task, forwarding it to a child processor, receiving the corresponding result and forwarding theresult to the parent node.Case (ii)The second factor that limits throughput is the CPU overhead in transferring the tasksand results. Since the first worker processor in the farm has to incur an overhead of atleast /3 for every task received from the manager and forwarded to a child processor,the overall throughput is limited byScorn2 < , (4.26)irrespective of the number of workers in the farm.The communication bound on the throughput of the system is the smaller of the twobounds obtained from equations (4.25) and (4.26).4.3 DiscussionIn this section, we discuss how the performance models derived in Section 4.2 can beused in performance tuning.4.3.1 Optimal N and TopologyOur models can be used to determine the optimal N and topology to maximize performance for a given application program.If 3j < (Tc + J3c), then the intersection of equations (4.15) and (4.25) gives theoptimal number of processors to use to maximize throughput. Beyond this optimalChapter 4. Processor Farm: Design and Modeling 59number, overall performance of the system does not increase. For this case, ignoringstart-up and wind-down, the optimal level of a k-ary processor tree, is given by— log [i—_(I3f)]D0— rk(—!3f) ( .2 )bogLIf /3 > (T + j3), then the optimal number of processors is given by the intersectionof equations (4.15) and (4.26). For this case, D0 is given by— log [i — c_k_I3f)]D0—. ( .28)log30002500_____________________________________Comm Iimit22000 Steady-stateThroughput1500Comm limiti100050010 20 30 40 50Number of processors, NFigure 4.6: Plot of throughput curves for a linear chain (with Te = lOms, /3e 4821Us,13e = 453its)For a linear chain of N nodes, steady-state throughput is given bySN= 1[((cf))N](4.29)In Figure 4.6, we have plotted the three throughput equations (4.25), (4.26) and (4.29)for a linear chain with a set of typical parameter values. As /3 < (T + /3) in thisexample, optimal N can be obtained from equation (4.27) which gives a value of 20.60Chapter 4. Processor Farm: Design and Modeling 60Equation (4.28) is not applicable for linear chain even if [3 > (T + i3). For this case,As N—÷ oo, throughput reaches the limit l/3. But from Figure 4.6, we can observe thatafter a certain value of N, the increase in throughput with an increase in the numberof processors is very small. This value of N can be determined by iteratively evaluatingthe throughput for increasing N using equation (4.29).6000Ternary TreeThroughput E_Number of processors, NFigure 4.7: Comparison of processor farm throughput on linear chain, binary tree andternary tree configurationsIn Figure 4.7, we have plotted throughput as a function of the number of nodes forthree different topologies: linear chain, binary and ternary tree. As expected, we canobserve that for any particular N, the ternary tree configuration gives better throughputthan the other two, as long as the system has not reached one of the two communication bounds. This shows that with a k-ary tree topology, it is possible to achieve thesame throughput with fewer number of nodes. It also shows the dramatic increase inthroughput possible by using a binary tree rather a chain. The increase in throughputfrom binary to ternary tree is much smaller.If the total number of processors available in the system is less than the optimalnumber and it is not possible to have a complete k-ary tree, it is better to use a topologyChapter 4. Processor Farm: Design and Modeling 61in which all the levels except the last are complete k-ary with the remaining nodesbalanced at the last level. If it is not possible to obtain a k-ary tree due to systemconfiguration restrictions, it is better to use a tree with the next possible larger degree.If the total number of processors available in the system is larger than the optimalnumber of nodes for a given application program, then one can use multiple k-ary treetopologies to improve performance. This is possible only if there are multiple links fromthe manager to the worker farm. The manager could be a host workstation or one ofthe multicomputer nodes. In the first case, the number of k-ary tree topologies one canuse is limited by the number of available host links. In the second case, it is possible tohave up to k k-ary tree topologies. For both cases, overall throughput of the system isgiven by the sum of the throughput of each of the individual tree topologies, providedthe manager can keep up a continuous flow of tasks to all the worker trees.In practice, throughput can decrease as N exceeds the optimal value. ‘vVe haveobserved this phenomenon in the validation experiments and the reasons are discussedin Section 5.3.4.3.2 Problem ScalingOur models can be used to determine how well the speedup scales with problem size.In the processor farm case, an application can be scaled in two different ways. One wayto scale the problem is to increase the total number of tasks, M. It follows from equation (4.16) that steady-state speedup is independent of M. Thus, scaling the problemsize by increasing M does not lead to any increase in the steady-state speedup. However,it may lead to a small increase in the overall speedup because wind-down and start-upnow represents a smaller portion of the total execution time.•The second way to scale a problem is to increase the granularity of the tasks byincreasing Te. In Figure 4.8, we have plotted steady-state speedup of a linear chaintopology for different values of Te. Figure 4.8 shows that steady-state speedup increaseswith increasing values of Te as long as the system does not reach either of the twocommunication bounds. Thus, to increase speedup, it is better to increase Te ratherthan M.Chapter 4. Processor Farm: Design and Modeling 624’Te = 20 ms3’Speedup2 Te=lOms1 Te=5msTe = 1 msFigure 4.8: Measured speedup for processor farm on linear chain4.3.3 GranularityThere are many applications in areas such as nnmerical analysis and image processingin which it is possible to decompose a problem of fixed size in several ways. The computation requirements of the tasks and the total number of tasks may vary from onecase to another. Also, some programs may be easily restructured to produce tasks ofdifferent computation requirements. Granularity of the tasks is given by the computation requirements of the tasks. Performance models can be used to determine the bestgranularity to be used for an application to obtain maximum performance.It can be observed from Figure 4.8 that for a fixed N, steady-state speedup increaseswith an increase in Te. Also, the optimal value of N increases with increasing Te. For aproblem of fixed size, increasing the granularity reduces the total number of tasks, M. Inaddition, it may increase data and result sizes which leads to increased communicationcosts. Both small M and larger communication costs will add to start-up and wind-downcosts.Figure 4.9 shows a graph of speedup, with the effect of start-up and wind-downincluded, as a function of granularity and N for a processor farm running on a linear10 20 30 40 50 60Number of processors, NChapter 4. Processor Farm: Design and Modeling 634,Speedup 2’GranularityFigure 4.9: The affect of granularity on speedupchain for a problem of fixed size. Here, granularity is defined as the ratio of new Te to anoriginal smaller Te. Notice that, up to certain point, speedup increases with granularity,and then starts decreasing. Therefore, one can not use an arbitrarily large granularity,rather the optimal operating point of the system must be calculated as a function of N,Te and the values of the overhead parameters.4.4 Chapter SummaryIn designing an efficient processor farm system, many trade-offs have to be considered.In this chapter, we have described the design of Pfarm, detailing the factors that affect the overall performance of the system and how they have to be addressed in theprocessor farm design. We have presented a general analytical framework that can beused to determine the performance of a processor farm system on any topology. The40Number of processors, NChapter 4. Processor Farm: Design and Modeling 64interaction between the design and modeling phases have been discussed throughout thechapter. We have outlined how the models can be used in restructuring applicationsand in determining the optimal number of nodes and topology to be used to maximizeperformance.In Chapter 5, we experimentally validate the performance models derived in thischapter on a large transputer-based multicomputer. The research results described inthis chapter along with the experimental validation presented in the next chapter werepublished in [SCW92, WSC93].Chapter 5Processor Farm: ExperimentsIn this Chapter, we validate the performance models derived in Chapter 4 for processorfarms. The models are experimentally validated using a Pfarm implementation on themulticomputer described in Section 3.3. Pfarm was validated using the Logical Systemsversion of the software.We used a syllthetic workload in all of the validation experiments. The applicationprogram consisted of a set of tasks, each of which executed an empty loop. The numberof iterations of this loop determines the task execution time Te for the particular experiment. By running the ioop at high priority, it was possible to determine the numberof iterations necessary to produce a task size of 1 ms. Multiples of this value were thenused to obtain the different Te’5.In Section 5.1, we describe the experiments conducted to determine the values ofthe system overhead parameters. In Section 5.2, we validate the performance modelsfor arbitrary tree topologies, and in turn show that Pfarm works on an arbitrary topology. Performance models for balanced tree topologies are validated in Section 5.3. Wedescribe the results of the experiments to test the robustness of our models in Section 5.4.5.1 Determining System OverheadsIn order to compare the analytical model with the actual execution, it is first necessaryto determine the values of the system overhead parameters, /3e and /3f As explained in65Chapter 5. Processor Farm: Experiments 66Section 4.2, /3e is the processor overhead to execute a task locally and /3 is the processoroverhead for every task forwarded to a child processor. These overheads depend onlyon the implementation of Pfarm and are independent of the application program. Also,they do not depend on the underlying topology because the software paths in Pfarm thatconstitute these overheads are the same for any topology. Therefore, these overheadshave to be determined only once for a particular Pfa’rm implementation. In some applications, it may be difficult to distinguish between what constitutes the computationtime of a task (Te) and the associated overhead (/3e). For these cases, we can measureo (the sum of Te and /3e) and use it in the models. The techniques for measuring thevalues of o are discussed in Section 8.2.The values of the overhead parameters, /3 and,can be determined by running afew experiments on configurations with one and two worker processors (see Figure 5.1).These experiments were first conducted with the Logical Systems implementation ofPfarm.Figure 5.1: Configurations for determining 13e and /3fFor the configuration shown in Figure 5.1(a), the total execution time is given byTtotai M(Te+13e). (5.1)Experiments were run on this configuration with Te 1, 5, 10, 20 and 40 ms and alarge value of M 10000. The value of /3e was obtained by substituting the measured(a)(b)Chapter 5. Processor Farm: Experiments 67execution time in equation (5.1) for each of the five cases. As expected, for different Te’S,/3e remained constant, varying by less than 3is. The average value of 13e was 482 is.For the configuration shown in Figure 5.1(b), we can express the total executiontime in terms of the number of tasks processed (M1) and forwarded (M2) by Workerl.Workerl spends (Te + /3e) for every task it processes, and /3 for every task it forwards.Thus,Ttotai Mi(Te+/3e)+M213f. (5.2)The same set of experiments were run on configuration 5.1(b). We measured Ttotai, M1and M2 and solved for /3. by substituting these values into equation (5.2). Again, thevariation between the values obtained for 13r for different cases was within 3ps. Theaverage value of /3 was 453 1us.The same set of experiments were run with the Trollius version of Pfarm, usingphysical layer communication. For this case, /3e was 570 is and /3 was 1.3 ms. Eventhough the design is the same for both Logical Systems and Trollius versions of Pfarm,the implementations are slightly different. The overheads are higher in Trollius becauseof the higher costs of memory allocation and deallocation, and process management. Forthe validation experiments, the Logical Systems version of Pfarm was used.5.2 Arbitrary TopologiesThe analytical framework described in Section 4.2.1 was used to determine the performance of Pfarm on arbitrary topologies. To test the general model, we conducted severalexperiments on three different breadth-first spanning trees of the 8 x 3 and 8 x 8 meshtopologies.Table 5.1 shows the predicted and measured total execution time for the differentbreadth first spanning trees (shown in Figure 5.2) of an 8 x 3 mesh. For each case,the total number of tasks is 10000 and time is given in seconds. Note that becauseof the large number of tasks, steady-state time dominates the execution time, so theexperiments are generally testing the accuracy of the steady-state model. Although thedistribution of tasks to processors is different for different breadth first spanning trees,Chapter 5. Processor Farm: Experiments 68BFST1 BFST3Figure 5.2: Three breadth-first spauuiug trees of the 8 x 3 mesh.as predicted by Theorem 2, the overall executiou time remaius the same (neglecting thevery small experimental errors).Results for two differeut breadth-first spanuiug trees of the 8 x 8 mesh topology(similar to the first two BFSTs of 8 x 3 shown in Figure 5.2) are given in Table 5.2.Again, the total number of tasks used is 10000 and time is in seconds. As Tables 5.1and 5.2 show, the maximum error between the predicted and measured total executiontime is less than 1.5%.For comparison purposes, in Tables 5.1 and 5.2 we have included the results of usinga chain rather than a breadth first spanning tree. Note that the speedup obtained byusing the spanning trees is significantly higher than that obtained by using the chain.The minimum value of Te used in the experiments is 10 and 30 ms for’ the 8 x 3and 8 x 8 mesh, respectively. For smaller Te, system throughput reaches the secondcommunication bound given by equation (4.26). This scenario can be easily identifiedby using the analytical technique described in Section 4.2.1. If, in solving the equations,BFST2©©©©©ccc.)1CRCRCRcccc-C)00Cl)C) C) H E C)C)00I. CR—4p C)—4CRC)1 C C c) Ci) C C C) C•) C) C) Ci) C) C) Ci) Ci) 00 x 00 C) CI)C) Ci) C/):.I’C)91-D00)100C)ccCiCl)©00ccH00CiCRt00IC) Ci) C) cL H C)-i00C)cEic00ccooHicccczC)1 C C E -i Ci) C C I-i C) C) C) C) Ci) C) -i C) Ci) Ci) 00 x cz B C) Ci)pppp©©©©©HCX)—4C)CRC)CZCl)ccoo-CRc0000C)00—ccc—————————C)—--C)P-9191H-00CRCi)cc-—.C)©CR:‘00C)—00C)—)----CccCRCRC)©00C)-Cl)Cl)cc-HC)CRCiZ:--------B.---i4-©C)F—i-CccCRi-C)C)cccc-C)H00ccczCRH E__iC)-cccCRCl)cc-4HcccC)C—91C)H:‘cc—ci—.CRCiCi)CRICRt300C)C———)-Cl).I)Is):--cc00C)’)--4I:iI’I’(C—CRcc-H:00Is)CRIs))—.C?)Ccccc©czC)i—(DC—)C/).-iCRCRCRC)p-C)cIs)C)CRIs)ccCi)C)CRCRccc-Chapter 5. Processor Farm: Experiments 70a processor executes a negative number of tasks, the system is communication bound.In this case, it is better to use a smaller topology. An optimal subtree can be foundby removing, one by one, the leaf nodes that are farthest from the root and solving thesystem of equations until a feasible solution is obtained (i.e., all processors execute apositive number of tasks).The models for start-up and wind-down were validated by running a separate set ofexperiments on both of the configurations. The start-up model was validated by runninga number of experiments with different Te and M and observing the task number of thefirst task processed by each processor. In all the cases, the task number of the first taskexecuted at a node was same as that predicted by the model given in Section 4.2.1. Thewind-down model was validated by running experiments with M = 4N and observingthe number of tasks executed by the leaf node in the longest path. For all the cases, thisleaf node executed as many or fewer tasks compared to that predicted by the wind-downanalysis given in Section 4.2.1. These experiments were performed with a varied Te onboth the 8 x 3 and 8 x 8 mesh.5.3 Balanced Tree TopologiesIn this section, we validate the performance models derived in Section 4.2.2 for balancedtree topologies. Table 5.3 gives the range of experiments conducted to validate thesteady-state, start-up and wind-down models.Model Topology N M Te(ms)Steady-state Chain 1 to 64 10000 1,5,10,20,40Binary tree 1 to 63 100000Ternary tree 1 to 40Start-up and Chain 1 to 64 4N 1,5,10,20,40wind-down Binary tree 1 to 63Ternary tree 1 to 40Table 5.3: Range of processor farm experimentsTo validate the steady-state model, experiments were run using a large M so that,Chapter 5. Processor Farm: Experiments 71in comparison start-up and wind-down was negligible. The minimum value of T6 chosenfor these experiments is 1 ms because in order for Pfarm to make use of more than oneworker, T must be greater than /S, where /3 453jts. We validated the start-up andwind-down analysis separately by performing experiments with M = 4N. For this valueof M, the total execution time consists of only the start-up and wind-down phases sincethe system never reaches steady state.5.3.1 Steady-state ValidationFirst, we present the results of the validation experiments in which data and result sizeswere small, and (T + /3) < /3. This ensured that the communication bound givenin equation (4.25) was not reached for any of the experiments. Table 5.4 shows thepercentage error between the predicted and measured execution time for a linear chainconfiguration. Table 5.4 shows that the percentage errors are within 3%. Also, for afixed 7’s, the total execution time continues to decrease up to a certain value of N. Afterthis point, there is no considerable decrease in the execution time as the throughputapproaches the asymptotic communication limit of l//S. For example, for T = 1 and 5ms, the decrease in execution time is small after 8 and 32 nodes, respectively.Tables 5.5 and 5.6 show the percentage error between the predicted and measuredexecution time for binary and ternary tree topologies, respectively. From the tables,observe that the percentage error again does not exceed 3% until the system reachesthe asymptotic bound, i//3. Unlike in the linear chain case, the measured executiontime may begun to increase as N increases after the optimal point. For example, inthe ternary tree case, for Te = 1 ms the optimal N is 13 corresponding to the numberof nodes in a 3-level tree. However, performance degrades significantly when anotherprocessing level is added. For N larger than the optimal number, the total processingcapacity of the system exceeds the rate at which tasks can flow into the system, whichis i//Sj. But, any demand-driven dynamic scheduler continues to forward tasks to theworkers farther down the tree as long as these workers have free buffers and there areunprocessed tasks. Thus, the workers closer to the manager execute fewer tasks sincemost of the time they are busy forwarding tasks. This in turn, leads to poor utilizationC’DaC cjC)C—.C) C) C) C) CO C)00)—d C)C):C)C)C)C):01cc01c-&—cc00—.-t-1010101—4CC—400—.t’-D00cc0000C)CCZ03CO01C)00Q)-4©—4CRCC)©©©03cc©cC)-----S--------—>4II—C)I—01©00©cc©3ccC00—.-CRCRCRCR00—400—---4—03400©C)Co03—40103CO©—-4CRC0100Coccc©©00C)b©©©©©©o--—1CCC01©©©ccccc0000I-3cc©©i—-ccCR—1C01-4—1I-::‘CR©©CRH-H©©©tcccccc—.CRC00©©CRCCX—.C—:03—3—4C)©03CRcc03©CRC)00cc00CR—3—400CRC)-0003C4-©CO©C)---II C)3C)—I3CR©©©I301CRccCRcc©©F—03-4CCCcc—.-CRCR—4©©CRc00—.-CO-ZCR—303cct3—C)CO00cc—4Is©03C)0303:‘Cc03CRC)00©©13:-C)III©©©II©©©©©©©©—©©©c00—401CC1-,3—3303C—3-:•CCCC03©I’-©03F—C©—303-Chapter 5. Processor Farm: Experiments 73of these workers in terms of the number of tasks locally processed, especially the rootworker which only ends up processing the first task it receives. When this phenomenonoccurs, the time spent by the root processor to forward all the tasks, except the firstone, generally exceeds the total time taken by this processor for the cases in which theprocessor topology had not reached the asymptotic limit. This causes the total executiontime to increase when the size of the processor topology is increased. Even though thisphenomenon occurs in the linear chain case, it does not lead to any appreciable increasein the total execution time because the flow of tasks down the topology is smaller. Theaffect of this phenomenon for tree topologies would becomes even worse when the numberof buffers on each processor was increased.Te10ms Te20msN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 104.881 104.868 0.012 204.941 204.915 0.0133 36.014 36.027 -0.036 69.369 69.370 -0.0017 15.970 15.974 -0.025 30.261 30.267 -0.02015 7.753 7.742 0.142 14.431 14.410 0.14631 4.829 4.828 0.021 7.158 7.138 0.27963 4.776 4.632 3.015 4.846 4.706 2.889______Te4Oms_____N Predicted Measured % ErrorExec Time Exec Time1 405.061 405.029 7.900e-33 136.102 136.096 4.408e-37 58.876 58.855 0.03615 27.820 27.788 0.11531 13.663 13.618 0.32963 6.864 6.820 0.641Table 5.5: Comparison of Predicted and Measured Total Execution Time for ProcessorFarm running on Binary Tree.In Tables 5.7 and 5.8 we have tabulated the percentage error between the predictedand measured total execution time on linear chain and binary tree configurations forexperiments with larger data and result sizes. Both the data and result size used inthese experiments are 1000 bytes per task. This leads to a larger communication timeChapter 5. Processor Farm: Experiments 74Te = 1 ms Te 5 msN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 14.827 14.808 0.128 54.851J 54.832 0.0354 4.811 4.809 0.042 14.628 14.633 -0.03413 4.793 4.555 4.966 4.854 7.080 -45.85940 4.749 6.016 -26.679 4.773 6.730 -41.001TelOfflS Te=2OmsN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 104.881 104.868 0.012 204.941 204.915 0.0134 27.116 27.133 -0.063 52.135 52.134 0.00013 8.682 8.677 0.058 16.383 16.376 0.04340 4.803 6.324 -31.668 5.471 5.460 0.201Table 5.6: Comparison of Predicted and Measured Total Execution Time for ProcessorFarm running on Ternary Tree.for forwarding tasks and results, and /3f < (T + /3). In this case, the rate of taskprocessing will be limited by the communication latency time. The system reaches thecommunication bound given by equation (4.25) after an optimal value of N. This valueof N depends on the computation size of tasks, Te. From the tables, we can observe thatfor Te = 5 ms, the system reaches its communication bound at N 12 and 15 for linearchain and binary tree respectively. While the system remains in steady-state, the erroris within 3%, however, once the communication bound is reached, the error is around10% and remains almost constant as N increases. In this case, measured execution timedoes not increase with an increase in N. This is because it takes longer to forwarda task to a child node and thus tasks do not get forwarded to the nodes farther fromthe manager for both chain and tree topologies. The errors obtained when the systemis communication bound are larger compared to the steady-state error because of thedifficulties in obtaining an accurate value for T. The value of r changes depending onthe utilization of the link in both the directions. We have used an optimistic value for rthat leads to a slightly larger value for optimal N. This is reasonable as the performanceof the system does not decrease in this case even when a larger N is used.Chapter 5. Processor Farm: Experiments 75Te = 5 msN Predicted Measured %ErrorExec Time Exec Time1 54.845 55.183 -0.6162 28.604 28.873 -0.9404 15.532 15.795 -1.6938 9.092 9.341 -2.73912 7.464 8.327 -11.56216 7.464 8.320 -11.46824 7.464 8.312 -11.36132 7.464 8.301 -11.214Te = 5 msN Predicted Measured %ErrorExec Time Exec Time1 54.845 55.183 -0.6163 19.347 19.585 -1.2307 8.844 9.009 -1.86515 7.464 8.172 -9.48531 7.464 8.169 -9.445Table 5.8: Comparison of Predicted and Measured Total Execution Time for ProcessorFarm running on Binary Tree under Communication BoundTable 5.7: Comparison of Predicted and Measured Total Execution Time for ProcessorFarm running on Linear Chain under Communication Bound9) ccCD CDCDCD‘-CSCDCD- ‘-C C ‘-CCl)9) CR p e0000:‘ie>CDCD‘CDPHeCs:)—b—.C)CsCRczcc0001cci-CDC—— CD9)9)9)9)9)9)HC—F-)——.Ci)CR:00©00CD—4C)C)c’iccC)CD-<CDCD‘-CDHeCD CDCD-Cl)‘-CCD:rj‘-C-C C ‘-CCRC)S— ‘-i-C-Ci)ccCD—CD Ci)i—h‘-CCDCDp-- CDCCDCDCDCD CD‘-C— HCD CDCDcC)Ci)CD‘lI- 9) 00I C)Cs:)0190 —:1Cs:)1’I 9)1C)CRCl’CDCiDCR CD CD Ci) ci) c -C C-li190 00 CR9) -4 ccL:Ij‘-C‘-Cc-.CDH cC’CDCDCD—‘-C-•‘-CcccCDcC’C’)—-CDCs:)CI)Ci)—cC’cC’ ‘-CCD ‘-CCD—CDccC’-CDCDcC’ CDCDCI)CDC))CD‘-CCDcC’ce,CDcC’ —.c+ Ci)CDCDCDocC’Cl)‘-CCDCi)CDCD‘-CDcC’CDCi)‘-CDCD,PCD—._cC’cC’cC’CDCD‘-C‘‘-CCD‘-CCD‘-C.C1_0,CD0‘-CCcC’,CD CDCDP Cs:) CR CR P Cs) 00CD‘-CDHoCDCDCD cC’ CI) -CCDCD-C‘-C C ‘-Cp 00 00E Ci)1CDCD-CCDCD-p -4 C) CCCp CR -4 ccCDCD-CCDCD-Li i.1E Ci)p Cs) C--i 9) ccLi CD229)9)9)9)-4C)C)CR,—.-CCs)—4Cs)Ic-)CDC)C)C)CRccC)CD‘-CR cc ccLi -C 0 -CChapter 5. Processor Farm: Experiments 77TelOms Te=2OmsN Predicted Constant Uniform Predicted Constant UniformExec Time %Error %Error Exec Time %Error %Error1 104.880 0.010 -0.400 204.941 0.013 -0.4532 53.602 0.000 -0.396 103.625 0.000 -0.4584 27.986 0.000 -0.382 52.978 -0.028 -0.5258 15.225 -0.110 -0.512 27.677 -0.188 -0.75516 9.020 0.610 0.455 15.235 0.407 -0.21732 6.047 0.496 -0.050 9.018 0.044 0.01148 5.186 0.174 -0.116 7.019 -0.527 -1.22564 4.828 -0.249 1.263 6.068 -0.906 -1.269Table 5.10: Comparison of Predicted and Measured Total Execution Time for uniformtask distribution for Processor Farm running on Linear Chain5.4 RobustnessIn all the experiments discussed so far, we have used a constant value for Te. However, inpractice, Te may vary from one task to another. In order to test the robustness of usingaverage values for prediction under this condition, we experimented with two commondistributions for task sizes: uniform and bimodal. Experimental results are comparedwith those predicted by the model using the average value for Te.We ran several sets of experiments with uniform distribution of task sizes. For allthe experiments, the total number of tasks used was 10,000. Table 5.10 shows thepercentage error between the predicted and measured total execution time for two setsof experiments on a linear chain configuration. The task execution time varies from 1 to19 ms (average Te = 10 ms) for the first set, and from 1 to 40 ms (average Te = 20 ms)for the second set. As Table 5.10 shows, the errors are all within 3%.Several sets of experiments were conducted with bimodal distribution of task sizes.In these experiments M was 10000, and the values of Te were 1, 5, 10, 20 and 40 ms.Here, we describe a set of experiments in which we used 5,000 tasks of 1 ms duration andanother 5,000 of 20 ms duration. In the bimodal distribution case, the order of arrivalof the tasks into the system also affects the performance. Experiments were conductedwith four different arrival patterns:Chapter 5&ocessor Farm: Experiments 781. Both 1 ms and 20 ms tasks arrive with equal probability.2. 20 ms tasks arrive at a probability of 0.75, until all 5,000 of them are processed.3. 1 ms tasks arrive at a probability of 0.75, until all 5,000 of them are processed.4. All the 1 ms tasks arrive before any 20 ms tasks.100-0- Casel-- Case 2% Error-a- Case 3-10 0- Case 4-200Figure 5.3: Error graph for processor farm on linear chain with tasks of bimodal distributionIn Figure 5.3, we have plotted the percentage error between the predicted and measured total execution times for these experiments on a linear chain topology. For prediction, we have used an average value of T = (1 x 5000 + 20 x 5000)/10000 10.5 ms inthe model. As we can observe from the figure, the errors are within 3% for all the fourcases, when N is less than 8. For larger N, the prediction is accurate for the first casebut the error increases for the other three cases. The maximum error observed variesfrom around 6.5% in the second case to around 10.25% in the third case, and is highestat around 15.0% for the fourth case.The errors depend on the extent to which the average Te reflects the actual computation requirements of the tasks. As long as N is smaller than the optimal valuecorresponding to the smaller Te of the two, the average value works well for all the cases.20 40 60 80Number of Processors, NChapter 5. Processor Farm: Experiments 79However, for larger values of N, the average value works well only wheu the two kindsof tasks are well mixed with respect to arrival order, as in the first case with equalprobability. Among the other three cases, tasks are mixed for a larger portion of thetotal execution time in the second case compared to the third case, aud there is no mixat all in the fourth case. In these cases, there is a correspondiug increase in the error,with the largest errors occurring for the fourth case. Obviously the average value is notappropriate for the fourth case since there is no task mixing at all. One can view theexecution as occurring in two distinct phases of computation, the first consisting of allthe 1 ms tasks and the second with all the 20 ms tasks. For this case, it is better touse the model twice, predicting the execution time for 1 and 20 ms tasks separately andadding them together for the total time.There is a theoretical possibility of finding a distribution of tasks with a particulararrival pattern that could lead to arbitrarily poor performance. This is due mainly tothe possible failure of the task scheduling strategy to balance the load among all theprocessors. However, it is very difficult to come up with this distribution as the systemis dynamic and events happen in a nondeterministic order. We believe it to be unlikelyfor any application program to consist of tasks with such a distribution.5.5 Chapter SummaryIn this Chapter, we experimentally validated the performance models derived in Chapter4 using a Pfarm implementation on a large transputer-based system. Experimentally weshowed that on a fixed topology, the performance obtained by Pfarm is the same for anybreadth-first spanning tree, as predicted by Theorem 2 in Chapter 4. We also discussedthe experiments conducted to determine the values of the system overhead parameters,/e and /l.Chapter 8 describes how Ffarm can be integrated into a programming environmentthat includes other programming tools such as a graphical interface, mapper and debugger. We describe the user interface for Pfarm and discuss how the models can be usedfor performance tuning.Chapter 6Divide-and-Conquer: Design andModelingIn this chapter, we describe the design of TrEK (Iee xecution kernel) that providesruntime system support for divide-and-conquer applications. TrEK is designed such thatit can execute divide-and-conquer computations of any fixed degree and depth on anytree topology. We derive models that accurately describe the behavior and performancecharacteristics of TrEK or any similar system that satisfies the assumptions outlined.Section 6.1 describes the design and implementation of TrEK. In Section 6.2, modelsthat describe the start-up, steady-state, and wind-down phases of the computation onany tree topology are derived. We use this modeling technique to derive performancemodels for balanced tree topologies. We close this chapter by discussing how thesemodels can be used in performance tuning and restructuring of application programs.6.1 TrEK: Design and ImplementationAs described in Section 3.4, TrEK assumes that there is a flow of tree structured computations that enter the root processor. Each of these tree structured computationcorresponds to a divide-and-conquer task and tasks are assumed to have a known fixeddegree and depth.TrEK is a runtime kernel that runs on each worker node. In order to execute anapplication, TrEK has to be provided with three application dependent functions, split,80Chapter 6. Divide-and-Conquer: Design and Modeling 81join and compute. The split function takes a task as input, splits it and outputs twoor more subtasks. The join function takes two or more results as input, joins them andoutputs a single result. Finally, the compute function takes a task as input, processes itand returns the result. An example of the task graph corresponding to an instance of adivide and conquer task is shown in Figure 6.1.Figure 6.1: Divide-and-Conquer Task StructureProcessor farm can be viewed as a degenerate case of divide-and-conquer. Thus, itis possible to extend the Pfarm design to that of TrEK. In addition to the design issuesand goals addressed by Pfarm, TrEK should also be able to execute divide-and-conquertasks of any fixed degree and depth on any arbitrary processor topology. In this section,we explain the design modifications and extensions for TrEK from that of Pfarm.As in Pfarm case, TrEK is designed as a set of cooperating processes in which eachlink is controlled by a separate process. Figure 6.2 shows the process structure of TrEKChapter 6. Divide-and-Conquer: Design and Modeling 82— —I— — —Figure 6.2: TrEK Process Graph on an Intermediate Worker Processoron an intermediate worker node in the processor tree. In the task distribution path,there is an InLink process that receives tasks and a number of OutLink processes thatforward subtasks to the children. In the result forwarding path, there are InLink processes that receive results of subtasks from the children, and an OutLink process thatforwards results onto the parent. There is a task manager process that controls the taskdistribution, and there is a result manager process that controls the collection and forwarding of results. In addition to these system processes, there are three user processeson each intermediate node. The split process receives tasks from the task manager andcalls the split function to split the tasks into two or more subtasks. The join processreceives the results of subtasks from the children and calls the join function to combine the results. There is a local worker process on each leaf processor as well as onintermediate processors that receives tasks from the task manager and processes them——4—I TaskDistributionResultForwarding——Chapter 6. Divide-and-Conquer: Design and Modeling 83to completion.As explained in the design of Pfarm, to overlap communication with computation, thecommunication processes and both manager processes execute at high priority, whereasthe worker process executes at low priority. As the split process is in the critical pathof task distribution, it must also execute at high-priority. The join process is also runat high priority as it decreases the response time which is important if there is anydependency among tasks.In an idealized parallel implementation of divide-and-conquer algorithms on treeprocessors, such as those discussed in [HZ83, Col89], intermediate processors executeonly split and join functions. This leads to an inefficient use of intermediate processorsas they idle while waiting for the results. In TrEK, we allow the intermediate processorsto do the processing of tasks in addition to executing split and join functions. This ispossible only if the application consists of either a flow of divide-and-conquer tasks or asingle divide-and-conquer task of large degree.TrEK uses a distributed demand-driven scheduling in which children processors,whenever they have free task buffers, greedily steal subtasks from their parents. Whenan intermediate processor gets a new task, there are two scheduling choices. The firstchoice is to split the task and put the subtasks on the output queue from which childrenprocessors get their tasks. The second choice is to allocate the task for local processing.A task allocated for local processing is executed until completion by recursively solvingsubproblems and joining the results as in the case of uniprocessor execution of divide-and-conquer. At intermediate processors, priority is given to splitting the task andforwarding the subtasks over allocating it for local processing. The scheduling is demanddriven since all the subtasks are stored in a single output queue from which the childrenwith a free task buffer compete for tasks.This task scheduling strategy is similar to the one used in the ZAPP [MS88] system.However, in the case of TrEK, there is an important difference, a processor cannot grabsubtasks from its own output queue. As shown in Section 6.2, this restriction allows usto model the system and does not degrade performance.Chapter 6. Divide-and-Conquer: Design and Modeling 84In TrEK, we initially flood-fill the system with tasks. However, once full, the system isdemand-driven with new tasks entering the system as the current tasks are completed. Asin the case of Pfarm, the advantage of this scheduling strategy is that it opportunisticallytakes advantage of varying loads.As in the Pfarm case, the number of additional task buffers to be allocated is animportant design issue. On a worker processor, each link process in the task distributionpath holds an active task (or subtask) and the local worker process has an active taskthat is being executed. Aside from these active tasks, there is an additional task bufferin the task manager for the reasons explained in the Pfarm case. In addition, eachintermediate processor in TrEK has an output queue that holds the subtasks producedby the split process before forwarding them to the children processors. The outputqueue consists of as many buffers as the number of subtasks produced from a task. Ifthis output queue was not present, the children processors could wait for subtasks whenthe parent is busy doing a split. As in Pfarm, we only restrict the number of taskbuffers, whereas we freely allocate result buffers. As results are also collected, joinedand forwarded at high-priority, at any given time, the number of result buffers on a nodeis small.In order to be topology independent, TrEK should be able to execute divide-and-conquer tasks of any degree on any topology. In TrEK, a parent processor does notpredetermine the child to which a particular subtask is going to be forwarded. Subtasksproduced by splitting a task are put in a single queue, and all the children processorsget their tasks from this queue. Thus, the number of children that execute the subtasksof a particular task depends on the load, and it is possible for all the subtasks of a taskto be forwarded to the same child. The Result manager on the parent node joins theappropriate subresults of a task. Therefore, it is possible to execute tasks of any degreeon any topology. Each TrEK kernel has to be provided with the number of children ofthe processor on which it runs and the degree of task either at runtime or at compiletime.Chapter 6. Divide-and-Conquer: Design and Modeling 856.2 Performance ModelingIn this section, we derive performance models for application programs running withTrEK or any other system that satisfies the following assumptions. The main characteristics of the system are:1. The hardware system is a distributed memory message passing architecture described in Section 3.2 with a linear message cost model.2. There is a flow of fixed degree divide-and-conquer tasks into the system.3. The depth of the tasks is equal to or greater than the depth of the underlyingprocessor topology.4. Tasks originate at a single source and the results are returned to the source.5. Intermediate processors are also allowed to process the tasks in addition to executing split and join.6. Tasks are dynamically distributed to the worker processors.We assume that at each step, tasks are split into subtasks of equal computation.With a fixed degree divide-and-conquer task in which at each step, the work is dividedinto k equal parts, we haveW(n) = split(n) + join(ri) + kW(n/k), (6.1)where W(n) is the total amount of work of a task with an input data size of n, split(n) isthe work of splitting a task of size m and joiri(ri) is the work of joining the correspondingk subresults. Later in Section 7.3.5, we show experimentally that the models derivedwith this assumption also work well for applications in which tasks are split into subtasksof unequal computational requirements.Our objective is to find a distribution of the load to all the processors so as tominimize the overall execution time, where load consists of both the computationalrequirements of the tasks and the associated overheads for splitting, forwarding andChapter 6. Divide-and-Conquer: Design and Modeling 86executing them. As in the processor farm case, the system can be either computationbound or communication bound. When it is computation bound, the system acts as apipeline with three phases to be analyzed: start-up, steady-state and wind-down. In thecase of TrEK, start-up phase ends when all the leaf processors have received at least onesubtask. At the end of the start-up phase, intermediate processors may or may not haveany tasks for local processing, but they will be busy with splitting and forwarding. Thedefinitions of steady-state and wind-down phases are same as that in the processor farmcase, and the total execution time is given by,Ttotai T8u+T38+Td (6.2)First, we derive performance models for the case in which the system is computationbound. In Section 6.2.1, we present a general analytical framework to analyze thesteady-state performance on arbitrary tree topologies. We also derive upper boundsfor start-up and wind-down costs on arbitrary tree topologies. In Section 6.2.2, we usethis analytical approach to derive models for the special case of fixed degree divide-and-conquer computations running on balanced tree topologies. We discuss the limits onperformance for the communication bound case in Section 6.2.3.6.2.1 Arbitrary Tree TopologiesLet T be a tree architecture with processors pi,•• ,PN And let C(i){i I pj is a child of p} denote the children of p in T. Let k be the degree of thedivide-and-conquer tasks to be processed.Let = Te(j) + e, where Te(i) is the time required for processing a subtask locallyby the ith processor and /3e is the associated overhead. Te(i) is given by W(n) inequation (6.1), where n is the input data size of a task that arrives at the ith processor.Let O = T8(i) + Tj(1) + /3f, where T5(i) and Tj(i) are the split and join time at theith processor, and are given by split(n) and join(ri) respectively. /3 is the associatedoverhead for every task split and forwarded to the children processors. /3f includes allthe CPU overheads involved in receiving a task, splitting and forwarding the subtasksto the children processors, receiving the corresponding subresults and joining them, andChapter 6. Divide-and-Conquer: Design and Modeling 87forwarding the result to the parent. Unlike in the processor farm case, /3f is not aconstant because the overheads involved in forwarding subtasks and receiving results isproportional to the number of subtasks. /3f can be expressed asI3f = !3f1 + k/3f2, (6.3)where /3f1 and /3f2 are constants. /3f includes the overheads required to receive a taskfrom the parent and to send the result back, and thus, it is independent of the numberof subtasks. /3f2 is the overhead required for forwarding a subtask to a child processorand to receive the corresponding result. Thus, Oj = T5(i) + Tj(i) + /3i + k/3f2.Steady-state AnalysisThe steady-state phase begins once all the leaf processors have a subtask to executeand ends when the last task enters the system. It is assumed that no processor will beidle during the steady-state. Leaf processors will be busy processing the subtasks, andthe intermediate processors will be busy either splitting the tasks (or subtasks), joiningthe results or processing the tasks (or subtasks). Suppose that M divide-and-conquertasks of degree k are executed during this phase and let V denote the number of tasks(or subtasks) that visit pj. Then, we can express steady-state execution time (T8) interms of the number of tasks executed locally and the number of tasks forwarded alongwith their associated costs. T38 is given by the following equation that holds for all theprocessors in T,T35 = cj(Vj— y)+8j > (6.4)jEC(i) jEC(i)=Vj+(O-cj) (6.5)jC(i)The factor 1/k appears because of the fact that for every k subtasks that are forwardedto the children, there is only one original task, and split and join are done only once forthese k subtasks.This system of equations is similar to that of Pfarm and, for tree topologies, also hasa unique solution. The major differences from the processor farm case are, we have cj’sChapter 6. Divide-and-Conquer: Design and Modeling 88and O’s that vary from one processor to another unlike in the processor farm case, andthe degree (k) of divide-and-conquer tasks appears in these equations. Given M, oj’sand O’s, we can solve for T5 and Vi to VN. An example of this analysis is shown inFigure 6.3.(a) Task Graph (b) Architecture Graph[ v1\ /T88\V21_ITSS1- 3) (8-V3 I - I T8 II V4 I T IQ5 ) V5) \\Tss)T — M1c2i3o4k— 2aaa3 223&1 — 2a123 + 2281&3 + 12a4k+a1a4k—o24&1k— —a2o34kFigure 6.3: An example of the steady-state analysisThis analysis gives the execution time of the steady-state phase in terms of parameters, whose values can be determined prior to the execution. The total number of tasks(M) and the degree of division (k) are usually known. Te(i),Ts(i) and Tj(i) can beestimated or measured experimentally by the techniques described in Section 8.2.1.( a1 — ai) (i — a1)a2 a3The processor overheads e, /3f1 and /3f2 are dependent on the TrEK implementaChapter 6. Divide-and-Conquer: Design and Modeling 89tion and hardware processor characteristics, but are independent of the application andthe hardware topology. Therefore, they need only be determined once for a particularimplementation of TrEK.We have experimentally found that on a fixed topology, a breadth-first spanning treewith maximum number of leaves provides maximum performance. The experiments arediscussed in Section 7.2.Start-up and Wind-down AnalysisAs in the processor farm case, start-up and wind-down analysis depend on the underlyingtopology and the task scheduling strategy. We construct the process graph of the systemby replacing each processor by the process structure given in Figure 6.2. Again, we ignorethe processes that are not in the task forwarding path. The process graph of the topologygiven in Figure 6.4(a) is shown in Figure 6.4(b).Figure 6.4: (a) node graph (b) process graph (c) subtree decompositionOutLinko• Manager• Worker(a) (b) (c)Start-up and wind-down time depends on the number of task buffers in the system.Chapter 6. Divide-and-Conquer: Design and Modeling 90Notice that the tasks in processors at different levels are different in terms of their work.In TrEK, on every intermediate processor, there is one task on each of the followingprocesses: the task receiving InLink, the task manager, and the worker. The splitprocess produces /c subtasks per task, which are put into a single output queue thatcan hold only k subtasks. Each task forwarding OutLink process has a single bufferto hold an outgoing subtask on that particular link. If we count this subtask towardsthe particular child processor to which it is getting forwarded, then every intermediateprocessor has five task buffers. Each leaf processor has four task buffers since the leavesdo not have a split process. Thus, the total number of active tasks at any given time isgiven bywhere m is the number of processors at the ith level when the levels are numbered 1 toD from the root. In case of k-ary divide-and-conquer tasks running on a k-ary D levelbalanced tree, the total number of active tasks at any time is 5D — 1.Start-upStart-up begins when the first task enters the system and ends when all the leaf processorshave at least received one subtask. As explained in Section 6.1, the scheduling strategyin TrEK gives priority to splitting the task and forwarding the subtasks to childrenprocessors over allocating the task for local processing. When the first task entersa processor, the manager passes it on to the split process which splits the task intok subtasks that are kept in a single output queue controlled by the manager. Taskforwarding OutLink processes, when free, receive a subtask from the manager process.For analysis purposes, we use the collapsed process graph like the one shown inFigure 6.4(c). Here, we are interested in obtaining an upper bound for the start-uptime. In an arbitrary tree, start-up time is given by the maximum time taken for a leafto receive its first task. First, we will obtain an expression for the task number of thefirst task received by a processor. Then, we will determine the time required for a leafprocessor to receive its first task. The technique used here is similar to that describedin Chapter 4 for the start-up analysis for the processor farm case.Given a rooted oriented tree T, with children nodes numbered from left to rightChapter 6. Divide-and-Conquer: Design and Modeling 91starting at one, let c(v) be the child number of node v with respect to the parent of v,p(v). Let p(v) denote the nth ancestor of node v in T and let deg(v) be the down-degreeof node v. Let k be the degree of the divide-and-conquer tasks.Tideg(p(v))Definition 2 Let d(v) equal fl0 kLemma 2 For any rooted oriented tree T, the first task received by node v is the(r(i)rc(pn—1(v))1 + k 1 — d__2(pv))task at node pTi(v), the root of T.Proof:Let s(v, i) be the number of the ith task received by node v. The first task to arriveat a child node v is the subtask of F1th task that arrives at the parent node p(v).r deg(p(v))1 tasks arriving at p(v).From then onwards, node v receives a subtask for every kThus, we obtain the following recurrencedeg(p(v)) 1s(v,i)=s(p(v), +(i-1) E kIn general,Ec(pO(v))s(v,i) = s(P1(v),k 1 + (i - 1)do(P(v)))F c(p(v))= s ((p(v)), rci + ( k 1 + (i - 1)do(p(v)) _i) do(p(p(v))))(p2(v)rc(p(v))1 rrc(p(v))1k + U k do(p2(v)) + (i - 1)dO(p1(v))dO(p2(V)))(p2(v)Ec(Pl(v))1+(FC1 d22(pv)) + (i-1)dl(P(v)))=‘ kc(pTi (v)) 72—2s(v,i)= s(PTi(v)E 1+H k i)d2(pv)) + (i -At the root, s(v,i) = i. Thus, when p72(v) is the rootrc(pn—1(v))1n—2 1s(v, 1)= k + k— 1]d72__2(pv)) (6.6)j=O LChapter 6. Divide-and-Conquer: Design and Modeling 92The time required for the task s(v, 1) to arrive at node v is given byT8(v) = s(v, 1) (TCd(D) +T8(D) + + kf2))+ (Td(i) + T8(i) + + kf2)),where the first part of the equation represents the time after which the root node p(v)sends a subtask to p’(v), and Td(D) and T8(D) represent the communication timeneeded to receive a task and the split time respectively at the root node. The second partof the equation gives the time it takes for the node v to receive its first task after nodep(v) starts forwarding the corresponding task to its child. Td(i) and T8(i) representthe communication time needed to receive a task and the split time respectively at nodep (v).Start-up time for any arbitrary tree topology is given byT8 — max {T8(v)}v a leafIn an arbitrary topology, if the down-degree of every node is less than the degree ofdivide-and-conquer tasks (k), then s(v, 1) = 1 for every node. For this case, start-uptime is determined by the longest path in the topology and is given by[TCd(i) + T8(i) + (i + kf2)] (6.7)where n is the length of the longest path. If the topology has nodes with down-degreegreater than k, then s(v, 1) has to be evaluated for every leaf node to calculate thestart-up cost. For this case, s(v, 1) is proportional to the number of buffers present oneach node. If the topology is an unbalanced one, start-up time increases as the numberof buffers increases as in the case of Ffarm. Start-up costs for balanced tree topologiesare discussed in the next section.Wind-downThe wind-down phase begins when the last task enters the system and ends when the lastresult reaches the manager. In comparison to Pfarm, the wind-down analysis of TrEKis complicated by the fact that the computation requirements of tasks at different levelsare different. Tasks at the root processor have maximum computational requirements.Chapter 6. Divide-and-Conquer: Design and Modeling 93Here, we derive an expression for the wind-down time for an arbitrary topology using anestimate for the number of tasks executed by the root processor. In the following section,we derive an upper bound for the more interesting case, k-ary divide-and-conquer taskson k-ary balanced tree topology.The total number of tasks at the beginning of the wind-down phase in a tree architecture is given by the number of active tasks at any given time. As derived in thebeginning of this section, it is given bymj mD= + (6.8)where D is the number of levels in the topology and m is the total number of processorsat ith level.Assume that the processors are all identical and that they all do the same amountof work. Also, we neglect the overhead in forwarding tasks. Given these assumptions,an estimate on the number of tasks executed by the root processor is given byF MM1=---where N is the total number of processors. In any tree architecture, the value of thisvaries from 1 to 4 based on the number of processors.Thus, an estimate for the wind-down time is given byTd Mi(T(D) + /3e)6.2.2 Balanced Tree TopologiesIn this section, we analyze the performance of balanced divide and conquer computationson balanced tree topologies using the general framework. We hypothesize that a g-arybalanced tree topology (where g is the number of links on each node) achieves optimalperformance for divide-and-conquer applications for the following reasons:1. Experimentally, we have found that on a fixed topology, a breadth-first spanningtree with maximum number of leaves provides maximum steady-state performance.Chapter 6. Divide-and-Conquer: Design and Modeling 94As a g-ary balanced tree is a breadth-first spanning tree with maximum numberof leaf nodes among all the topologies with the same number of nodes, it providesmaximum steady-state performance.2. As explained in the previous section, start-up cost is proportional to the length ofthe longest path in the topology. Balanced tree topologies have minimum lengthlongest path among all topologies with the same number of nodes. As a result, bythe analysis given in Section 6.2.1, it also minimizes start-up time.3. Wind-down cost is also proportional to the length of the longest path in the topology. Once again, the minimal path length and the symmetry of the balanced treealso minimizes wind-down time.First, we analyze the steady-state performance of divide-and-conquer tasks of anydegree and depth on balanced tree topologies of any degree and depth. Then, we derivecorresponding start-up and wind-down costs using the analyses given in the previoussection for arbitrary topologies.Steady-state AnalysisConsider a flow of 1 level k-ary divide-and-conquer tasks. Let g be the degree and D thenumber of levels of the balanced tree topology. Notice that the levels are numbered from1 to D starting from the leaves rather than the root since this simplifies the derivationof the recurrence formula.Assuming that there is no idle time, steady state time (T88) can be expressed in termsof the number of tasks processed and the number of tasks split and forwarded along withtheir associated costs and overheads. From the general framework, the steady-stateexecution time is given by equation (6.4), which is reproduced below.T88= (6.9)jC(i) jC(i)where V is the number of tasks (or subtasks) that visit a processor at the ith level, andT(i) + /3e and eu = T(i) + Tj(i) + /3f1 + k/3f2.Let M be the number of divide-and-conquer tasks processed during the steady-statephase. Let f represent the fraction of the total number of tasks (M) processed by allChapter 6. Divide-and-Conquer: Design and Modeling 95the processors at the ith level (assuming that the corresponding splits and the joins forthese tasks are executed by the nodes from levels i + 1 to D). Then, the number ofsubtasks processed by a processor at the ith level is given by(E7) fM,since there are gD_i processors at the ith level, and they have to execute k’3 subtasksto finish a single original task. The number of tasks forwarded by a processor at the ithlevel is given byi—iD_i)fiM.g j=iBy substituting the above in equation (6.9),= () fM+ () (fi) MO.Let=j=1where F represents the fraction of the total number of original tasks (i.e., tasks enteringat the root) executed by all the processors in levels 1 through i.Rewriting T88 in terms of F and= (:::) (F -F1)Ma+ (.)F1M6=(:::)+(::) FM(O -By rearranging the above,= F_1 ( 0i) + ). (6.10)LetMFSi = -i——, (6.11)Chapter 6. Divide-and-Conquer: Design and Modeling 96where S represents the throughput of a subtree consisting of the processors from levels1 to i (again assuming that the corresponding splits and joins were executed at levelsi + 1 to d), and S0 = 0. By substituting (6.11) into (6.10) and solving for Si, we obtain,/ 1s ( ) + k1 ) (6.12)(—:_:T) cE9We are unable to obtain a closed form solution to this recurrence. Therefore, steady-statethroughput of a D level balanced tree (SD) is obtained by recursively evaluating the Sj’sup to level D. In evaluating the Si’s, if Si > 5i+1, then the intermediate processors atlevels i + 1 to D can not split and forward the tasks at the rate at which the processorsin levels 1 to i can process them. The throughput limit corresponding to this case isgiven by equation (6.18).Once we obtain the steady-state throughput using equation (6.12), we can deriveother performance metrics such as steady-state execution time and speedup. Steady-state execution time for M tasks is given byMT88—DSteady-state speedup is given byMaDSFD= QDSD,where c is the execution time plus the associated overhead for each task at the rootprocessor.Start-up AnalysisStart-up time for a balanced tree is derived from the analysis given in Section 6.2.1 forarbitrary tree topologies. In the case of a balanced tree topology, start-up cost is givenby the time required for the last leaf (the rightmost) to receive its first task. For a Dlevel g-ary balanced tree topology, the task number of the first task received by the lastleaf node is given by equation (6.6) with n = D — 1.1) ri + : (Eii — i) dD_3(p2(last)) (6.13)Chapter 6. Divide-and-Conquer: Design and Modeling 97If g k, s(last, 1) = 1. For this case, start-up time is given byD[TCd(i) + T8(i) + + kf2)] (6.14)If g> k, s(last, 1) > 1 and start-up time is given bys(last, 1) [TCd(D) +T8(D) + + kf2)]+[TCd(i) + T8(i) + + kf2)] (6.15)For example, given a 6-level 4-ary tree topology executing binary divide-and-conquertasks,s(last, 1) = 2 + d3_(p+2(last))2+16+8+4+2rr32Wind-down AnalysisWe derive an upper bound on the wind-down time Td for the case of k-ary divide-and-conquer tasks executed on balanced k-ary topologies. The total number of tasks in thesystem at the start of the wind-down phase (Mmd) is 5D — 1, where D is the number oflevels of the topology.We discretize the wind-down phase into a number of steps, where a step is the timeto execute a subtask at a leaf in the tree. Note that each leaf has 4 subtasks and allthe remaining processors have 5 tasks (or subtasks). An upper bound is obtained bydetermining the number of steps required for each processor to contain at most one task.Let us derive the number of steps required to reduce the number of tasks at the rootto one, the task being executed. Consider a D-level tree (D > 2), which consists ofa root and two D — 1 level subtrees. Assume that at the end of a step, tasks at theroot are split and forwarded as far as possible towards the leaves. This is an optimisticassumption since when the tasks are not transferred to the leaves, there is less overheadand the load is better balanced. In particular, at the end of the first step, once theChapter 6. Divide-and-Conquer: Design and Modeling 98leaves have finished a task, it indirectly reduces the number of tasks at the root by one.At the end of the second step, two tasks at the bottom two levels finish, and only twotasks remain at the root. After the third step, the root contains only one task (the taskbeing executed). Thus, after three steps, there are Ic subtrees each with a maximum of(5(D — 1) — 1)/k tasks (this is an upper bound on the total number of tasks as sometasks at intermediate levels in the tree have been partially processed). Note that theroot is still executing a task.By recursively applying the above argument to the roots of each of the Ic subtrees,after 3(D — 2) steps, there remain only 2-level trees. In the trees that remain, the roothas 5 tasks and each leaf has 4 tasks. After one step, the leaves finish one task whichreduces the number of tasks at the root to 4. At the end of the second step, in additionto the leaves, the root also finishes a task reducing the total number of tasks to two.After three steps, only the leaf processors will be left with four complete tasks each.Thus, after 3D steps, each processor has only one task that it is executing or no task atall.The wind-down time (Td) also depends on the time it takes to execute a single taskat the root. Therefore,Td max{(3D + 1)(Te(1) + /3e), (Te(D) + /3€)} (6.16)where Te(1) is the execution time of a subtask at the leaf level and T(D) is the executiontime of a subtask at the root. For larger D (D > 5, small /3), the execution time of asingle task at the root dominates the wind-down time.6.2.3 Communication BoundsThe performance models derived in Sections 6.2.1 and 6.2.2 are applicable only whenthe system is computation bound. In this section, we discuss the performance of thesystem when it is communication bound. In this case, processors may idle as the systemnever reaches steady-state.There are two factors that may cause the system to be communication bound.Chapter 6. Divide-and-Conquer: Design and Modeling 99Case (1)As in the processor farm case, overall performance of the system can be bound by thetransfer costs whenever the data and result sizes are sufficiently large. In a divide-and-conquer task, the data and result sizes generally decrease towards the bottom of thetree. Thus, overall throughput of the system is bound byScorni= T±’(6.17)where T = max{Tcd(D),Tcr(D)}. As in the processor farm case, /3 is the processoroverhead to receive a task from a parent or to send a result to the parent. As the linkprocesses are identical in both Pfarm and TrEK, the value of /3 is also the same.Case (ii)In the second case, CPU time in transferring the tasks and results can limit the overallthroughput. An intermediate processor at the ith level has to incur a CPU cost ofT8(i) + Tj(i) + 13.R + k/3f2 for every task that is split and forwarded. In any divide-and-conquer application, the sum of split and join costs is maximum at the root of thecomputation. Thus, overall throughput of the system is limited by the rate at which theroot processor can split and forward the tasks independent of the number of processors.This bound is given byScorn2= T8(D) + (D) + f1 + kf2(6.18)6.3 DiscussionIn this section, we discuss how the performance models derived in Section 6.2 can beused for performance tuning.6.3.1 Optimal N and TopologyPerformance models can be used to determine the optimal topology and the numberof nodes to be used to obtain maximum performance for a given divide-and-conquerapplication.In Figure 6.6, we plotted throughput as a function In Figure 6.5, we have plottedChapter 6. Divide-and-Conquer: Design and Modeling 100Comm limiti800600Throughput40020020 40 60 80 100 120Number of processors, NFigure 6.5: Plot of throughput curves for Binary Divide-and-Conquer Tasks on BinaryTreethe three throughput equations (6.12), (6.17) and (6.18) for a binary tree topologywith a set of typical parameter values. The optimal number of processors is given by theintersection of the equations (6.12) and (6.17) or (6.18) which ever leads to the minimumthroughput. Beyond this optimal value, there will be no increase in the performancewith an increase in N.In Figure 6.6, we plotted throughput as a function of N for balanced binary andternary tree topologies executing 4-ary divide-and-conquer tasks. As mentioned in theSection 6.1, TrEK is topology independent and hence can be used to run any k-arydivide-and-conquer computation on any topology. As expected, for any particular N, aternary tree topology achieves better throughput than the binary tree case. In general,it is always better to use a g-ary tree topology, where g is the maximum number of linksavailable on each node. If the number of nodes available is less than the optimal fora given application, and it does not lead to a complete g-ary tree, it is better to usea topology in which all the levels, except the last one, are complete g-ary tree and theremaining nodes are balanced in the last level.Chapter 6. Divide-and-Conquer: Design and Modeling 101Throughput6042Figure 6.6: Comparison of divide-and-conquer throughput ontree topologiesComm limitiTernary TreeBinary TreeComm Iimit2binary tree and ternaryAs in Pfarm case, if the number of nodes available is larger than the optimal numberof nodes to be used for a given application, one case use multiple g-ary trees to increasethe overall performance.6.3.2 Problem ScalingAs in the case of Pfarm, the most effective way to scale the problem is by increasingthe granularity of the tasks. In Figure 6.7, we have plotted the steady-state speedupfor a binary tree topology for different values of Te(D). As shown in Figure 6.7, thesteady-state speedup increases with increasing values of Te(D) as long as the systemdoes not reach any of the two communication bounds.6.4 Chapter SummaryIn this chapter, we have described the design of TrEK, a runtime kernel for executingdivide-and-conquer applications. We described how the Pfarm design was modified and80Number of processors, NChapter 6. Divide-and-Conquer: Design and Modeling 102Figure 6.7: Measured speedup for divide-and-conquer on binary treeextended for TrEK. We developed a general analytical framework that can be used toanalyze performance of divide-and-conquer applications using TrEK. This frameworkwas used to derive performance models for fixed degree divide-and-conquer problem onbalanced tree topologies. In the next chapter we describe our experimental results.10080Speedup6040Te =5 msTe = 2 msTe = 1 ms20Number of processors, NChapter 7Divide-and-Conquer:ExperimentsThe performance models derived in Chapter 6 for divide-and-conquer applications wereexperimentally validated using TrEK implemented in C on Logical Systems environment.The application program used in the validation experiments consists of a set of divide-and-conquer tasks with a synthetic workload. As in the Pfarm case, the applicationprogram executes empty loops corresponding to split, join and compute functions. Thenumber of iterations of the empty loops determine the values of T8, Tj and Te used in aparticular experiment. In this chapter, each divide-and-conquer task is represented bythe following parameters: degree (k), number of levels (1), base case computation time(Te), split time (T5i)) and join time (Tj(i)). The base case computation time representsthe time needed for solving a leaf subtask of a divide-and-conquer task.The experiments for determining the system overhead parameters are described inSection 7.1. In Section 7.2, we validate the performance models for arbitrary tree topologies. Performance models for balanced tree topologies are validated in Section 7.3.It is often possible to execute a divide-and-conquer application using processor farmparadigm. In Section 7.4, we compare the performance of Pfarm and TrEK.103Chapter 7. Divide-and-Conquer: Experiments 1047.1 Determining System OverheadsFor experimental validation purposes, it is necessary to determine the values of thesystem overhead parameters, /3e and /3f• As explained in Section 6.2, /3 is the processoroverhead to execute a task locally and /3j’ is the processor overhead for every task thatis split and forwarded to the children, its value depends on the degree of the divide-and-conquer tasks. As defined in Chapter 6, /3f = /3f1 + k/312, where k is the degree ofdivide-and-conquer tasks. The overhead parameters, e, /3.f 1 and /3f2 are constants asthey correspond to the software costs to execute particular parts of the TrEK program.Also, these values do not depend on the topology being used. This is due to the factthat subtasks are put into a single output queue. As a result, the overhead to forwarda task is a function of the cost of adding and deleting from a queue, both constant timeoperations. In addition, values of these overheads do not depend on the characteristics ofthe application program. However, they are dependent on the implementation of TrEK,and the underlying processor characteristics. Thus, values of these parameters have tobe determined only once for a particular implementation of TrEK. Once these valueshave been determined, it is possible to predict the performance of an application thatfits the model.The overheads e, /3fi and 13f2 are determined by conducting several experiments onsimple configurations shown in Figure 7.1.The value of /3e is determined by solving for /3e in the expressionTtotai = M(Te+/3e), (7.1)where Ttotai is the execution time for the configuration shown in Figure 7.1(a). Experiments were run on configuration 7.1(a) with Te = 5, 10, 20 and 40 ms and a largeM = 10000. By using M, Te and measured Ttotai, one can solve for /3e. As expected,for different Te’5, /3e remained constant, varying by less than 3 s. Changing M had noeffect on the value of /3e The average value of /3e was 560 its.To determine the values of /3f1 and /3f2, experiments were conducted on the configurations shown in Figure 7.1(b) and (c). For these configurations, total execution timecan be expressed in terms of the number of tasks processed (M1) and forwarded (M2) byChapter 7. Divide-and- Conquer: Experiments 105ManagerWorker(a)Figure 7.1: Configurations for determining /3e and /3fWorkerl. Workerl spends Te+/3e for every task locally processed, and T+Tj+/3fl+k/3f2for every task that is split and forwarded. On the configurations in Figure 7.1(b) and(c), experiments were run with divide-and-conquer tasks of degree (k) 2 and 3, respectively, with the number of levels (ci) equal to 2. By choosing a sufficiently large M, wecan neglect the effect of start-up and wind-down. For sufficiently large M, the totalexecution time for configuration in Figure 7.1(b) is given byTtotai M1(T€ + i3) + M2(T8 + Tj + f1 +2/3f2), (7.2)and for configuration in Figure 7.1(c), it is given byTtotai Mi(Te + 3) + M2(T + Tj + !3f1 + 3!f2) (7.3)For each configuration, we ran several experiments with base case Te equal to 5, 10,20 and 40 ms. In all experiments, the split and join time was held constant at 1 ms,and M 10000. For each experiment, we measured the Ttotai, and M1 and M2. Thesevalues were then used in equations (7.2) and (7.3) to obtain the values of /3f and /3f2The variation observed between the values obtained for /3f1 and /3f2 for different valuesof Te was within 5 ps. The average values obtained for /3f 1 and /f2 were 520 s and 420is, respectively.(b) (c)Chapter 7. Divide-and- Conquer: Experiments 1067.2 Arbitrary TopologiesIn this section, we validate the general analytical framework described in Section 6.2.1and show, experimentally, that on a fixed topology, a breadth-first spanning tree (BFST)with maximum number of leaves outperforms other BFSTs. As described in Chapter6, by splitting a task, we can increase the amount of parallelism and make better useof the available parallelism in the hardware, but every split increases the total amountof work because of its associated overhead. In the case of a flow of divide-and-conquertasks, ignoring start-up, it is better to reduce the number of splits since the applicationconsists of a number of divide-and-conquer tasks. Thus, on a fixed topology, a BFSTthat does the minimum number of splits obtains the best performance since the overalloverhead in this case is smaller compared to other BFSTs. Since splits have to occur atinternal nodes, a BFST with a minimum number of internal nodes or maximum numberof leaves does minimum number of splits. Experiments were conducted on three differentbreadth-first spanning trees (shown in Figure 7.2) of an 8 x 3 mesh topology.Measured Total Execution TimeTe BFST1 BFST2 BFST3(3 leaves) (16 leaves) (9 leaves)0.001 101.675 70.922 90.8890.002 130.038 92.097 120.6340.003 188.199 113.763 159.9480.004 208.821 135.700 186.0490.005 229.898 157.455 211.225Table 7.1: Performance Comparison of three different BFSTs of the 8 x 3 mesh.Table 7.1 shows the measured execution time with the percentage error from thecorresponding predicted execution time for three breadth-first spanning trees of the8 x 3 mesh. In the table, times are given in seconds. For each case, a total of 1000binary divide-and-conquer tasks with 10 levels were used. The value of T shown in thetable is the base case computation time. A value of 1 ms was used for split and joincosts at each level. Tasks of 10 levels were chosen for these experiments because thenumber of levels of tasks has to equal or exceed the number of levels of the topology,Chapter 7. Divide-and- Conquer: Experiments 107BFST1 BFST3Figure 7.2: Three breadth-first spanning trees of the 8 x 3 mesh.9 in this example. In order to achieve steady state, the computation time of a subtaskat any node has to be greater than the sum of split and join times at the parent nodeplus the associated overhead /3f. Therefore, for the values of split and join times chosenin these experiments, base case Te must be at least 1 ms. As Table 7.1 shows, themeasured execution time for BFST2 is small compared to the other two BFSTs for allcases. Experimentally, this supports our claim that on a fixed topology, a BFST withmaximum number of leaves provides better performance compared to other BFSTs.7.3 Balanced Tree TopologiesIn this section, we experimentally validate the performance models for TrEK, derived inSection 6.2.2 for balanced tree topologies. The experiments were conducted to test themodels for steady-state, start-up and wind-down for k-ary divide-and-conquer tasks ong-ary balanced tree topology, variable split and join costs, and communication bound.The parameter values were chosen to satisfy the following conditions:BFST2Chapter 7. Divide-and-Conquer: Experiments 1081. the number of levels of tasks has to be equal to or greater than the number oflevels of the topology.2. the computation time of a subtask at any node has to be greater than the sum ofsplit and join time at the parent node plus the associated overhead /3.7.3.1 Steady-StateTable 7.2 gives the range of experiments conducted to validate the steady-state model.Steady state performance models were validated by experiments with a sufficiently largeModel Topology N M — — Taskski Te TTjSteady-state Binary tree 1 to 63 10000 2 6 5,10,20 1 17 1,2,5 1 1Ternary tree 1 to 40 10000 3 4 5,10,20 1 15 1,2,5 1 1Table 7.2: Range of Divide-and-Conquer steady-state ExperimentsM(10000) so that start-up and wind-down time can be ignored. For all experiments, avalue of 1 ms was used for both T and Tj at each level. Experiments with variable splitand join costs are described separately. A value of 1 ms was chosen for both split andjoin costs as larger values either limit the overall throughput (also discussed later) orrequire very large divide-and-conquer tasks to be in steady-state.Tables 7.3 and 7.4 show the percentage difference in predicted and measured execution time for binary and ternary tree cases respectively. In these experiments, divideand-collquer tasks of 7 and 4 levels were used on binary and ternary tree respectively.As can be observed from the tables, the errors are within 7%.7.3.2 Start-up and Wind-downIn the case of k-ary divide-and-conquer tasks running on k-ary balanced tree topologies,there will be 5D — 1 tasks in the system at any given time, where D is the number oflevels of the hardware topology. Start-up and wind-down models were validated withCCOI—COI’CR—4COIccCO--4COcp0000CR—ccc.1CR00CRCDCDECDI CR cc 00CD Cl) CD CD C-) CD C Cl) C CD C CD CCRCO -4 CR CR CO cc CR CR CO 00 -4 -1CD C-)CDCDCDIICR-,Cl)CJ)CDC><CD C-)CDECDCD>II._]Cl)CDCOCD- Cizccc-COic-zCOcc—30000-ccCO cc C)-4 C)-4 CO Ibi CD Cl) CD CD C-) CD C C)) C Cp C) C CDCCD C)CDCD- CIIIIIICOCRC)©00CRC)-1 CO00.D-4COCO O-CO00CRC)CRcC)CRCR00ccCRC-c00COCOcc-CO00CRccC)CRccccc©ccCO-I-C)IIIII©C)C)O:-4I-CChapter 7. Divide-and-Conquer: Experiments 110M 5D — 1 so that the system never reaches steady-state and the total execution timeconsists only of start-up and wind-down.Tables 7.5 and 7.6 show the percentage error between the predicted and measuredtotal execution time for these experiments. In these experiments, the values used for thebase case computation (Te) were 10 and 20 ms. The errors observed are large, especiallyfor larger topologies because the predicted wind-down costs are upper bounds. We havecalculated the wind-down cost based on the number of tasks that are predicted to havebeen executed on the root processor. For larger topologies, the model uses an upperbound of two tasks on root processor. If in the actual execution, the root processorexecutes only one task, then the error can be as large as 40%, because there are only afew large tasks and each task contributes considerably to the total execution time.T10ms Te=2OmsN Upper Bound Measured % Error Upper Bound Measured %ErrorExec Time Exec Time Exec Time Exec Time3 1.546 1.338 13.45 2.826 2.459 12.997 1.171 0.954 18.53 2.132 1.755 17.6815 0.795 0.573 27.92 1.435 1.053 26.6231 0.803 0.461 42.59 1.444 0.781 46.2063 0.812 0.477 41.25 1.452 0.781 46.21Table 7.5: Start-up and Wind-down Performance Comparison for Divide-and-Conquerrunning on Binary Tree.TelOms Te=r20msN Upper Bound Measured % Error Upper Bound Measured %ErrorExec Time Exec Time Exec Time Exec Time4 1.201 0.793 33.97 1.717 1.513 11.8813 0.913 0.635 30.45 1.163 0.695 40.2440 0.634 0.365 42.43 1.174 0.635 45.91Table 7.6: Start-up and Wind-down Performance Comparison for Divide-and-Conquerrunning on Ternary Tree.As the divide-and-conquer tasks are generally large, it is important to validate themodels for the cases in which the total number of tasks is not very large. Thus, weChapter 7. Divide-and-Conquer: Experiments 111conducted several experiments with M 1000. Tables 7.7 and 7.8 show the percentageerror between the predicted and measured total execution time for these experiments onbinary and ternary tree topologies. As Tables 7.7 and 7.8 show, the errors are within7%.Te = 5 ms Te = 10 msN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 222.674 222.694 0.000 382.754 382.786 0.0003 74.728 74.788 -0.080 128.013 128.325 -0.1447 32.576 32.436 0.429 55.236 55.358 -0.22115 15.641 15.581 0.384 26.406 26.193 0.80631 8.165 7.911 3.110 13.487 13.086 2.97363 4.708 4.693 0.319 7.407 7.372 0.472Table 7.7: Comparison of Predicted and Measured Total Execution Time for Divide-and-Conquer running on Binary Tree with M 1000.Te=lms TeSmsN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 161.640 161.662 -0.013 296.708 296.740 -0.0104 41.053 41.003 0.122 74.883 74.758 0.16713 13.294 13.308 -0.105 23.779 24.811 -4.34040 4.996 4.990 0.120 8.365 8.613 -2.965Table 7.8: Comparison of Predicted and Measured Total Execution Time for Divideand-Conquer running on Ternary Tree with M =1000.7.3.3 k-ary Tasks on g-ary Balanced TopologiesThe experiments discussed in the previous sections tested binary and ternary divideand-conquer tasks which exactly match the underlying topologies. In order to validatethe models for cases in which the task structures does not match the underlying topologies, we conducted experiments in which binary divide-and-conquer tasks were run onternary tree topologies. Table 7.9 shows the percentage error between the predicted andChapter 7. Divide-and-Conquer: Experiments 112measured total execution time for these experiments. The errors are within 7%.TelOms Te=2OmsN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time4 96.351 96.292 0.061 176.630 176.436 0.11013 30.165 31.422 -4.167 55.086 56.077 -1.78040 10.458 10.084 3.576 18.945 19.678 -3.869Table 7.9: Comparison of Predicted and Measured Total Execution Time for BinaryDivide-and-Conquer tasks running on Ternary TreeWe claim in Chapter 6 that the models can also be used for single divide-and-conquerproblem with large degree and depth. To substantiate this claim, several experimentswere run, each with a single large divide-and-conquer task.In the first example, a 5-ary 9-level divide-and-conquer task with base case computation time of 5 ms was run on a 40-node ternary tree. For prediction purposes, weevaluated the throughput excluding the root worker because in TrEK, with a single task,the root worker does not process any subtasks as it splits the task and forwards all thesubtasks to its children. The model predicted a total execution time of 275.581 secondswhereas the TrEK execution took 316.478 seconds. As a second example, a 6-ary 8-leveldivide-and-conquer task with base case computation time of 5 ms was run on a 40-nodeternary tree. For this case, the model predicted a total execution time of 232.683 ms,and the TrEK execution took 252.117 ms. The errors are larger than 7% because theprocessors closer to the manager (especially the root) can idle as the number of subtasksarriving at these processors is small.7.3.4 Variable Split and Join CostsIn all the previous experiments, split and join costs were kept constant at all levels of thecomputation. We conducted several experiments to verify the validity of the models forthe cases in which the split and join costs are different at different levels of computation.Results of two sets of experiments are tabulated in Tables 7.10. These experimentswere conducted with 1000 binary divide-and-conquer tasks of 6 levels with base caseChapter 7. Divide-and- Conquer: Experiments 113computation of 10 ms on a binary tree.In the first set of experiments, a variable split cost was used where as the join costwas kept constant at all levels. The split cost was varied from 1 to 5 ms, and a valueof 1 ms was used for the join cost. The second set of experiments were conducted withvariable join costs (varying from 0.5 to 2.5 ms) with a constant value of 1 ms for splitcost. Both the split and join costs were varied in the third set of experiments. Onceagain, the errors observed in these experiments are within 79bo.Variable Split & Constant Join Constant Split & Variable JoinN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 408.764 408.923 -0.038 369.745 369.889 -0.0383 136.688 136.980 -0.217 123.679 124.019 -0.2757 58.962 59.107 -0.246 53.381 53.587 -0.38615 28.185 28.011 0.617 25.550 25.483 0.26231 13.989 14.020 -0.222 12.708 12.844 -1.07063 7.852 8.250 -5.069 6.830 7.151 -4.700Variable Split & JoinN Predicted Measured %ErrorExec Time Exec Time1 434.777 434.854 -0.0183 145.363 145.736 -0.2577 62.688 62.851 -0.26015 30.014 29.774 0.80031 14.939 15.129 -1.27263 11.363 11.797 -3.819Table 7.10: Comparison of Predicted and Measured Total Execution Time for Divideand-Conquer Tasks with Variable Split & Join CostsWhen split and join costs are large, the system performance is bound by the rate atwhich the root processor can split the tasks and join the results and throughput is givenby equation (6.18). Several experiments were conducted to test the performance of thesystem for this case. Table 7.11 shows the predicted and measured execution time fora set of experiments in which 1000 binary divide-and-conquer tasks of 6 levels were runChapter 7. Divide-and- Conquer: Experiments 114on binary trees. The split and join costs were varied from 1 to 5 ms, with base casecomputation being 10 ms.Variable Split & JoinN Predicted Measured % ErrorExec Time Exec Time1 548.834 548.930 -0.0173 183.388 183.769 -0.2087 79.045 79.213 -0.21615 37.815 37.434 1.00831 21.370 24.534 -14.80663 21.370 27.656 -29.415Table 7.11: Comparison of Predicted and Measured Total Execution Time for BinaryDivide-and-Conquer Under Split and Join BoundTable 7.11 shows that the system performance reaches the limit for a 31 node binary tree topology. For larger topologies, system performance degrades as the dynamicdemand-driven scheduling strategy keeps forwarding tasks towards the leaf nodes causing the nodes closer to the manager to idle. This phenomenon is similar to that observedin the processor farm case and shows the importance of determining the right numberof nodes and topology to be used for a given application.7.3.5 RobustnessIn all the experiments discussed above, it is assumed that the tasks are split into subtasksthat take equal amount of computational time. In general, tasks may not always bedivided into equal sized tasks. To test the robustness of using the average value of Te, weexperimented with tasks which were split into k randomly unequal subtasks. Table 7.12show the percentage error between the predicted and measured total execution time forexperiments in which binary divide-and-conquer tasks were run on binary tree topologies.Two sets of experiments were conducted with tasks in which the sum of all the base casecomputations per task were 256 and 512 ms. It was assumed that the split and joincosts are proportional to the computational requirements of the task. The percentageChapter 7. Divide-and- Conquer: Experiments 115errors in this case are slightly larger, but are still around 10%. The measured executiontime is always larger than the predicted execution time because some of the subtasksare smaller than the overhead required to forward them, causing additional work.Te256ms T=512msN Predicted Measured % Error Predicted Measured %ErrorExec Time Exec Time Exec Time Exec Time1 297.668 297.696 -0.009 605.037 605.123 -0.0143 99.658 103.279 -3.634 202.149 209.926 -3.8477 43.091 45.878 -6.465 87.061 92.855 -6.65515 20.713 22.156 -6.966 41.488 45.067 -8.62631 10.391 11.162 -7.420 20.420 22.458 -9.98063 5.655 6.217 -9.938 10.747 11.834 -10.114Table 7.12: Comparison of Predicted and Measured Total Execution Time for BinaryDivide-and-Conquer Tasks with Subtasks of Unequal Size7.4 Comparison of Divide-and-Conquer with ProcessorFarmAs both divide-and-conquer and processor farm are task-oriented paradigms, it is possible to execute divide-and-conquer applications using the processor farm paradigm.Performance of divide-and-conquer applications executed with Pfarm compared to thatusing TrEK depends on the values of the application parameters such as the executiontime per task and the total number of tasks. In the following paragraphs, we comparethe performance of TrEK and Pfarm for various values of the parameters. In Table 7.13,we have tabulated the measured total execution time for binary divide-and-conquer applications executed with TrEK and Pfarm on a binary tree topology.Divide-and-Conquer strategy performs better compared to processor farm for applications with larger computation time per task. Computation time per task includes allthe split and join times in addition to the time required for processing all the subtasks.As shown in Table 7.13, the total execution time taken by TrEK is smaller than thattaken by Pfarm when the computation time per task is greater than 0.766 seconds. Thiscut-off value depends on the values of the various parameters of the system. This canChapter 7. Divide-and-Conquer: Experiments 116Mz== 1000 M100Te (D) TrEK Pfarm TrEK Pfarm7.169 117.980 136.315 14.670 21.5313.072 49.792 58.425 6.475 9.2331.535 25.175 29.204 3.403 4.6200.766 13.251 14.585 1.863 2.3110.190 4.541 3.634 0.632 0.5820.126 4.769 2.418 0.470 0.390Table 7.13: Comparison of Total Execution Time for Binary Divide-and-Conquer Applications with TrEK and Pfarmbe obtained by using the models to calculate and compare the performance of the application using Pfarm and TrEK. The percentage difference in the total execution timeis large when the M is small (see Table 7.13, M 100). Processor farm takes longerfor these cases because the wind-down phase is longer in Pfarm compared to that ofTrEK. In Pfarm, the wind-down phase begins when there are 4N tasks left in the system compared to only 5D — 1 in TrEK, where N is the total number of processors andD is the number of levels of the hardware topology. As the affect of wind-down becomesconsiderable for applications with fewer tasks, the difference in the total execution timebetween Pfarm and TrEK increases.From the table, it is evident that Pfarm works better for applications in which thecomputation time per task is smaller (less than 0.190 seconds). This is because theoverheads involved in splitting and joining in TrEK (overheads not present in Pfarm)become considerable compared to the execution time per task leading to a larger totalexecution time compared to that of Pfarm. Also, TrEK can not be used for applicationswith small computation time per task (e.g., smaller than 0.126 seconds for a 64-nodebinary tree topology). This is because the number of levels of the tasks should begreater than the number of levels of the topology, and the base case computation shouldbe greater than the overhead 13f to make use of all the processors effectively.The only way to use Pfarm for executing a single divide-and-conquer problem isfor the manager to split the task to obtain a sufficiently large number of independentChapter 7. Divide-and-Conquer: Experiments 117subtasks. However, since the manager must do a large number of splits and joins, it canquite easily become the bottleneck. In contrast, since in TrEK, the internal nodes dosome of the splitting and joilling, it is not necessary for the manager to perform as manysplits. Note that the manager still must do some splits since otherwise the nodes nearthe root may idle.7.5 Chapter SummaryPerformance models for divide-and-conquer applications derived in Chapter 6 have beenexperimentally validated using TrEK implementation on a 75-node transputer-basedmulticomputer. We have described the experiments conducted to determine the valuesof the system overhead parameters, /3e, /3f1 and /3f2. For a fixed topology, we have experimentally shown that it is better to use a breadth-first spanning tree with a maximumnumber of leaves. We have validated the models for balanced tree topologies with alarge number of experiments varying the values of all the parameters that affect overallperformance. As processor farm strategy can be used for some divide-and-conquer applications, we have discussed and compared the performance of using TrEK and Pfarmfor various cases.Chapter 8System Integration andApplicationsIn order to make programming and performance tuning easier, users have to be providedwith an integrated environment that includes tools that support all phases of programdevelopment and execution, in addition to runtime systems such as Pfarm and TrEK.In Section 8.1, we briefly describe Parsec, an integrated programming environment thatprovides Pfarm and TrEK with supporting tools such as a graphical interface, mapper,loader and debugger on our transputer-based system. We discuss how this integrated environment supports reusability, reconfigurability and performance tuning. In section 8.2,we discuss the techniques that can be used to obtain the values of application dependentparameters in order to use the models for performance tuning. Finally, we describe twoapplications that have been developed using Pfarm and TrEK.8.1 Parsec: An Integrated Programming EnvironmentPfarm and TrEK provide programming templates to efficiently execute applications thatfit into processor farm and divide-and-conquer paradigms, respectively. In order tomake it easier for application programmers to use these templates on a multicomputersystem, it is important to provide a programming environment that supports all phasesof program development and execution. Such a programming environment should notonly have a variety of tools that help programmers in developing, executing, debugging118Chapter 8. System Integration and Applications 119and tuning a parallel program, but must support their cooperative functioning throughclose integration. The following facilities should be present in an integrated programmingenvironment to effectively support Pfarm and TrEK on a multicomputer:• an interface that hides both the system hardware and software complexities,• support for reusability,• support for easy reconfigurability of the system,• support for loading and executing of programs,• performance monitoring and tuning facilities based on the performance models,and• program debugging tools.Initial work in developing an integrated programming environment that addressesthe above requirements on a large transputer-based system has been reported in[CGJ91, FSWC92]. Farsec is an on-going project at UBC to support creating templates or applications, and includes tools for building, mapping and loading the programonto the system. Ffarm and TrEK have influenced the design of Parsec. In order tomake performance tuning easier for applications using Ffarm or TrEK, Parsec supportsparameterized process graphs. A parameterized process graph is a family of interconnection networks with one or more parameters that control structural properties. Inaddition, Parsec allows users to change these parameters in an easier way. Thus, a usercan easily run an application on its optimal N and topology, once they are determinedfrom the models.In Parsec, a “template implementor” describes a template in terms of a parameterizedprocess structure which is then turned into a system module. Users of a template do nothave to understand the details of its implementation. They simply instantiate a copy ofthe template and provide any necessary parameters and code. Parsec creates all the files(makefiles, configuration files, and load scripts) necessary for running the application.Chapter 8. System Integration and Applications 120Pfarm and TrEK templates have been incorporated into Parsec and they make useof the parameterized graph structure to simplify scaling and restructuring of the system.Within this programming environment, currently oniy Trollius is available to the programmers. The following discussion focuses on how the various tools in the environmentsupport programming of applications that use the Pfarm template under Parsec.In Parsec, programmers are provided with an easy-to-use graphical interface to Pfarmand TrEK, and to the system in general. The interface, developed by Feldcamp [FW93]is an X windows application utilizing the OpenLook GUI. Figure 8.1 shows the graphical interface provided to application programmers when Pfarm template is selected.Programmers can easily modify the template by including the files that contain theapplication dependent code. Parsec supports system reconfigurability in an easier waythrough the graphical interface. The user can change the parameters (such as degreeand depth) that define the topology to be used for executing an application with Pfarm.Then, the programmer can build the object files needed to execute the application by using the makefiles generated by Parsec. These makefiles remove the concerns of choosingthe right compiler and libraries from the user. Users can easily include any additionallibraries, if necessary. To execute a Pfarm application, two different object codes haveto be built, one for the manager and another for all the workers.After choosing the topology to be used, the user must map this topology onto the 75-node transputer based hardware system. The mapping tool [Mu193] inputs a descriptionof the hardware, processors and crossbars, and outputs a crossbar setting, a process toprocessor assignment, and a configuration file. The mapping tool uses a greedy algorithmto do the mapping. In the case of Pfarm and TrEK, a different notion of mapping isneeded. Pfarm and TrEK need one-to-one mapping of workers on to the processors,without any dilation. Also, on a fixed interconnection network, the mapper should beable to map the workers onto a breadth-first spanning tree.Parsec includes a loader tool {Mu193] that builds the network configuration file basedon the mapping obtained by the mapping tool. In addition, the loader generates a scriptthat is used to execute the application program. This script includes the Trollius commands to boot the network and to load the appropriate programs onto the transputerChapter 8. System Integration and Applications 121Interface [:Parameters r Dc re eFigure 8.1: Graphical Interface to Pfarm in ParsecChapter 8. System Integration and Applications 122nodes. Users execute the application program by running this load script. The loader allows users to choose either the network or physical level of communication. The networklevel is slower compared to the physical level, but unlike the physical level, users are ableto print from any node and to monitor the state of the processes on each node. Thisallows users to choose network level during the program development and debuggingphases, and then use physical level to obtain better performance.8.2 Performance TuningIn this section, we describe how the programmer can use models for performance prediction and tuning.8.2.1 Parameter MeasurementsIn order to use the models for performance prediction and tuning, one has to determinethe input parameters to the model. The user must supply the values for all the parameters other than the system overhead parameters (/3e and /‘3f in case of Pfarm, and e,/3f1 and /3f2 in case of TrEK). The values of these overhead parameters are obtainedonce using the techniques described in Sections 5.1 and 7.1 for Pfarm and TrEK respectively. Here, we explain the techniques that are useful in determining the values of theapplication dependent parameters.Processor FarmIn the processor farm case, the application dependent parameters that affect the overallperformance are: the average execution time per task(Te), the total number of tasks(M), the data size (d) and result size (r) per task. An application programmer generallyknows the values of M, d and r for an implementation, otherwise these values can beeasily obtained. Obtaining the value of Te is not so straightforward as it depends on thenature of the application program in addition to the implementation. Here, we brieflydescribe the different techniques that can be used to obtain Te.Chapter 8. System Integration and Applications 1231. In the case of application programs that consist of tasks, each of which require thesame amount of computation, the easiest way to measure Te is to scale down theprogram to a single task or a small number and execute it with Pfa’rm on a singleworker node. Then, Te can be obtained from the expression Ttotai = M(Te + ISe)using the measured total execution time and the value of ,6.This technique can also be used for application programs that consist of tasks withvaried computation requirements. In this case, one can determine average Te byfinding 72 for a representative set of tasks. If it is difficult to choose right set oftasks, then, the third technique can be used.If the application program consists of multiple phases, where all the tasks in aphase belong to the same type, then as explained in the robustness section 5.4, itis necessary to calculate a separate Te for each phase.2. If there is a direct relationship between the input data size of a task and itscomputation requirement, then, by determining Te for a certain data size, one canestimate T for a new data size based on the relation.3. If Te cannot be obtained by either of the previous techniques, then it can bedetermined by executing Pfarm on a smaller number of processors and using themodel (with known N and T) to calculate If the performance obtained is notequal to that of the communication bounds, then, models can be used to obtainthe average value of T by plugging in the total number of tasks and the measuredtotal execution time.Divide-and- ConquerIn the case of TrEK, the application dependent parameters that affect the performanceare: the execution, split and join times (T(i),T3(i), andTj(i)) at each level of the divideand-conquer task, the total number of tasks (M), the data size (d), and the result size(r) per task. As in the Pfarm case, an application programmer generally knows thevalues of M, d and r for an implementation. Obtaining the values of the execution, splitand join times is not so straightforward.Chapter 8. System Integration and Applications 124As explained in Section 6.2, the computational requirement of a fixed degree divide-and-conquer task (or subtask) with an input data size of n can be expressed asW(n) split(ri) + join(n) + kW(n/k), (8.1)where split(m) is the splitting cost for a task with size n, join(n) is the joining costto produce a result of size n and k is the degree of the divide-and-conquer tasks to beprocessed. Te(i),Ts(i) and Tj(i) are given by W(n), split(m) and join(n), respectively,depending on the data size (n) of the tasks at the ith level. Thus, in order to usethe performance models, one has to determine W(n), split(ri) and join(n) for the givenapplication program.In general, W(n) can be expressed by the time complexity of the algorithm such as(9(nlogm) or O(m2 logn). Thus, we have to find the constants underlying these timecomplexities to get the value of W(m), the time needed to execute the task on a processor.This can be determined by executing a scaled down version of the problem as in the firsttechnique outlined for measuring Te in the processor farm case. This experiment canbe repeated with few different input sizes to verify the values of the constants. Similarmeasurements can be used to obtain the values for the constants to be used for split andjoin times.8.2.2 Performance Analysis LibraryTo make it easier to use the models for prediction and tuning, application programmersare provided with a set of performance analysis library functions for both Pfarm andTrEK. These functions accept the values of application dependent parameters and outputthe predicted performance metrics such as throughput, speedup and total execution time.8.3 User InterfacePfarm and TrEK were designed to hide the underlying complexities of the multicomputersystem from the user. All the system dependent code is in the execution kernels and theuser has to concentrate oniy on the application dependent code. The kernel can be usedfor application programs that fit the corresponding parallel programming paradigms.Chapter 8. System Integration and Applications 125Both Pfarm and TrEK are run time kernels where the user code is linked with thesystem code to produce a single executable object for each processor node. In themanager, the user invokes the system routines to submit tasks and receive back results.In the workers, program control lies within the system code and the system invokes theuser code at the appropriate times.8.3.1 PfarmIn the case of Pfarm, there are two different executables, one for the manager nodeand the other for the worker nodes. The user part of the manager code consists of thefollowing functions:1. master_mit 0 - The system code calls this function at the beginning of the execution. This function consists of the initialization part of the user code, such asreading data from an input file, etc.. Also, if there is any global data to be broadcast to all the worker nodes, the user code can initiate the system call, bc_send()to do the broadcast.2. data_generator() - This function consists of the part of the user code that generates the tasks. In real-time applications, it could receive data from some device andgenerate the corresponding tasks. The tasks are passed to Pfarm for processingby the system call do_task 0.3. result_receiver() - This function consists of the part of the user code thatcollects the results. The system call get_result 0 returns the next available result.This function could also include any processing of the results.Both data_generator() and resultreceiver() are called by low priority systemprocesses. These functions are called in the beginning of the execution after creating allthe processes and are run concurrently until they finish their respective jobs. In the caseof applications in which later tasks depend on the results of the initial tasks, the userprogram could consist of only one function, data_generator 0. This function performsboth the system calls, do_task 0 and get_result 0.Chapter 8. System Integration and Applications 126•The user part of the worker code consists of the following two functions:• s].ave_init() - This function consists of any initialization part of the user codeneeded on each worker node. Also, if there is any broadcast of the global data,this function can do the system call, bc_receive() to receive the broadcast data.• comp_fn()- This function contains the user code to process a task. It is called bythe worker process and takes a pointer to the task data and returns a pointer tothe result and the size of the result.Ffarm provides the following system calls:1. do_task(task_type, task_size, task_ptr)2. result_ptr get_resu].t()3. bc_send(bc_dat a_size, bc_data_ptr)4. bc_data_ptr bc_receive()8.3.2 TrEKTrEK provides the same system calls to the user as in the Pfarm case. The user partof the manager code to be run in the TrEK case is similar to that in the Ffarm case.The user part of the worker code includes the following two functions in addition to thecomp_fn() described in the Pfarm case.• split_fn()- This function consists of the user code that splits a task and is calledby the split process in the TrEK. It takes a pointer to the task as input and returnsthe pointers to the subtasks.• join_fn() - This function consists of the user code that combines the results ofsubtasks and is called by the join process in the TrEK. It takes the pointers to thesubresults as input and returns a pointer to the combined result.As an example, of the user code for an FFT implementation that uses TrEK hasbeen included in the following section.Chapter 8. System Integration and Applications 1278.4 ApplicationsSeveral applications have been developed using Pfarm and TrEK on our transputer-basedsystem by other graduate students and myself. In this section, we discuss two interestingcases where the models were used to understand the performance of applications. Resultsreported here are for Logical systems versions.8.4.1 Cepstral filteringPfarm was used to parallelize a vision application that performs Cepstral filtering formotion analysis [BL93]. It takes two images of the same subject at different instances oftime and determines the motion of the subject. The user code on the manager consists oftwo parts, a data-generator and a result-receiver. The data generator function partitionsthe images into small blocks and puts two corresponding blocks, one from each of the twoimages, into a single task. The result-receiver collects the results of the partial motionanalysis and assembles them. In the following paragraph, we discuss how the modelswere used for performance tuning.Initially, the program was tested using two smaller images of 64 x 64 bits. The taskexecution time (Te) for images divided into blocks of 16 bits was measured by executingthe program on a single worker node. For this case, the value of T was 178.048 ms.We were interested in using the program with larger images of 512 x 512 bits. Theperformance model was used to determine the best topology and the number of nodesto run this application for larger images. The model predicted that a 63-node balancedbinary tree gives maximum performance with the total execution time of 3.141 seconds.We executed the application program on a 63 node binary tree and found that it took4.492 seconds, which was considerably larger than the predicted value. To investigatethe reasons for the large error, we ran the program on a linear chain and a ternary tree.The model predicted that on a 64 node linear chain, it would take 4.743 seconds and ona 40 node ternary tree 4.813 seconds. For these cases, the prediction was accurate sincethe measured total execution times were 4.517 and 4.756 seconds for chain and ternarytree cases respectively. Then, the program was executed on a 32 node binary tree andChapter 8. System Integration and Applications 128the prediction was found to be accurate.To investigate the reasons for the large error in the prediction for the 63-node binarytree case, we recorded the number of tasks executed on each node. We found that theleaf nodes executed a fewer tasks than the intermediate nodes. This occurs only whenthe system is communication bound, but according to the models, the system shouldnot have reached the communication bound (based on the data and result sizes of thetasks). The only explanation for this scenario is that the system violated one of theassumptions in the model. On examination, we hypothesized that the data generatormay not be generating the tasks at the rate at which the system can receive and processthem. Therefore, we ran several experiments to measure the rate at which the data generator was generating the tasks, and found that the rate was 248.32 tasks/sec. However,according to the model, a 63-node binary tree for this application program can processtasks at the rate of 349.06 tasks/sec. This confirmed our suspicion that the large erroroccurred because the data generator was unable to keep up a continuous flow of tasksinto the farm. For chain, ternary tree and smaller binary tree topologies the predictionswere accurate because the task processing rate of these topologies were smaller than therate at which tasks were generated.8.4.2 Fast Fourier Transform (FFT)Divide-and-conquer strategy has been used to design efficient sequential FFT algorithms.In this example, we have used TrEK to parallelize a sequential algorithm that uses FFTfor multipoint evaluation of a polynomial over a field [Sed83]. For this implementation,execution time starts increasing for binary trees with more than 4 levels as the throughput reaches the communication bound given by equation (6.18) between levels 3 and4.The sequential FFT algorithm {Sed83] is reproduced below:Algorithm FFT(N, a(x), w, A)if N = 1 thena0;else/* split */n := N/2;Chapter 8. System Integration and Applications 129b(x) Zril ajxZ;c(x) : Z’/* recursive calls */FFT(n,b(x),w2,B);FFT(m, c(x), w2,C);/* combine */for Ic := 0 to ri — 1 doAk := Bk + wkCk;Ak+n := Bk —endforendifIn order to show the interaction between TrEK and the user code, the user code thatimplements this FFT algorithm has been included.data_generator 0{for (i=1; i<TotalJobs; i++) {/* prepare the data for a task *//* send the task to TrEK by calling do_task */do_task(task_type, user_data_size, ptr);}}result_receiver(){user_result *ptr;for (i1; i<=TotalJobs; i++) {1* get the next available result */ptr = (userresult *) get_resultO;/* process the result */}}comp_fn(user_data *ptr, mt *ur_size, char **ur_ptr){/* call the fft function */fftl(N, fptr, A);/* set the argument values to be returned*/*ur_sjze = user_result_size;*ur_ptr = (char*) p;}Chapter 8. System Integration and Applications 130split.in(ptr, split_datasize, ptrs)user_data *ptr;mt *split_datasize;char *ptrs [1;{/* split the task *//* set the argument values to be returned*/ptrs[0] (char *) ptrl;ptrs[1] = (char *) ptr2;*split_datasize = user_data_size;}join_fn(jb_userres, jres_size, ptr)char *jb_userres [TASK_DEGI;mt *jres_size;char **ptr;{/* join the results *//* set the argument values to be returned*/*ptr = (char *) p;*jres_size = user_result_size;}After parallelizing this algorithm using TrEK, we ran the program on a single nodewith smaller N (32 and 64), and used the technique described in Section 8.2 to determinethe values of the application dependent parameters. As the sequential algorithm is an0(NlogN) algorithm, we setTe(N) aN + bNlogN.Because the split and join costs in this algorithm are 0(N), we setT3(N)+Tj(N) c+dIV.We calculated the values of the constants (a, b, c and d) for this implementation using themeasured execution times for smaller N. The values of these constants are: a = 0.000330,b 0.000091, c = 0.001695 and d = 0.000082.The models were used to predict the performance of this implementation for largerN (128, 256, 512 and 1024). From the models, we found that the throughput for any NChapter 8. System Integration and Applications 131Problem Size Lower bound Measured MeasuredN Time Time Speedup128 13.55 16.841 7.32256 24.05 28.620 9.46512 45.06 52.864 11.13102 87.06 102.820 12.42Table 8.1: Experimental results for FFT on a 16-node binary treereaches the communication bound given by equation (6.18) for binary trees of 4 levelsand ternary trees of 3 levels. The system reaches this bound because of the split andjoin costs at the root processor. If a larger tree is used, the root processor would beunable to split and forward the tasks at the rate in which the rest of the processors canprocess the tasks. The only way to improve the performance in this case is to optimizethe code for split and join functions.We verified these predictions by experimenting with larger N (see Table 8.1). Aspredicted, the measured total execution time did not decrease when we increased thesize of a binary topology from 4 levels to 5 levels. Actually, the measured execution timeincreased because of the reason described in Section 7.3.4.Chapter 9ConclusionsThis dissertation has explored a parallel programming approach that addresses the needof providing a programming environment that is easy to use, efficient and supports performance tuning on multicomputers. In this approach, users are provided with programming support based on parallel programming paradigms. We have studied two commonlyused parallel programming paradigms: processor farm and divide-and-conquer. Runtimesystem support for these two paradigms are designed such that they are easy-to-use andcan maximize the performance for applications that fit these paradigms. Performancemodels are derived for these systems taking into account the computation and communication characteristics of the applications that fit the paradigm in addition to thecharacteristics of the hardware and software system. The models determine the parameters that affect the performance and can be used for performance prediction and tuning.This work has contributed to our understanding of these systems and their limitations.In designing reusable and efficient runtime systems, many trade-offs have to be considered. In Chapter 4 and 6, we have described the trade-offs involved in the designof Pfarm and TrEK respectively. Hiding the complexities of the underlying hardwareand software system is the major consideration in the design of these runtime systems.These systems include all the necessary code for the system dependent issues such ascommunication, synchronization, task scheduling, and load balancing. Thus, users canconcentrate on the application dependent compute intensive code. In order to be efficient, runtime systems are designed to make use of all the available parallelism in the132Chapter 9. Conclusions 133hardware system such as the ability to simultaneously communicate on all the links.The system overheads that limit the overall performance of the applications are keptto a minimum. Both systems implement distributed dynamic task scheduling strategiesso that they can work well even for applications that can be decomposed into taskswith varying computational requirements. The systems are designed such that they aretopology independent, i.e., they can scale and run on any processor topology.It is difficult to obtain a single performance model that can be used for all theapplications on a parallel system. However, it is possible to derive good performancemodels for each of the virtual machines as every paradigm is a restricted model ofparallel computation. Performance models for processor farm and divide-and-conquervirtual machines have been derived in Chapter 4 and 6 respectively. These models takeinto account the computation and communication characteristics of the applications thatfit the paradigm in addition to the characteristics of the hardware and software system.General analytical frameworks that can be used to predict the performance on any treetopology have been presented for both of these paradigms. As both of these task-orientedsystems behave like a pipeline, it is important to analyze start-up and wind-down.For the processor farm case, we have shown that, on a fixed topology, a breadth-first spanning tree provides maximum performance and steady-state performance of allbreadth-first spanning trees are equal. As balanced tree topologies provide maximumperformance in the case of reconfigurable systems, we have derived performance modelsfor these topologies using the general analytical framework.TrEK can execute divide-and-conquer computations of any degree and depth on anyarbitrary tree topology. Unlike idealized parallel implementations of divide-and-conqueralgorithms on tree processors [HZ83, Co189], TrEK allows intermediate processors to dosubtask processing to make use of all the available parallelism in the hardware system.The analytical framework assumes a flow of divide-and-conquer tasks. As explainedin Section 3.4, this framework works well even for applications that consist of a singledivide-and-conquer computation with large degree and depth compared to the underlyinghardware topology. Experimentally, we have found that, on a fixed topology, a breadthfirst spanning tree with maximum number of leaves obtains maximum performance. AsChapter 9. Conclusions 134balanced tree topologies provide maximum performance in the case of reconfigurablesystems, we have derived models that can predict performance of any fixed k-ary divide-and-computations on any g-ary balanced tree topology.Pfarm and TrEK have been implemented on a 75 node transputer-based multicomputer. They are implemented using C on two different software environments: LogicalSystems and Trollius. As Pfarm and TrEK provide standard interfaces to the user codeirrespective of the environment they are implemented on, the user code is portable fromone system to the other. Performance models have been experimentally validated usingPfarm and TrEK. The models are found to be accurate as reported in Chapters 5 and 7.In order to use the models, it should be possible to determine the values of the parameters in an easy way. We have explained the techniques that can be used to determinethe values of the system dependent and application dependent parameters. We havediscussed how the models can be used in predicting and tuning the performance.It is possible to use the processor farm strategy to parallelize some divide-and-conquerapplications. Divide-and-conquer strategy performs well compared to the processor farmstrategy for applications with larger tasks and for those that consist of a smaller numberof tasks. Performance models can be used to determine the strategy to be used for agiven application.In order to make it easier for application programmers to use runtime systems on amulticomputer, they must be provided with a programming environment that supportsall phases of program development and execution. In Chapter 8, we have described suchan integrated programming environment, Parsec, developed on our transputer-basedmulticomputer. In addition to providing Pfarm and TrEK templates to applicationprogrammers, Parsec supports tools such as a graphical interface, mapper, loader anddebugger We have discussed how Pars cc supports reusability, reconfigurability and performance tuning.Chapter 9. Conclusions 1359.1 Future DirectionsThis research can be continned in several directions to further explore the usefulness ofthis approach for parallel programming.We have mentioned that the same design can be used for implementing Ffarm andTrEK on any distributed memory parallel computer that has the characteristics detailedin Chapter 3. Also, the performance models derived here can be used for any system thatsatisfies the assumptions used in the models. By implementing Pfarm and TrEK on twodifferent software environments, Logical Systems and Trollius, we have shown that theuser programs are portable and the same models can be used for both implementationsusing appropriate parameter values. The claims of being able to use the same designand modeling on any multicomputer system in addition to user programs being portablecan be further strengthened by implementing the rnntime systems on other hardwareplatforms, such as C40 based machines. Developing more application programs withthese runtime systems in several application areas could further support the usefulnessof these runtime systems.It is interesting to research the expressiveness of the task-oriented paradigms, processor farm and divide-and-conquer, i.e., whether these paradigms and the associatedimplementations can be modified or enhanced to make use of them for applications thatmay not exactly fit the underlying computational models.Pfarm system can be used for applications with differing characteristics as discussedin Chapter 8. It is possible to modify the same design to develop a runtime system thatcan be used for the Task Queue paradigm. The applications that fit the Task Queueparadigm consist of an initial set of tasks, which could be allocated to various processorsin the system. These tasks may generate new tasks that have to be processed. Unlikein the case of divide-and-conquer, the results of these new tasks need not be joined bythe parent. In this case, the task scheduling and load balancing has to be different fromthat of Pfarm. Each processor can keep a local task queue to which all the new tasks areadded. It can exchange the load information with its neighbors and transfer the tasks tothe lightly loaded neighbors, if necessary. Deriving performance models for such systemsChapter 9. Conclusions 136may need probabilistic models that reflect the task generation.It is possible to expand this approach to develop virtual machines that supportapplications that fit other parallel programming paradigms such as Compute-AggregateBroadcast, Systolic and Dynamic Programming. Also, in practice, some applicationsmay consist of several phases each of which may need different programming paradigms.This work can be expanded by developing programming environment that supportssuch applications. There are many issues to be considered here such as how the differentsystems exchange the data and results, whether they can be co-existent on the systemor they should be interleaved.Bibliography[ACG91] S. Ahmed, N. Carriero, and D. Gelernter. The Linda Program Builder. InA. Nicolau, D. Gelernter, T. Gross, and D. Padua, editors, Advances inLanguages and Compilers for Parallel Processing, Research Monographs inParallel and Distributed Computing. MIT Press, Cambridge, Massachusetts,1991.{AG89] D. Atapattu and D. Gannon. Building Analytical Models into an InteractivePerformance Prediction Tool. Proc. Supercomputing ‘89, pages 521—530,November 1989. Indiana U.[AHU74j A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis ofComputer Algorithms. Addison-Wesley, Reading, MA, 1974.{A1v90] G. A. Alverson. Abstractions for Effectively Portable Shared Memory Parallel Programs. Technical Report 90-10-09, Dept. of Computer Science andEngineering, University of Washington, October 1990.[Amd67] G. M. Amdahl. Valildity of the Single-processor Approach to AchievingLarge Scale Computing Capabilities. In AFIPS Conference Proceedings.AFIPS Press, Reston, Va., 1967.[Ar188] R. Arlauskas. iPSC/2 System: A Second Generation Hypercube. In Proceedings of the Third Conference on Hypercube Concurrent Computers andApplications, pages 38—50. ACM Press, January 1988.[A588] W. C. Athas and C. L. Seitz. Multicomputers: Message-passing concurrentcomputers. IEEE Computer, August 1988.{BBFB89] Moshe Braner, Gregory D. Burns, David L. Fielding, and James R. Beers.Trollius OS for Transputers. In D. Stiles, editor, Transputer Research andApplications (NA TUG 1), 1989.[Ben8O] J.L. Bentley. Multidimensional Divide-and-Conquer. Commun. A CM,23:214—229, 1980.[BFKK91] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer. A Static Performance Estimator to Guide Data Partitioning Decisions. In Proc. 8th ACMSymp. on Principles and Practice of Parallel Programming (PPOPP), pages213—223, April 1991.137Chapter 9. Conclusions 138[BL93] E. Bandari and J. J. Little. Multi-evidential Correlation & Visual EchoAnalysis. Technical Report TR 93-1, Department of Computer Science,University of British Columbia, Vancouver, Canada, January 93.[BTU88] R. D. Beton, S. P. Turner, and C. Upstill. A State-of-the-Art Radar PulseDeinterleaver - A Commercial Application of Occam and the Transputer. InC. Askew, editor, Occam and the Transputer - Research and Applications,pages 145—152. lOS Press, 1988.[CCT93] A.G. Chalmers, N.W. Campbell, and B.T. Thomas. Computational Modelsfor Real-Time Tracking of Aircraft Engine Components. In S. Atkins andA. Wagner, editors, Transputer Research and Applications 6 (NA TUG 6).lOS Press, May 1993.[CG89] N. Carriero and D. Gelernter. How to Write Parallel Programs: A Guide tothe Perplexed. ACM Computing Surveys, pages 323—357, September 1989.[CGJ91j S. Chanson, N. Goldstein, J. Jiang, H. Larsen, H. Sreekantaswamy, andA. Wagner. TIPS: Transputer-based Interactive Parallelizing System. InTransputing ‘91 Conference Proceedings, Sunnyvale, Calf. lOS Press, April1991.[CHvdV88] N. Carmichael, D. Hewson, and J. van der Vorst. A Prototype SimulatorOutput Movie System based on Parallel Processing Technology. In C. Askew,editor, Occam and the Transputer - Research and Applications, pages 169—175. lOS Press, 1988.[CK88] D. Callahan and K. Kennedy. Compiling Programs for Distributed-memoryMultiprocessors. Journal of Supercomputing, 2:151—169, October 1988.[CK91] K. M. Chandy and C. Kesselman. Parallel programming in 2001. IEEESoftware, pages 11—20, November 1991.[Cok9l] Ronald S. Cok. Parallel Programs for the Transputer. Prentice-Hall, 1991.[Co189] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, Cambridge, Massachusetts, 1989.[CU9O] I. Cramb and C. Upstill. Using Transputers to Simulate OptoelectronicComputers. In S.J. Turner, editor, Tools and Techniques for TransputerApplications (OUG 12). lOS Press, April 1990.[Dav85] R. E. Davis. Logic Programming and Prolog. IEEE Software, pages 53—62,September 1985.[Den8O] J. B. Dennis. Data Flow Supercomputers. Computer, pages 48—56, November 1980.{Dun9OJ R. Duncan. A Survey of Parallel Computer Architectures. IEEE Computer,pages 5—16, February 1990.Chapter 9. Conclusions 139[EZL89] D.L. Eager, J. Zahorjan, and E.D. Lazowska. Speedup Versus Efficiency inParallel Systems. IEEE Transactions on Computers, March 1989.[F90] D.L. Fielding et al. The Trollius Programming Environment for Multi-computers. In A. Wagner, editor, Transputer Research and Applications(NATUG 3). lOS Press, April 1990.[FJL88] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. SolvingProblems on Concurrent Processors (vol. 1). Prentice Hall, New Jersey,1988.[FK89] H. P. Flatt and K. Kennedy. Performance of parallel processors. ParallelComputing, 12:1—20, 1989.[F1y72J M. J. Flynn. Some computer organizations and their effectiveness. IEEETrans. Computers, C-21 (9) :948—960, September 1972.{FO9O] I. Foster and R. Overbeek. Bilingual Parallel Programming. In A. Nicolau,D. Glelernter, T. Gross, and D. Padua, editors, Advances in Languagesand Compiler for Parallel Processing, Research Monographs in Parallel andDistributed Computing. MIT Press, Cambridge, Massachusetts, 1990.[FSWC92] D. Feldcamp, H. V. Sreekantaswamy, A. Wagner, and S. Chanson. Towardsa Skeleton-based Parallel Programming Environment. In A. Veronis, editor,Transputer Research and Applications NA TUG 5. lOS Press, April 1992.[FT9O] I. Foster and S. Taylor. Strand: New Concepts in Parallel Programming.Prentice Hall, New Jersey, 1990.[FW93] D. Feldcamp and A. Wagner. Parsec: A Software Development Environment for Performance Oriented Parallel Programming. In S. Atkins andA. Wagner, editors, Transputer Research and Applications 6 (NA TUG 6).lOS Press, May 1993.[Gab9O] E. Gabber. VMPP: A Practical Tool for the Development of Portable andEfficient Programs for Multiprocessors. IEEE Transactions on Parallel andDistributed Systems, 1(3):304—317, July 1990.[Ga190] J. Galletly. Occam2. Pitman Publishing, 1990.[GC92J D. Gelernter and N. Carriero. Coordination Languages and their Significance. Commun. ACM, 35(2):97—107, February 1992.[GK92] A. Gupta and V. Kumar. Analyzing Performance of Large Scale ParallelSystems. Technical Report TR 92-32, Dept. of Computer Science, Universityof Minnesota, October 1992.{GKP89] R. L. Graham, D. E. Knuth, and 0. Patashnik, Concrete Mathematics: AFoundation for Computer Science. Addison-Wesley, Reading, Mass., 1989.Chapter 9. Conclusions 140[GM85] D. H. Grit and J. R. McGraw. Programming Divide and Conquer for aMIMD Machine. Software-Practice and Experience, 15(1):41—53, January1985.[GR88] A. Gibbons and W. Rytter. Efficient Parallel Algorithms. Cambridge University Press, Cambridge, 1988.[Gus88] J. L. Gustafson. Reevaluating Amdahl’s Law. Commun. ACM, 31(5):532—533, May 1988.[Hey9O] A. J. G. Hey. Experiments in MIMD Parallelism. In E. Odijk, M. Rem, andJ.-C. Syre, editors, PARLE: Parallel Architectures and Languages Europe,pages 28—42. Springer-Verlag, New York, 1990. Lecture Notes in ComputerScience 366.[HKT91I S. Hiranandani, K. Kennedy, and C. Tseng. Compiler Support for Machine-independent Parallel Programming in Fortran D. In J. Saltz and P. Mehrotra, editors, Compilers and Runtime Software for Scalable Multiprocessors.1991.[Hom93j D. Homeister. An Adaptive Granularity Scheduler for Multiprocessors. InS. Atkins and A. Wagner, editors, Transputer Research and Applications 6(NA TUG 6). lOS Press, May 1993.[HU79] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.[HZ83] E. Horowitz and A. Zorat. Divide-and-Conquer for Parallel Processing.IEEE Transactions on Computers, 32(6):582—585, June 1983.[Inc9l] Texas Instruments Incorporated. TMS 32OC4X User’s Guide. 1991.[1nc92] Texas Instruments Incorporated. TMS 2OCX Parallel Runtime SupportLibrary: User’s Guide. 1992.[K87j H.T. Kung et al. The Warp Computer: Architecture, Implementation, andPerformance. IEEE Transactions on Computers, C-36(12):1523—1538, December 1987.[K88J H.T. Kung et al. iWARP: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing ‘88, Orlando, FL. IEEE, Computer Society Press, 1988.[K90] H.T. Kung et al. Supporting systolic and memory communication in iWarp.In Proceedings of the 17th Intl. Symposium on Computer Architecture. IEEE,Computer Society Press, 1990.[Kar87] A. H. Karp. Programming for Parallelism. IEEE Computer, pages 43—57,May 1987.Chapter 9. Conclusions 141[KF9O] A. H. Karp and H.P. Flatt. Measuring Parallel Processor Performance.Commun. ACM, 33(5):539—543, May 1990.{Kog85] P. M. Kogge. Function-based Computing and Parallelism: A Review. Parallel Computing, 2:243—253, 1985.[KR87] V. Kumar and V. N. Rao. Parallel Depth-first Search, Part II: Analysis.International Journal of Parallel Programming, 16(6):501—519, 1987.[KT88] M. Kallstorm and S. S. Thakkar. Programming thre Parallel Computers.IEEE Software, 5(1):11—22, Jan 1988.[Kun89] H. T. Kung. Computational Models of Parallel Computers. In R. J. Elliot and C.A.R. Hoare, editors, Scientific applications of multiprocessors.Prentice Hall, 1989.[Lim84J INMOS Limited. Occam Programming Manual. Prentice Hall International,New Jersey, 1984.[1im87] Meiko limited. Computing Surface Technical Specifications. Meiko Ltd.,Bristol, UK, 1987.[Lim88] INMOS Limited. Transputer Development System. Prentice Hall, New Jersey, 1988.[Moc88] Jeffrey Mock. Process, channels and semaphores (version 2). Logical Systems C Programmers Manual, Logical Systems, 1988.[MS88] D. L. McBurney and M. R. Sleep. Transputer-based Experiments with theZAPP Architecture. In J. W. de Bakker et al, editor, Lecture notes inComputer Science. 1988.[Mu193j S. Mulye. Communication Optimization in Parsec: A Programming Environment for Multicomputers. Master’s thesis, Dept. of Computer Science,University of British Columbia, 1993. in preparation.[Ne187] P. A. Nelson. Parallel Programming Paradigms. Ph.D. thesis, Departmentof Computer Science , University of Washington, Seattle, WA, 1987.[NRFF93] D.F. Garcia Nocetti, M.G. Ruano, P.J. Fish, and P.J. Fleming. ParallelImplementation of a Parametric Spectral Estimator for a Real-time DopplerFlow Detector. In S. Atkins and A. Wagner, editors, Transputer Researchand Applications 6 (NA TUG 6). lOS Press, May 1993.{Pri87] D. Pritchard. Mathematical Models of Distributed Computation. In Proceedings of the 7th Occam User Group Technical Meeting. lOS, 1987.[Pri9O] D. J. Pritchard. Performance Analysis and Measurement on TransputerArrays. In Aad van der Steen, editor, Evaluating Supercomputers. Chapmanand Hall, 1990.142[PW86] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Commun. ACM, 29(12):1184—1201, December 1986.[RSV9O] M. Rao, Z. Segall, and D. Vrsalovic. Implementation Machine Paradigm forParallel Programming. In Proceedings of Supercomputing ‘90, pages 594—603. IEEE, Computer Society Press, 1990.[5CW92j H. V. Sreekantaswamy, S. Chanson, and A. Wagner. Performance PredictionModeling of Multicomputers. In Proceedings of the Tweith InternationalConference on Distributed Computing Systems (lCD CS), June 1992.[Sed83] R. Sedgewick. Algorithms. Addison-Wesley, Reading, MA, 1983.[Sei85] C.L. Seitz. The cosmic cube. Commun. ACM, 28(1):22—33, Jan 1985.[Son93] Y. Song. The Implementation of Sequential and Parallel Algorithms forSolving Almost Block Bidiagonal Systems. Master’s thesis, Dept. of Computer Science, University of British Columbia, March 1993.[SR85] Z. Segall and L. Ruldolph. PIE: A Programming and Instrumentation Environment for Parallel Processing. IEEE Software, 2(6):22—37, November1985.[SS91] S. S. Sturrock and I. Salmon. Application of Occam to Biological SequenceComparisons. In J. Edwards, editor, Occam and the Transputer- CurrentDevelopments, pages 181—190. lOS Press, 1991.[Sto87] Q. F. Stout. Supporting Divide-and-Conquer Algorithms for Image Processing. Journal of Parallel and Distributed Computing, 4:95—115, February1987.[Sto88] H. Stone. High Performance Computers. Addison-Wesley, 1988.[Sys9O] Logical Systems. C Programmers Manual. Corvallis, OR, 1990.[TD9O] R.W.S. Tregidgo and A.C. Downton. Processor Farm Analysis and Simulation for Embedded Parallel Processing Systems. In S.J. Turner, editor,Tools and Techniques for Transputer Applications (OUG 1). lOS Press,April 1990.[U1l84] J.D. Ullman. Computational Aspects of VLSI. Computer Science Press,Rockville, MD, 1984.[WSC93I A. Wagner, H. Sreekantaswamy, and S. Chanson. Performance Issues inthe Design of a Processor Farm Template. In Proceedings of the WorldTransputer Conference. lOS Press, 1993. To appear.143GlossaryAverage processor overhead for a locally processed task.Average processor overhead for a forwarded task.r Communication rate of the links.The fraction of the total number of tasks executed by a processor i.M Total number of tasks in an application program.N The total number of processors in the system.SD Throughput of a D-level tree.SPD Speedup of a D-level tree.Td Average communication time required to transfer a task froma processor to its neighbor.Td Average communication time required to transfer a result froma processor to its neighbor.T Average processing time per task in the processor farm case.Te(j) Average processing time per task or subtask at the ith levelof the hardware topology in the divide-and-conquer case.Tj (i) Average result joining time at the ith levelof the hardware topology in the divide-and-conquer case.T8(i) Average task splitting time at the ith levelof the hardware topology in the divide-and-conquer case.T88 Steady-state execution time.T3 Start-up time.Ttotai Total execution time.Td Wind-down time.Number of tasks or subtasks that visit processor i.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- An integrated approach to programming and performance...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
An integrated approach to programming and performance modeling of multicomputers Sreekantaswamy, Halsur V. 1994
pdf
Page Metadata
Item Metadata
Title | An integrated approach to programming and performance modeling of multicomputers |
Creator |
Sreekantaswamy, Halsur V. |
Date Issued | 1994 |
Description | The relative ease with which it is possible to build inexpensive, high-performance multicomputers using regular microprocessors has made them very popular in the last decade. The major problem with multicomputers is the difficulty in effectively programming them. Programmers are often faced with the choice of using high level programming tools that are easy to use but provide poor performance or low level tools that take advantage of specific hardware characteristics to obtain better performance but are difficult to use. In general, existing parallel programming environments do not provide any guarantee of performance and they provide little support for performance evaluation and tuning. This dissertation explores an approach in which users are provided with programming support based on parallel programming paradigms. We have studied two commonly used parallel programming paradigms: Processor Farm and Divide-and-Conquer. Two runtime systems, Pfarm and TrEK, were designed and developed for applications that fit these paradigms. These systems hide the underlying complexities of multicomputers from the users, and are easy-to-use and topology independent. Performance models are derived for these systems, taking into account the computation and communication characteristics of the applications in addition to the characteristics of the hardware and software system. The models were experimentally validated on a large transputer-based system. The models are accurate and proved useful for performance prediction and tuning. Pfarm and TrEK were integrated into Parsec, a programming environment that supports program development and execution tools such as a graphical interface, mapper, loader and debugger. They have also been used to develop several image processing and numerical analysis applications. |
Extent | 2177906 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-04-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051473 |
URI | http://hdl.handle.net/2429/6951 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1994-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1994-893923.pdf [ 2.08MB ]
- Metadata
- JSON: 831-1.0051473.json
- JSON-LD: 831-1.0051473-ld.json
- RDF/XML (Pretty): 831-1.0051473-rdf.xml
- RDF/JSON: 831-1.0051473-rdf.json
- Turtle: 831-1.0051473-turtle.txt
- N-Triples: 831-1.0051473-rdf-ntriples.txt
- Original Record: 831-1.0051473-source.json
- Full Text
- 831-1.0051473-fulltext.txt
- Citation
- 831-1.0051473.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051473/manifest