(CM)2PI: Compose-Map-Configure MPI
Design and Implementation of a Scalable Parallel Architecture Specification and its Deployment

by

Ryan Ki Sing Chung
B.Sc., The University of British Columbia, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
November 2014
© Ryan Ki Sing Chung 2014

Abstract

In order to manage the complexities of Multiple Program, Multiple Data (MPMD) program deployment and to optimize for performance, we propose (CM)2PI, a specification and tool that employs a four-stage approach to create a separation of concerns between distinct decisions: architecture interactions, software size, resource constraints, and function. With function-level parallelism in mind, we use multi-level compositions to create a scalable architecture specification that improves re-usability and encapsulation. We explore different ways to decouple communication from the tight coupling of MPI ranks and placement. One of the proposed methods is flow-controlled channels, which also aim at tackling the common issues of buffer limitations and termination. The specification increases compatibility with optimization tools, which enables the automatic optimization of program run time with respect to resource constraints. Together these features simplify the development of MPMD MPI programs.

Preface

This thesis is original work by the author, Ryan Ki Sing Chung, developed in discussion with Dr. Alan Wagner, Dr. Humaira Kamal, Sarwar Alam, and Imran Ahmed. The programming library FG-MPI is developed by Dr. Humaira Kamal. This thesis incorporates suggestions and comments from Dr. Ronald Garcia.

Table of Contents

Abstract
Preface
Table of Contents
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Background
  2.1 MPI
  2.2 FG-MPI
  2.3 MPI Eager Send
  2.4 Related Work
    2.4.1 Component-Based Software Engineering
    2.4.2 Channel Communication
    2.4.3 Flow Control
    2.4.4 Computational Resources and Placement of Work
3 Composition
  3.1 Multiple Layered Specification (CPA Architecture)
    3.1.1 Specifying the Communication Process Architecture (CPA)
    3.1.2 The Assembly Line (The Four Stages)
  3.2 Communication
    3.2.1 Process Discovery
    3.2.2 Composition-Based Hierarchical Communicators
    3.2.3 Channels
    3.2.4 Flow-Controlled Channels
  3.3 Encapsulation of Compositions
  3.4 Composition of Programs
4 Placement
  4.1 Combination of Concurrency and Parallelism
  4.2 Scaling to Larger or Smaller Resources
  4.3 Heterogeneous Computation Load
  4.4 Glue Functions
    4.4.1 A: Main Init
    4.4.2 B: Map Init
    4.4.3 C: Map Finalize
    4.4.4 D: Get Parameters
    4.4.5 E: Communicator Permutation
    4.4.6 F: Free Parameters
    4.4.7 G: Main Cleanup
    4.4.8 Glue Code Use Cases
  4.5 Permutation of Processes
    4.5.1 Communicators Permutation Callback Function
5 Evaluation
  5.1 Integration with Optimization Tools
    5.1.1 Proof of Concept
  5.2 Permutation of Processes
  5.3 Adapting to Resource Size
  5.4 Integration of Separately Developed MPI Programs
  5.5 Performance Impact Evaluation
    5.5.1 Process Mapping Storage Requirement
    5.5.2 Process Discovery
    5.5.3 Hierarchical Communicators
    5.5.4 Channels Information
    5.5.5 CMCMPI Init
    5.5.6 Flow-Controlled Channels
6 Conclusion
  6.1 Future Work
Bibliography
Appendices
A Specification Notation
  A.1 Composition Notation
    A.1.1 MPI Process Declaration
    A.1.2 Concurrent Composition of MPI Processes
    A.1.3 Concurrent Composition of Compositions
    A.1.4 Intermediate Composition
    A.1.5 Communication Channels
    A.1.6 Port Hiding
    A.1.7 Patterns for Communication Port Declaration
    A.1.8 Forming a Structure
    A.1.9 Variable Initialization
    A.1.10 Initializing All Variables
  A.2 Placement Notation
    A.2.1 Container Specification
    A.2.2 Container Allocation to OS Processes
    A.2.3 Forming a Container Specification
    A.2.4 Mapping Functions to MPI Processes
    A.2.5 Glue Specification
    A.2.6 Environment Variables
    A.2.7 Combining Specifications
B API
  B.1 Process Discovery API
  B.2 Communicators API
  B.3 Channels API
  B.4 Flow-Controlled Channels API
C Examples
  C.1 Flow of Execution
    C.1.1 Compiling the Generated Code
  C.2 HelloWorld Example
    C.2.1 MPI
    C.2.2 FG-MPI
    C.2.3 FG-MPI with (CM)2PI
  C.3 PrimeSieve Example
  C.4 Farm Example
D Code sketch - Communicator Permutation

List of Figures

2.1 The above launches 16 processes (8 x executable1), (4 x executable2), and (4 x executable3) on a 4 node cluster where each node has 4 cores. Processes are assigned to cores by filling each node in order.
2.2 FG-MPI adds extra flexibility on placing MPI processes. However, a mapper needs to be written for each change in mapping. In the above, 4 OS processes are launched (1 with 8 MPI processes), (2 with 2 MPI processes), and (1 with 4 MPI processes) on a cluster with 2 nodes each with 2 cores.
3.1 CPA example of a prime sieve. The key aspects of this specification include the declaration of the processes and their compositions. This notation is similar to that of FSP (Finite State Process) in Kramer and Magee [28].
3.2 A description of the architecture would indicate a variable sized sieve element section with 1 generator and 1 last element. The generator is PrimeStart in the specification and the last element is PrimeEnd in the specification.
3.3 CPA-V example of a prime sieve. Strings can be used as values for function parameters only.
3.4 Container specification example of a prime sieve. Constraints on which MPI processes can go into each OS process, as well as the number of OS processes, are defined. In this example, there are 160 OS processes in total. One of them will have a PrimeStart process and as many PrimeElement processes as needed. One of them will have a PrimeEnd process and as many PrimeElement processes as needed. Finally, 158 OS processes will each have as many PrimeElement processes as needed.
3.5 Function specification example of a prime sieve.
3.6 The four stages. At each stage of the tool's execution, an input file is used to combine and further initialize the system.
3.7 The MPI ranks are assigned to the processes in the order they are defined in the execution command. A separate rank is given for each nfg value.
3.8 Wrapper functions are provided for each function in the program. The compressed arguments list from the execution command is decoded and then passed on to the user's function.
3.9 The project folder organization. The files generated are contained in a sub-directory of the project.
3.10 Visualizing the CPA using a tree of composition. The naming can be seen clearly here by following from the root to the corresponding MPI process.
3.11 Visualizing the CPA. A name is given to each level of composition as well as each process.
3.12 MPI process information available for packing. The compositions are expanded out and only processes with their full names are kept for the next steps.
3.13 Container specification defines requirements on the contents of the container.
3.14 OS processes are formed and placed adhering to container restrictions.
3.15 Channels are specified by connecting port to port. Intermediate ports are used when creating channels through compositions. The grey circles are external ports (intermediate ports visible outside of a composition) while the white circles are internal ports (an MPI process port visible only inside the composition).
3.16 The types of channels we consider. A) One-to-one channels with a single source and single destination, B) one-to-many channels where there is one source but many destinations, C) many-to-one channels where there are many sources sending to a single destination.
3.17 Flow-controlled channel communication for a one-to-one channel.
3.18 Flow-controlled channel communication for a one-to-many channel. The proxy is co-located with the source to reduce the effect of the communication between the source and the proxy.
3.19 Flow-controlled channel communication for a many-to-one channel. The proxy is co-located with the destination. Data must travel from the proxy to the singular destination in this channel construct, so the goal is to minimize the communication overhead of sending data from the proxy to the destination.
3.20 The typical messaging between source, destination, and proxy when the sender initiates the termination of the channel.
3.21 The typical messaging between source, destination, and proxy when the receiver requires the termination of the channel. Notice that the receiver can only request and not demand termination. It is still up to the sender to terminate the channel.
3.22 To optimize for performance, communication between co-located processes can be simplified as the communication is always synchronous and flow control is not an issue.
3.23 Creation of channels through composition. The white circles represent ports while the grey circles represent intermediate ports explicitly exposed by a composition.
3.24 Assuming there are 3 process types: A, B, and C. If these processes form a chain A) and we want to duplicate it, then the new chain must take a new set of rank numbers, where B) and C) are possibilities. Traditional methods of communication using ranks require the developer to consider and handle these possibilities.
3.25 A prefix can be used to identify the processes of a particular composition.
4.1 The use of a wildcard in the architecture initialization gives freedom to place as many processes as needed, such as the Master process. The use of a wildcard in the container specification allows a fixed number of MPI processes (as specified in the architecture) to be placed evenly on the resources available.
4.2 Loops can be used to create a simple pattern in the number of processes placed into each container.
4.3 The prefix of a composition can be specified in the container specification to remove ambiguity of process placement.
4.4 Separation of concerns in the specification allows concise decisions to be changed. The tool automatically fits the program into the resources available and generates the corresponding mapping. For large programs with many processes, changing the mapper and configuration manually is error-prone and grueling.
4.5 The permutation of processes using communicators. The dark blue circles represent processes and the white circles represent a rank naming at that communicator. Processes are permuted (green arrows) at the lowest level and the permutation propagates back up. Any channel information (red arrows) is based on the virtual ordering in the new communicators.
C.1 The order in which the tool is run from compilation to execution of the program.
C.2 The generator needs to send out numbers. The sieve elements receive numbers from the front of the sieve and pass numbers on. The last element needs to terminate the program when it receives a prime. The control channel is used for optimization of the termination process.
C.3 A description of the architecture would indicate a variable sized sieve element section with 1 generator and 1 last element.
C.4 Three patterns of composition. A farm can be by itself A), chained B), and nested C).

Acknowledgements

I would like to express my deepest gratitude to my supervisor, Dr. Alan Wagner, for guiding me through my graduate studies. His passion inspired me to explore fascinating topics. His knowledge, resourcefulness, and kindness supported me through countless obstacles.

I would also like to thank Dr. Ronald Garcia for reading my thesis and providing insightful suggestions and comments.

Thank you to Dr. Humaira Kamal for her work on Fine-Grain MPI, which motivated my research. Her suggestions and support were indispensable.

Thank you to Dr. Nelson Souto Rosa for giving me interesting suggestions for improving my research.

To all my friends, I thank you so much for all the memorable moments we've shared. Special thanks to my closest lab buddies: Sarwar Alam and Imran Ahmed.

Finally, thank you so much to my parents, sister, and aunt for their continuing support, love, and encouragement.

Dedicated to my parents, sister, and aunt.

Chapter 1

Introduction

Today's CPUs are no longer increasing significantly in clock speed but rather in the number of cores available. In order to optimize for performance and to scale to larger problems on a cluster of multicore nodes, it is necessary to use parallelism.
A problem that is largely neglected is the placement of processes, which determines both the communication overhead incurred and how well the compute cores are kept busy.

Message Passing Interface (MPI) is an example of a library where there is complete freedom in placing processes on compute resources. MPICH from Argonne National Laboratory [26] is middleware that implements the MPI standard library. The term MPI middleware is used to refer to the MPI runtime, tools, and support for communication. As MPI is built for performance, processes run "hot" (constantly polling for messages) in order to minimize overheads. As a result, it is usual that the number of processes needs to match the number of cores. The OS scheduler views every process as independent, and [18] found that performance is decreased by 10% when applications are run with two threads per core.

Fine-Grain MPI (FG-MPI) is an extension of the MPICH middleware which supports function-level parallelism, where each MPI process is responsible for a function rather than a whole executable. The novelty is in the ability to place more than one MPI process in a single OS process. The MPI processes inside the OS process are scheduled by a lightweight scheduler which is aware of the progress of each MPI process. This added layer of flexibility creates a new dimension for the specification of the placement. However, we do need a way to specify which MPI processes should be co-located and where.

There are frequent computation and communication patterns [29] that exist in large programs. This is especially true in scalable programs, as it is rare for every process to have a distinct function in a program with thousands of processes. When using FG-MPI, these patterns are commonly based on the size of the program and the resources. Yet, it is time-consuming and error-prone to specify these patterns through code. We need a way to externalize aspects of the code such that programs can be reused regardless of their placement.

In his blog [32], Jeff Squyres stressed the importance of what goes on to set up the MPI environment before calling MPI_Init. But what is also important is the configuration that happens before setting up the environment.

This research proposes (CM)2PI (Compose-Map-Configure MPI), a scalable parallel program specification based on a four-stage approach to decouple the software architecture and deployment decisions. A tool is also proposed to generate code based on this abstraction as wrappers around the users' functions. The main focus is on being able to configure parallel program placement for performance.

This thesis makes the following contributions:

• We propose a four-stage approach to program specification. This method allows decisions on architecture interactions, software size, resource capacity, and function to be specified separately in order to separate concerns from the design to its deployment.

• We propose a component-based architecture to allow parallel design patterns to be reused. A multi-level composition design also allows for encapsulation of a group of functions for further composition with separately developed applications. This is the cooperation of small parts to perform a bigger role.

• We explore three different ways to specify communication that eliminate the reliance on MPI's rank numbers, in order to allow flexibility in placement and the separation of placement and programming. We explore three types of communication specification: a name-based
process discovery library, a hierarchical communicator library, and a library based on unidirectional communication channels.

• We investigate the inclusion of a flow-controlled channel library in order to support communication termination (an agreement problem) as well as to reduce the chance of memory exhaustion at the receiver's buffer.

• We integrate the program specification with an optimization tool to automatically place MPI processes on the resources available.

The rest of the thesis is organized as follows: Chapter 2 explains some of the background information needed to understand the problem and techniques, Chapter 3 explains the programmability aspect through composition, Chapter 4 presents the performance aspect through placement, and Chapter 5 discusses the effectiveness of the proposed solution. In the appendix, we present the detailed specification language, the API, and three examples: a hello world program (basic usage), a prime sieve program (channels usage), and a farm program (composition, chaining, and channels usage).

Chapter 2

Background

2.1 MPI

Message Passing Interface (MPI) [3] is a standard API for creating parallel distributed memory programs that share information by passing messages between processes. The implementation of MPI used in this research is an open source library and middleware called MPICH [26].

MPICH programs are launched by the mpiexec command, which maps MPI processes to the compute cores available (see Figure 2.1). A machinefile listing the available processing cores and their locations must be created first. Then, each different process is created as a separate executable that uses the MPI library. Finally, a command is executed to specify the number of each executable to launch as processes and their ordering. Processes are then created based on the order they are specified in the command and mapped to the compute cores in the order specified in the machinefile.

[Figure 2.1 — Command: mpiexec -n 8 ./executable1 : -n 4 ./executable2 : -n 4 ./executable3. Machinefile: node01:4, node02:4, node03:4, node04:4. Each executable is a separate program with its own int main(int argc, char *argv[]).]
Figure 2.1: The above launches 16 processes (8 x executable1), (4 x executable2), and (4 x executable3) on a 4 node cluster where each node has 4 cores. Processes are assigned to cores by filling each node in order.

All MPI processes have a default communicator called MPI_COMM_WORLD. A communicator is a communication scope which assigns a unique rank number to identify each member process. This rank number is used as a target address when sending point-to-point messages. The default messaging type in MPI is asynchronous. Information can be passed through end-to-end communication (sending to a specific rank) or through collective communication (broadcast, scatter, gather) within a communicator.

If more MPI processes are launched than there are cores available, then these processes are time-shared by the operating system's scheduler. MPICH can run on a cluster with multiple compute nodes each with multiple cores or on a single machine with one or more cores. At launch time, the process manager Hydra starts each process on its assigned node; launch communication with other nodes is typically carried out through SSH. This allows the programmer to statically assign processes in order to optimize for locality when communication latency is important. In the case of heterogeneous clusters, with expert knowledge and manual adjustment it is possible to put processes with fewer tasks on slower cores.
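To make the role of ranks and communicators concrete, the following minimal example (generic MPI, not code from this thesis) sends a single integer from rank 0 to rank 1 over MPI_COMM_WORLD. Note that the destination is a hard-coded rank; decoupling programs from such rank numbers is a recurring theme in later chapters.

/* Minimal illustration of rank-based point-to-point communication in MPI:
 * rank 0 sends an integer to rank 1 over MPI_COMM_WORLD. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* from rank 0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}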
2.2 FG-MPI

Fine-Grain MPI (FG-MPI) [23, 25] is an extension of MPI that allows more than one MPI process to share a single processing core. FG-MPI's integration with the MPICH middleware allows MPI processes to be defined as functions rather than as separate executables. These functions can be packaged into a single executable. The execution command is similar to MPI, where the number of OS processes is specified using a -n flag, but a new flag -nfg is added to support specifying the number of MPI processes (the number of MPI ranks) to be placed inside each OS process. Since more than one MPI process can be co-located inside a single core, a mapper needs to be written to specify which function should be assigned to each MPI rank within each core (see Figure 2.2).

[Figure 2.2 — Command: mpiexec -n 1 -nfg 8 ./executable : -n 2 -nfg 2 ./executable : -n 1 -nfg 4 ./executable. Machinefile: node01:2, node02:2. The executable contains the FG-MPI boilerplate:]

#include "mpi.h"
#include "fgmpi.h"

int mymain1( int argc, char** argv );   /* forward declaration */
int mymain2( int argc, char** argv );   /* forward declaration */

FG_ProcessPtr_t binding_func(int argc, char** argv, int rank)
{
    if (rank % 2 == 0) return (&mymain1);
    else return (&mymain2);
}

FG_MapPtr_t map_lookup(int argc, char** argv, char* str)
{
    return (&binding_func);
}

int main( int argc, char *argv[] )
{
    FGmpiexec(&argc, &argv, &map_lookup);
    return (0);
}

int mymain1( int argc, char** argv ){ ... }
int mymain2( int argc, char** argv ){ ... }

Figure 2.2: FG-MPI adds extra flexibility in placing MPI processes. However, a mapper needs to be written for each change in mapping. In the above, 4 OS processes are launched (1 with 8 MPI processes, 2 with 2 MPI processes, and 1 with 4 MPI processes) on a cluster with 2 nodes each with 2 cores.

MPI allows parallelism of processes amongst the available computation cores. However, it is difficult to ensure that cores are never idle or to minimize communication overhead for computation with dependencies (i.e. a process must wait for input before computing). A common technique used to combat this is to programmatically integrate multiple tasks into a single process through the use of conditionals and manually managed time-sharing to switch between tasks. This is cumbersome and is detrimental to the readability and maintainability of the code.

FG-MPI allows the developer to put more than one MPI process on the same physical compute core. This enables processes to run concurrently as well as in parallel. Each MPI process can be programmed separately regardless of the number of physical compute cores available, because their concurrent execution is handled by the FG-MPI scheduler inside the middleware.

With function-level parallelism, code represents the software structure while at the same time giving the flexibility to co-locate processes to minimize communication and the amount of idle cycles of a core. With this new opportunity to customize the placement of processes in parallel and concurrently comes the need for a tool that makes the process of placing MPI processes manageable by breaking away from the need to use MPI rank numbers to identify processes.
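For completeness, a possible body for mymain1 from Figure 2.2 is sketched below. It is an assumption here, based on typical FG-MPI examples rather than on code shown in this thesis, that each co-located MPI process calls MPI_Init and MPI_Finalize itself, exactly like a regular MPI main.

/* Sketch of one FG-MPI process function that binding_func in Figure 2.2
 * could map a rank to (assumed structure, not the thesis's code). */
#include "mpi.h"

int mymain1(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank among all MPI processes     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* counts every -nfg process too    */

    /* ... process body: the FG-MPI scheduler can switch to a co-located
     *     process whenever this one blocks in an MPI call ... */

    MPI_Finalize();
    return 0;
}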
With thisnew opportunity to customize the placement of processes in parallel andconcurrently comes the need for a tool that makes the process of placingMPI processes manageable by breaking away from the need to use MPIrank numbers to identify processes.2.3 MPI Eager SendMPI has two types of protocol for sending messages: Eager and Rendezvous.The Rendezvous protocol is used when messages are larger than animplementation-specific eager limit. In this case, an envelope of the data issent to the receiver first. When the receiver indicates that it has sufficientbuffer space to receive the data, it is then sent. This method ensures thatbuffer space is not exhausted on the receiver side but has extra overhead forthe synchronization.On the other hand, small messages are sent using the Eager protocol.In this case, messages are sent to the receiver assuming there is sufficientbuffer space in the middleware to buffer messages that are not yet receivedby the sender. This method increases performance through uni-directionalasynchronous communication but there are potential problems with thismethod.One is the assumption that the there is sufficient buffer space on thereceiver side may not be true. Buffer space can run out or main memorycan be exhausted in the case of resizable buffers. Another issue is that asmessages queue up in the unexpected message queue, it adds to queueing72.4. Related Workdelays and search time as described in [34].Since eager messages have a lack of flow-control, we propose a compro-mise that works on top of the eager method by allowing a certain amount ofasynchronous messages dependant on the number of tokens provided by thereceiving end. This means that once in a while, a message in reverse needsto be sent to recycle tokens.2.4 Related Work2.4.1 Component-Based Software EngineeringComponent-based software engineering [6] suggests that the usage of compo-nents to encapsulate related functions improves application, component, andfunction re-usability as well as maintaining compatibility between differentsystems through modularity, cohesion, and low coupling. By communicat-ing through precise interfaces, this allows the components to be replacedwithout the need to know the internals of each component. Other benefitsinclude the ability to test components separately as well as the simplicity ofcreating precise documentation. This approach is widely used in softwaredevelopment.This technique can also be applied to parallel programming . TraditionalMPI code has a single “program” which when run determines its task basedon what rank it is. With FG-MPI, we can define functions as processes.Combining this with component-based software engineering, we can com-pose a set of functions (or MPI processes) to perform a simple task whileencapsulating all the details of its internals. These simple tasks can thenbe further composed into more complex tasks and so on. Code organizationand maintainability are also important factors to the success of a parallelprogram.2.4.2 Channel CommunicationChannels are a way for communication between two endpoints. They aredescribed in CSP [15] and used in many programming languages like Go [1]82.4. Related Workand Erlang [4]. A synchronous channel is achieved using a rendezvous toexchange data while an asynchronous communication channel is created tobe a buffer that processes can send data into and processes can retrieve datafrom. 
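A minimal sketch of this token idea is shown below. It is illustrative only and is not the flow-controlled channel implementation described in Section 3.2.4: the receiver grants a fixed number of credits, the sender may have at most that many eager messages outstanding, and credits are recycled in batches with a small reverse message.

/* Sketch of token (credit) based flow control on top of eager sends.
 * The sender starts with *credits == TOKENS; the receiver returns credits
 * in batches of TOKENS/2. */
#include <mpi.h>

#define TOKENS      16
#define TAG_DATA    1
#define TAG_CREDIT  2

void credited_send(const void *buf, int count, MPI_Datatype type,
                   int dest, MPI_Comm comm, int *credits)
{
    if (*credits == 0) {
        int returned;
        /* wait for the receiver to recycle a batch of tokens */
        MPI_Recv(&returned, 1, MPI_INT, dest, TAG_CREDIT, comm,
                 MPI_STATUS_IGNORE);
        *credits += returned;
    }
    MPI_Send(buf, count, type, dest, TAG_DATA, comm);
    (*credits)--;
}

void credited_recv(void *buf, int count, MPI_Datatype type,
                   int src, MPI_Comm comm, int *consumed)
{
    MPI_Recv(buf, count, type, src, TAG_DATA, comm, MPI_STATUS_IGNORE);
    if (++(*consumed) == TOKENS / 2) {   /* recycle half the window at a time */
        MPI_Send(consumed, 1, MPI_INT, src, TAG_CREDIT, comm);
        *consumed = 0;
    }
}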
2.4 Related Work

2.4.1 Component-Based Software Engineering

Component-based software engineering [6] suggests that the use of components to encapsulate related functions improves application, component, and function re-usability, while maintaining compatibility between different systems through modularity, cohesion, and low coupling. Because components communicate through precise interfaces, they can be replaced without the need to know the internals of each component. Other benefits include the ability to test components separately as well as the simplicity of creating precise documentation. This approach is widely used in software development.

This technique can also be applied to parallel programming. Traditional MPI code has a single "program" which, when run, determines its task based on what rank it is. With FG-MPI, we can define functions as processes. Combining this with component-based software engineering, we can compose a set of functions (or MPI processes) to perform a simple task while encapsulating all the details of its internals. These simple tasks can then be further composed into more complex tasks, and so on. Code organization and maintainability are also important factors in the success of a parallel program.

2.4.2 Channel Communication

Channels are a way for communication between two endpoints. They are described in CSP [15] and used in many programming languages like Go [1] and Erlang [4]. A synchronous channel is achieved using a rendezvous to exchange data, while an asynchronous communication channel is created as a buffer that processes can send data into and retrieve data from. This is similar to the publish and subscribe pattern, where senders are publishing to this buffer while receivers are subscribing to its content. One advantage of using channels is that the sender does not have to worry about where the endpoint of the data should go. Likewise, the receiver does not have to know the exact source of the message. In other words, a channel abstracts away the end-to-end communication by using a named entity for the transport of the message.

2.4.3 Flow Control

Flow control is the management of the amount of data that can be sent from source to sink while taking into consideration the speed at which the sink is able to consume this data and the speed at which the source is able to generate it.

This is important when data is generated more quickly than it is consumed. The goal of flow control is to make sure the sink is not overwhelmed with messages while at the same time ensuring that the speed of communication is near its optimal performance. Flow control is commonly used in networking protocols, e.g., TCP.

With MPI it is not possible to drop packets, as that would cause the whole program to fail. Furthermore, with the preference for performance, messages sent eagerly assume that there is sufficient buffer space to store the message when the receiver is not ready. Some user-level techniques currently used include synchronous sends, polling, and token systems.

Synchronous sends wait for acknowledgement that a message is received before carrying on with computation. Polling and token systems have the sink requesting messages from the source, and thus the sink has control over the number of messages forwarded. Obviously, all of these techniques incur extra messages and degrade performance.

2.4.4 Computational Resources and Placement of Work

The end of Dennard scaling [7] showed that manufacturers must look to adding more cores rather than speeding up cores. To exploit this dimensional change in resource growth and to perform well, programs must now be written to use multiple cores and nodes. New issues must be considered, as communication incurs larger penalties and synchronization of tasks across multiple cores is required for them to cooperate.

Many programming languages were developed to aid developers in thinking concurrently and to structure their code for concurrent computation. However, it appears that the support available has focused only on the programmability aspect and not on the need to efficiently place these computations.

Two major influences on performance are idle processing overhead, when available processing cores are not doing useful computation due to other dependencies, and communication overhead, time spent waiting due to the passing of data and messages in order to cooperate.

The Rust [9] language boasts programmability, but deployment on clusters is not in its current plans. Other languages like Go [1] and Cilk [13] provide work-stealing runtimes, while Hadoop and Erlang load balance workloads by distributing processes during execution.

These techniques are sufficient for Single Program, Multiple Data (SPMD) programs where most tasks take the same amount of time.
However, forMultiple Program, Multiple Data (MPMD) programs, the amount of time aprocess takes to execute varies greatly, which means some of these processescan benefit from not sharing a core while others prefer the opposite.Any type of dynamic load balancing technique can introduce unnecessarycontext switches through transfer of work from one location to another.Also, simply placing work based on a core’s workload neglects how importantcommunication overhead is for some applications and the locality of theprocesses can play a larger factor than the compute power. Furthermore,the performance of processes may be known ahead of time and thus thereis no point in migrating processes during execution. Perhaps dynamic load102.4. Related Workbalancing may be more adaptive, but optimization tools may be able to findstatic placement of the processes.In other words, existing techniques to distribute processes work wellfor applications such as process farms but are not appropriate as a singleglobal technique especially for MPMD applications where services may stayalive and communication amongst different actors play an important rolein the overall performance of the system. To bring performance benefits toconcurrent programs, we must also think parallel.11Chapter 3Composition3.1 Multiple Layered Specification (CPAArchitecture)With the current situation where the execution command (configfile) andprocess mapping is so highly coupled, we need to find a way to abstract outthe different parts. We use the separation of concerns design principle todesign four distinct phases to specifying an FG-MPI program. The aim isto have each of the stages be highly cohesive and loosely coupled.3.1.1 Specifying the Communication Process Architecture(CPA)We propose four separate specifications to describe the system.a) The Communication Process Architecture for specifying the architecturecomposition and communicationb) The Communication Process Architecture Variables to parametrize thesystemc) The Container Specification for defining resource constraintsd) The Function Specification for assigning functions to the entities in thearchitecture.For information about the notation, see Appendix A. For examples of howto specify and run (CM)2PI, see Appendix C.123.1. Multiple Layered Specification (CPA Architecture)3.1.1.1 Communication Process Architecture (CPA)The CPA specification is used to describe the overall architecture design ofthe software system. The intention is that the architecture alone (out of allfour specifications) is the only information needed by the developer whenprogramming the functions. An example of a CPA is shown in Figure 3.1with its corresponding structure diagram shown in Figure 3.2.s tructure primes ieve (numprimes ) { P rimeS tart = inports < > outports < next, control> () P rimeE lement = inports < prev> outports < next> () P rimeE nd = inports < prev, k ill> outports < > () R = [0..numprimes -4] | | - P rimeS ieveC hain = ( P rimeE lement[numprimes -2] ) / ( f or i in R { next[i] -> prev[i+ 1]}, receive -> prev[0], next[numprimes -3] -> s end ) @ (receive, s end ) | | - P rimeS ieve = ( P rimeS tart | | P rimeS ieveC hain | | P rimeE nd ) / ( P rimeS tart.next -> P rimeS ieveC hain.receive, P rimeS ieveC hain.s end -> P rimeE nd.prev, P rimeS tart.control -> P rimeE nd.k ill) }Process DeclarationProcess CompositionFigure 3.1: CPA Example of a prime sieve. The key aspect of this specifi-cation include the declaration of the processes and their compositions. 
Thisnotation is similar to that of FSP (Finite State Process) in Kramer andMagee [28].Generator Sieve Element Last ElementSieve Element Sieve ElementFigure 3.2: A description of the architecture would indicate a variable sizedsieve element section with 1 generator and 1 last element. The generatoris PrimeStart in the specification and the last element is PrimeEnd in thespecification.133.1. Multiple Layered Specification (CPA Architecture)The architecture, or structure, of the system consists of the definitionof the processes and the composition of those processes. The definition ofa process includes assigning a name, defining the input and output ports,as well as any arguments that should be passed in as argv. The definitionsof compositions include a list of the processes of compositions which canrun concurrently together along with communication information such aschannels (see Section 3.2.3) and any exposed ports to be available for furthercompositions. The component-based design consists of basic componentswhich represent processes at the lowest level and can be composed intolarger components representing a group of processes. This is based on theconcurrency models from [28]. These components can further be composedwith each other to form an even larger software system. For convenience,the multiplicity (number of copies) of each component in a composition isspecified using array notation. Expressions in the CPA can include the +, -,*, and / operators along with variables which can be initialized with a valuein a later step of the four stages.A naming system is created during the specification of the CPA. Sinceeach process is given a name and each composition component is also givena name, this creates a tree like structure where the name of each particularprocess can be identified by following the names of the compositions. (seeSection 3.2.1).To facilitate communication between processes, each process has definedinput ports and output ports. During a composition of these processes,channels can be defined to connect an input port of a process to the out-put port of another process. To facilitate encapsulation, each compositionmust redefine the exposed ports that can be used when it is further com-posed with other components. This communication information is madeavailable during program execution allowing processes to identify communi-cation channels by name rather than rank. (see Section 3.2.3).Furthermore, if a process requires input parameters, the parameter typesshould also be specified as a list in the CPA. Values of the parameters areto be specified in the next step when their values are defined.143.1. Multiple Layered Specification (CPA Architecture)3.1.1.2 Communication Process Architecture Variables(CPA-V)The CPAV specification is used to initialize a CPA by giving values to anyvariables used in the CPA specification. This information combined withthe CPA creates an instantiation of the system. In Figure 3.3, variablesinclude those used for process multiplicity in a composition and those usedfor the parameters to each process. Values for the process or compositionmultiplicity can be specified as an equation. 
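As a smaller, hypothetical template in the same notation (it is not an example from the thesis, and simply mirrors the structure of Figure 3.1 with the sieve-specific names and the control channel removed; the exact notation is defined in Appendix A), a generic pipeline of n processes could be written as:

structure pipeline (n) {
    Source = inports <> outports <next> ()
    Stage  = inports <prev> outports <next> ()
    Sink   = inports <prev> outports <> ()
    R = [0..n-4]
    ||- Chain = ( Stage[n-2] )
        / ( for i in R { next[i] -> prev[i+1] },
            feed -> prev[0], next[n-3] -> drain )
        @ ( feed, drain )
    ||- Pipeline = ( Source || Chain || Sink )
        / ( Source.next -> Chain.feed, Chain.drain -> Sink.prev )
}

Only the component named Pipeline needs to change if the chain is later reused inside a larger composition; the Stage processes themselves only ever refer to their local prev and next ports.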
Equations in the CPA caninclude the +, -, *, and / operators along with other variables defined.Values for process multiplicity can also be a Kleene star (*) indicating awild card such that any number of processes of this type is acceptable or aplus-sign (+) indicating one or more is acceptable.In addition to variables being used by the specification for the purposeof multiplicity, parameters to processes defined in the CPA-V can be anystring.Overall, this specification determines how large a system is to be launched.initializ e primes ieve { numprimes = 2000}V ariab le I nitializ ationFigure 3.3: CPAV Example of a prime sieve. Strings can be used as valuesfor function parameters only.3.1.1.3 Container SpecificationsOnce we have the software system, it is then time to match it with thecompute resources available. The container specification (see Figure 3.4)provides the restrictions of what (MPI processes) can be put into a con-tainer (OS process). This specification consists of two parts: the containerspecifications, and how many of each container specification to create actualOS processes for. Specifications define how many of each process be can153.1. Multiple Layered Specification (CPA Architecture)placed in each container. The syntax used is motivated by regular expres-sions, however for simplicity, we limit the types of wild card characters. Thenumber may be a positive integer or a Kleene star (*) indicating any numberof that process is acceptable or a plus-sign (+) to indicate at least one ofthat process must be present. The containers can be nested in each other.Finally, the number of each container needs to be specified. The optionalnumber after the slash is a restriction to the number of MPI processes thatcan be put into this container. When more than one composition uses aparticular process, it is possible to create containers that identify the cor-rect instance of the process to place. One way is through the suffix which isspecified before the multiplicity (e.g. processname[6]). The other option isthrough a prefix which can be specified before a colon in front of a containerspecification name (see Chapter 4).containers { containers pec primeS ieveS tart = < P rimeS tart[1], primeS ieve > containers pec primeS ieve = < P rimeE lement[*] > containers pec primeS ieveE nd = < primeS ieve, P rimeE nd[1] > boxes ( primeS ieveS tart[1] | | primeS ieve[15 8] | | primeS ieveE nd[1] )}I nitializ ation of O S processesO S process ty pes ( w h at M PI processes can g o in th is specif ication)Figure 3.4: Container Specification Example of a prime sieve. Constraints ofwhat MPI processes can go into each OS process as well as the number of OSprocesses is defined. In this example, there are 160 OS processes in total.One of them will have a PrimeStart process and as many PrimeElementprocess as needed. One of them will have a PrimeEnd process and as manyPrimeElement processes as needed. Finally, 158 OS processes will each haveas many PrimeElement processes as needed.3.1.1.4 Function SpecificationProcesses defined in previous steps are placeholders so we need to bind eachof them to a function. The function specification (see Figure 3.5) associateseach of the processes specified in the CPA with a “.c” function and an “.h”163.1. Multiple Layered Specification (CPA Architecture)file containing the function header. Any additional glue function (see Sec-tion 4.4) that is required to be run by the system needs to be defined inthe function specification. 
This specification essentially contains adjustable parameters for user-defined functions and user-defined variables. The function mapping is the choice of which user function to run for each process. There is also a section to specify the user functions that are to be executed as glue functions (see Section 4.4). Finally, the GENV (global environment variables) specifications are for exposing the user's internal variables in the external interface.

[Figure 3.5 — the function specification (function-to-MPI-process assignment, glue functions, and global environment variables):]

functionmapping {
    PrimeStart   = { "primeSieve.h", "generator" }
    PrimeElement = { "primeSieve.h", "sieveElement" }
    PrimeEnd     = { "primeSieve.h", "lastElement" }
}
glue {
    gluemapping {
        genvcondition = { "MAPPING", "sequential" }
        mapinit       = { "mapping-custom.h", "init_seq" }
    }
    gluemapping {
        genvcondition = { "MAPPING", "strides" }
        mapinit       = { "mapping-custom.h", "init_strides" }
    }
    gluemapping {
        genvcondition = { "MAPPING", "random" }
        mapinit       = { "mapping-custom.h", "init_random" }
    }
    gluemapping {
        maincleanup   = { "mapping-custom.h", "my_free_params" }
    }
}
genv {
    GROUP = 25
    SEED = 6
    CUTS = 20
}

Figure 3.5: Function specification example of a prime sieve.

3.1.2 The Assembly Line (The Four Stages)

The initialization of the system, from the architecture up to creating a runnable program, progresses through four stages by reading the defined specifications one at a time (see Figure 3.6). The four stages were chosen in order to split up four decisions: the architecture design, the size of the program, the resource constraints, and the function binding.

[Figure 3.6 — the four stages: Stage 1 (Design) reads the CPA; Stage 2 (Instantiation) combines it with the CPA-V to form an instance; Stage 3 (Packing) applies the container specification to produce the container assignment; Stage 4 (Shipping) applies the function specification to produce the mapping, C source code, Makefile, and configfile.]
Figure 3.6: The four stages. At each stage of the tool's execution, an input file is used to combine and further initialize the system.

At first, the architecture and the container specification may seem like redundant specifications of the same thing that could be combined into a single specification. For simple examples this may look to be the case; however, the difference between the two specifications becomes more apparent when some form of communication mechanism is used. When communication is used, it is based either on the naming in the CPA specification or on the channels specified there. Then it becomes apparent that function implementations depend on the CPA specification and that changes to the CPA specification may break the program. However, one of our goals is to place the program on different resources. We want to support many different ways to map processes to the resources without having to change the program. Thus, we chose to create two separate specifications.

3.1.2.1 Stage 1 – Design

The CPA is read from a file (refer to Figure 3.1). The architectural design of the system is now present but uninitialized, and so the system size is not yet known.

3.1.2.2 Stage 2 – Instantiation

The CPA-V is read from a file (refer to Figure 3.3). When combined with the CPA, an instance of the software system is now created.
The numberof instances of each MPI processes that is packed into the system can bedetermined now (except in cases where wildcards are specified, the exactnumber of that process is affected by the container requirements in Stage3). Compositions that contain a multiplicity adds a multiple of the sub-group of processes.At this point, we have determined how many of each process is in thesystem.3.1.2.3 Stage 3 – PackingThe Container specification is now read in. Refer to Figure 3.4. At thisstage, we now have the boxes (OS processes based on container specifica-tions) in which to pack our items (MPI process). So, using the number ofcontainers and the information from stage 2, the processes are packed into193.1. Multiple Layered Specification (CPA Architecture)the boxes evenly (with regards to the number of each type of MPI process)while satisfying container restrictions. This packing technique is a policythat can be changed in the tool’s program. Containers specifications needto specify processes that can be packed into each container (this is unrelatedto the composition specified in the CPA).Currently, the packing differentiates between each instance of the sameprocess only when composed separately. All processes that are in the samecomposition are treated the same. Glue functions may allow reordering ofsimilar processes (see Section 4.4).At this point, we decide which and how many of each MPI process is ineach OS process.3.1.2.4 Stage 4 – DeploymentThe Function specification is read from file (see Figure 3.5). Using the fileinformation, we can now create a mapping.c file that imports the appro-priate .h file for the functions needed and provide a function mapper thatreturns the correct function based on the rank of the process. A configfileis also generated. This configfile accompanies the mapping.c file to indi-cate which contiguous blocks of ranks belong to the same OS process (seeFigure 3.7).mpiexec -n 1 -nf g 8./executable :-n 2 -nf g 2 ./executable :-n 1 -nf g 4 ./executableCOMMAND8 M P I P rocesses( R a n k 0 t o 7 )2M P I P rocesses( R a n k 8 t o 9 )2M P I P rocesses( R a n k 1 0 t o 1 1 )4M P I P rocesses( R a n k 1 2 t o 1 5 )OS P r o ces s OS P r o ces s OS P r o ces s OS P r o ces sFigure 3.7: The MPI ranks are assigned to the processes in the order theyare defined in the execution command. A separate rank is given for eachnfg value.The machinefile can be used to customize which machine and whichcore each OS process is allocated to. The machinefile is a descriptionof the available compute nodes and the number of cores available on eachof them. When mpiexec is called, the OS processes are mapped to theircorresponding cores from the start to the end of the machinefile.203.1. Multiple Layered Specification (CPA Architecture)The mapping.c file creates a wrapper function for each MPI process.The default wrapper selects the correct parameters from the configfileand pass on to the original function as regular argc, argv parameters (seeFigure 3.8). This wrapper only runs once at the beginning. The wrapperonly requires the information of which function to call and the offset for theparameter. 
That is, the minimum parameter content for each process type is its name (this is to keep the correct offset).

[Figure 3.8 — Command: mpiexec -n 1 -nfg 8 ./executable fn_name,arg1_1,arg1_2 fn_name2,arg2_1 : -n 2 -nfg 2 ./executable fn_name3 fn_name4,arg4_1,arg4_2 : -n 1 -nfg 4 ./executable fn_name5. The generated mapper maps each rank to a wrapper instead of directly to the user's function:]

int mymain1( int argc, char** argv ){ ... }
int mymain2( int argc, char** argv ){ ... }

static int wrapper_fn1(int argc, char *argv[])
{
    return wrapper_invoke(argc, argv, &mymain1, 1);
}

int wrapper_invoke(int argc, char *argv[], int (*orig_func)(int, char**), int offset)
{
    int new_argc = ...;
    char **new_argv = ...;
    return (*orig_func)(new_argc, new_argv);
}

For example, with argc = 3 and argv = ["executable", "fn_name,arg1_1,arg1_2", "fn_name2,arg2_1"], the wrapper decodes new_argc = 3 and new_argv = ["fn_name", "arg1_1", "arg1_2"] and invokes the original function with the decoded arguments.

Figure 3.8: Wrapper functions are provided for each function in the program. The compressed arguments list from the execution command is decoded and then passed on to the user's function.

The function for determining the parameter input to each MPI process is pluggable, meaning the user can provide a separate function to programmatically determine the parameters.

The mapping function is grouped by each line of the configfile (each -nfg -n executable), with each group further describing what MPI processes it contains. This way, the size of the mapping function grows proportionally with the configfile. The configfile already contains the -n and -nfg information in a compacted way; if the user needs to create complicated mappings then the configfile has to be more verbose, and thus so does the mapping function. In the worst case, every MPI process calls a different function, which forces the mapping function to be completely explicit. In the best case, every MPI process calls the same function and a single-line mapping function suffices.

[Figure 3.9 — the project folder layout: the project's .c and .o files and Makefile, a Mapping sub-directory (mapping glue functions, Makefile, configfile, other generated mapping files), and a run script.]
Figure 3.9: The project folder organization. The files generated are contained in a sub-directory of the project.

The output files are placed in a sub-directory of the project directory (see Figure 3.9). The programmer creates a new target in the project Makefile which compiles the project (leaving out any FG boilerplate) and then invokes make on the Mapping directory, giving it the name of the executable: make APP=exename. A cmcmpi.a is built inside the Mapping directory. This file needs to be linked with the user's compiled code into a single executable. A single run script is added to the project directory for execution of the mapping-generated program.
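A sketch of what such a project Makefile target might look like is shown below. The target and file names are illustrative assumptions, not the tool's actual generated output; only the make APP=exename invocation and the Mapping/cmcmpi.a library follow from the description above.

# Hypothetical project Makefile target (names are illustrative): compile the
# user's sources without FG-MPI boilerplate, build the generated Mapping
# directory, and link the result with Mapping/cmcmpi.a into one executable.
APP = primesieve
CC  = mpicc

$(APP): primeSieve.o
	$(MAKE) -C Mapping APP=$(APP)          # produces Mapping/cmcmpi.a
	$(CC) -o $(APP) primeSieve.o Mapping/cmcmpi.a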
Conveniently, our compositions are alreadynamed and because it is a hierarchical composition, each process can beuniquely identified by chaining the names of the hierarchy (see Figure 3.10).If a multiple of the same process is in a composition, they are identifiedby the same name. Therefore, it is possible that more than one processmay be identified by the same name. However, this is not an issue becausethose processes are identical, there is no need to differentiate between themother than having an additional index value for each of them throughoutthe execution of the program. This means that these processes are uniquelyidentifiable using a combination of name and index.Consider this excerpt of the CPA’s composition.‖- 3© = ( 4©[ 5©] )‖ 1© = ( 2© ‖ 3© ‖ 6© )In this example (see Figure 3.11), there is a single process named 1©. 2©and there is a set of the same processes called 1©. 3©. 4© with indexes rangingfrom 0 to 5© or in other words 1©. 3©. 4©[0.. 5©]Only the MPI processes’s information is carried to the packing stage (seeFigure 3.12).1See Appendix B.1 for the Process Discovery API.233.2. Communication 12 634 4 4 51 2 M P IP roc essesC omp ositions. . .Figure 3.10: Visualizing the CPA using a tree of composition. The namingcan be seen clearly here by following from the root to the correspondingMPI process.1. . .2 3 5 64Figure 3.11: Visualizing the CPA. A name is given to each level of compo-sition as well as each process. 1 . 2 1 . 61 . 3 . 4 1 . 3 . 4 1 . 3 . 4 51 2 M P IP roc esses . . .Figure 3.12: MPI process information available for packing. The composi-tions are expanded out and only processes with their full name is kept forthe next steps.The MPI processes are put into available containers given they matchthe container specification (see Figure 3.13).After the MPI processes are assigned to each container, the container canbe seen as an OS process (see Figure 3.14). Now, a process can be lookedup using its process ID (pid) and its name given that there is no currentlyknown distinguishing features amongst the MPI processes with the same243.2. Communication 1 . 3 . 4 1 . 2 1 . 61 . 3 . 4 4 1 . 3 . 4 51 2 . . .C ontainer Sp ec if ic ations. . .P I D = 0 P I D = 1 P I D = NFigure 3.13: Container Specification defines requirements on the contents ofthe container.name. 1 . 2 1 . 6O S P roc esses 1 . 3 . 41 1 . 3 . 4 1 1 . 3 . 41 1 . 3 . 41 1 . 3 . 41 1 . 3 . 42 4 9 1 . 3 . 41 1 . 3 . 4 1 1 . 3 . 4 1 1 . 3 . 4 1 1 . 3 . 41 1 . 3 . 4 1 1 . 3 . 4 1 1 . 3 . 41 1 . 3 . 44 9 9 1 . 3 . 41 1 . 3 . 4 1 1 . 3 . 41 4 1 . 3 . 46 8 0 1 . 3 . 41 41 1 . 3 . 45P I D = 0 P I D = 1 P I D = N. . .Figure 3.14: OS Processes are formed and placed adhereing to containerrestrictions.3.2.2 Composition-Based Hierarchical CommunicatorsUsing a naming system allowed the movement of processes without havingto manually adjust the ranks in the programmers point of view. That sys-tem is lightweight and efficient for end-to-end communication. However, forcollective communication this is not ideal.In order to take advantage of MPI’s collectives, we need to use communi-cation scopes called communicators. To scope communication, it is naturalto have a communicator2 for each composition. This means that a processthat is composed twice would have 3 communicators, one for each level as2See Appendix B.2 for the Communicators API.253.2. Communicationwell as an overall world communicator. Since each process is a member ofmultiple communicators, a proper system to retrieve the intended communi-cator is required. 
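To illustrate how such a lookup might be used from a process's code, consider the sketch below. The function and type names are hypothetical placeholders: the actual (CM)2PI Process Discovery API is defined in Appendix B.1 and may differ.

/* Sketch of name-based process discovery (hypothetical accessor names). */
#include <mpi.h>

/* hypothetical: fills in the ranks of all processes with the given full name */
int cm_lookup_by_name(const char *fullname, int **ranks, int *count);

void send_to_first_element(int prime)
{
    int *ranks = NULL, count = 0;

    /* "PrimeSieve.PrimeSieveChain.PrimeElement" is the chained composition
     * name from the CPA of Figure 3.1; index 0 selects the first copy. */
    cm_lookup_by_name("PrimeSieve.PrimeSieveChain.PrimeElement", &ranks, &count);
    if (count > 0)
        MPI_Send(&prime, 1, MPI_INT, ranks[0], 0, MPI_COMM_WORLD);
}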
Conveniently, our compositions were named and the samenaming system that is used to identify a process can be used to identify acommunicator. For example, a process named 1©. 3©. 4© has a communicatorcalled 1© and a communicator called 1©. 3© as well as a world communicator.The communicators are created at the initialization stage and thus incurno extra overhead during execution of the program. The initialization is doneusing MPI Comm split and done starting at the highest level (world) andmoving down. This allows us to create as many communicators as possiblesimultaneously rather than sequentially. Not only are these communicatorsefficiently created before the rest of the process execution, their smallerscope enable the easier creation of more communicators as there is no needto synchronize MPI COMM WORLD when creating a new communicator.3.2.3 ChannelsThe previous two methods abstracted away the need to use MPI processrank numbers. However, there was still a need to know the names of pro-cesses in order to discover them and communicate. For example, when usinghierarchical communicators, the developer still needs to know the name orlevel of compositions it is in to be able to get the correct communicator.In order to make processes completely modular and able to be developedwithout the knowledge of how they are composed, we need a way to spec-ify only the inputs and outputs of a process and then later connect themtogether appropriately. This allows functions to be developed to the inputand output ports specification which includes a name of the port as well asits direction. The local name of a port is unchanged regardless of where it isattached. So a port called next may be connected to a port called prev toform a channel and each process refers to that connection by its local name.A tag may also be specified to be used for communication in that channel.The creation of channels3 is similar to the creation of connection lines3See Appendix B.3 for the Channels API.263.2. Communicationon labels in [28] where similar labels represent a shared action between thetwo states. In our case, the label names are used in the program to identifythe ports from which communication can be used to send or receive data.So instead of using a renaming technique to make connections, we specifythe directional connections between two named ports in order to be distinctabout the flow of data.To facilitate encapsulation, each composition must define what ports areexposed for new channels when further composed. This is inspired by thehiding of actions in [28]. To the exterior of the composition, only these par-ticular ports are visible and their names are as specified by the compositionand not by the original process. The exposed port is seen as a double endedport (i.e. an in-port as well as an out-port). To make this clearer, both aninbound and an outbound channel must be connected to an exposed-port.However, to the individual processes, this intermediate port is hidden. Thedirection of the intermediate port is derived by the directions of the twochannels attaching to it (see Figure 3.15).Figure 3.15: Channels are specified by connecting port to port. Intermedi-ate ports are used when creating channels through compositions. The greycircles are external ports (intermediate ports visible outside of a composi-tion) while the white circles are internal ports (an MPI process port visibleonly inside the composition).Frequently, a set of recurring patterns must be applied to sets of portswhen creating channels. 
Conveniently, we provide a loop option for thispurpose. Ranges can be specified such as R = [0..numprimes-4] (which273.2. Communicationmeans that R is a value from 0 to numprimes-4 inclusive) and can be used togenerate channels over a pattern. For example, a chain can be specified as( for i in R next[i] -> prev[i+1] tag 5 ) which creates a channelfrom the next port of process index i to the prev port of the process at indexi+1 and to use the tag 5 for any communication across this channel.Channels can support both one-to-one communication as well as one-to-many communications as long as all the source processes are of the samename and all the destination processes are of the same name.The channel information defined per level in the CPA is translated intoan actual instance. This information is available to the program duringexecution. To clearly identify a process’s channels, each type of processhas its own structure to define its channel information. The struct’s nameis generated based on the processes’ base name as defined in the CPA. Forexample, if a composed process is called PrimeSieve.PrimeStart, the channelstructure is only based on the process name PrimeStart as they all have thesame channel structure regardless of how they are composed.Listing 3.1: Channel struct definitiontypedef struct chaninfo {int* rank;int size;int tag;int direction;MPI_Comm communicator;} chaninfo_t;typedef struct chan_primeelement{chaninfo_t prev;chaninfo_t next;} chan_primeelement_t;The structure of a processes’ information (see Listing 3.1) is based onthe names of the ports. Intuitively, it is possible to get the information ofthe channel through using the process’s port name (as defined in the CPA).In addition, the tag id to be used for this communication is also available283.2. Communicationas well as the direction of the channel. Further, the communicator to beused is also specified. At the moment, the lowest level common hierarchicalcommunicator is used. This opens up for further improvements to howcommunicators can be leveraged to provide a different layer of scoping forchannels. Since there are four pieces of information available for a particularchannel, we decided to use an additional C structure to store the four dataitems.Given the structure information available to the program during runtime, it is still up to the program whether or not this information is tobe used as intended. For example, a channel between two processes cansend data in a direction contradicting the original design. In other cases,the specified tag may not be utilised by the program even though it wasspecified in the CPA. It is important to note that the information specifiedin the CPA is only made available to the program but is in no way enforced.To enable this requires the creation of wrappers for all MPI send and receivemethods.These channels are assigned according to their process type and index.This makes it possible to rearrange similar processes by plugging a custompermutation as a part of the glue code (see Section 4.4).One interesting point to consider is what happens when two channelsfrom the same source and destination have the same tag. One way would beto consider each channel as a separate channel and thus we need to createanother scope for this channel. In MPI communication, there are 4 piecesof context that scope a communication message: source rank, destinationrank, tag, and communicator. 
Since the first three are unchangeable, theonly way we can scope these messages are through the creation of anothercommunicator. However, the creation of a separate communicator for eachchannel is not scalable.The other option is to treat the two channels as the same whenever theyhave the same tag. It is up to the user to create channels in the CPA withseparate tags. This is the same responsibilities required in traditional MPIcommunication so this option is reasonable. In the next section we describeflow-controlled channels which removes the need for a tag (see Section 3.2.4).293.2. Communication3.2.4 Flow-Controlled ChannelsChannels allowed a light weight approach to sending messages from sourceto destination by providing the bare minimal amount of information (ranks,tags, direction, and communicator).MPI communication is of two types, eager and rendezvous. The eagerprotocol is such that messages are sent from the sender assuming they aresmall enough to be buffered by the receiver. Any messages that exceed animplementation-dependant eager limit are sent using the rendezvous proto-col which involves sending an envelope data to the receiver and waiting untilthe receiver indicates that sufficient buffer space is available.We provide simple channel4 interface that includes flow control in orderto utilize the efficiency of eager sends while limiting the amount of queuedunexpected messages.3.2.4.1 BoundedThe amount of messages unreceived at the receiver must be bounded inorder to avoid exhausting the buffers and queuing delays. With this aim,we introduce an intermediate buffer between the source and the destination.The sender sends to the buffer until it is full and then future sends blockuntil space is available. When possible, the buffer sends the messages to thereceiver.In order to not flood the receiver, the receiver and proxy uses a token-based system where one token is consumed for each message sent to thereceiver. This is similar to the Guaranteed Envelope Resources (GER) tech-nique [5] where an each process-pair has a GER which guarantees againstfailure if the GER is not exceeded. The GER is estimated based on thenumber of processes there are per node and the number of possible senders.In our research, the number of tokens is a hard-coded number and is nota guarantee. Further research could gather memory size of the nodes andcalculate a guaranteed size.4See Appendix B.4 for the Flow-Controlled Channels API.303.2. CommunicationThe intermediate buffer allows for an extra buffer to enable more asyn-chronous messaging while allowing the sender to block when the buffer isfull and letting the receiver pass back tokens. When the receiver is out oftokens, the intermediate buffer can still buffer more messages for the senderuntil itself is full.The extra buffered message may be sufficient to allow the sender tocomplete its send and move on to other tasks while the intermediate bufferwould send the data to the receiver as tokens are required.Our technique is similar to TCP’s advertised window mechanism. Wedo not advertise the proxy size to the sender and instead post MPI receivesonly if there is sufficient buffer space. A sender using a synchronous sendthen blocks. In FG-MPI, MPI processes can be co-located to share a singleprocessing core and so cycles can be used by other MPI processes when asender is blocked.This also makes the communication architecture simpler as the senderonly needs to send and not receive any replies. The intermediate bufferhandles the receiver tokens instead. 
All the receiver side of the API has to do is receive messages and send some tokens back to the intermediate buffer to allow more messages. It is more efficient to start updating tokens before the receiver has read all the messages, in order to allow the intermediate buffer to send more as soon as possible while keeping the number of token-update messages low.

The sender sends messages to the intermediate buffer via two regular sends followed by a synchronous send. The synchronous send allows the sender to block if the intermediate buffer is full, because the intermediate buffer does not post further MPI_Recv calls until it has sufficient buffer space. We do not want to send all messages synchronously, because that is a performance hit, so we take advantage of the middleware's receive buffer queue at the intermediate buffer. This setup minimizes the chance of the channel being completely empty when messages are available to be sent.

3.2.4.2 Blocking

The FG-MPI scheduler contains a blocking mode where MPI processes blocked on a send or receive are put into a block queue so as not to waste processing cycles. The sender should therefore block when the intermediate buffer is full. To ensure that cycles are not wasted, blocking the sender process allows other co-located processes to proceed. Another advantage is that it is much simpler to code and debug a blocking sender, as opposed to one where the send may either succeed or fail. The same is true for the receiver.

3.2.4.3 In-Band Control

Channels are considered unidirectional because data is sent in only one direction. However, control data is sent in-band, transparently to the sender and receiver. These control messages include startup, token, and termination messages.

With FG-MPI, processes that are created but blocked on a call incur no run-time penalties except a small amount of memory. The FG-MPI middleware uses shared memory access to optimize communication between co-located processes. FG-MPI also provides a zero-copy API that enables the sending of messages by passing pointers rather than copying memory. The communication cost is much less than that of a message through the network. This allowed us to be more flexible when designing the architecture for the intermediate buffer and deciding where messages should go. Instead of time-slicing the sender process to receive token updates and termination messages, we let FG-MPI provide concurrency by using co-located processes to do each of those tasks. The code is more cohesive and thereby more maintainable and easier to understand.

3.2.4.4 Channel Types

We support three types of flow-controlled channels:

a) one-to-one channels with a single source and single destination,
b) one-to-many channels where there is one source but many destinations (data from the source travels to one of the possible destinations), and
c) many-to-one channels where there are many sources sending to a single destination.

3.2.4.5 One-to-One

In the one-to-one case, the channel is a pipe for sending data through (see Figure 3.16, label A).

3.2.4.6 One-to-Many

In the one-to-many case, we need to account for the distribution of the incoming data to the destinations available (see Figure 3.16, label B). There is a single intermediate buffer for all the messages that are sent out from the single source. That intermediate buffer is called a proxy process, as it distributes its data to the destinations. How it distributes the data is a matter of policy and can be changed as required. In general, there are a few ways to distribute the data: round-robin style, or priority style based on the available tokens.
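As an illustration only (the proxy's actual selection logic is a policy choice; the names and the policy enum below are ours, not part of the library), a destination-selection helper for a one-to-many proxy could support both policies like this:

/* Sketch of destination selection in a one-to-many proxy.
 * tokens[i] is the number of messages destination i can still accept. */
enum distribution_policy { ROUND_ROBIN, MOST_TOKENS_FIRST };

/* Returns the index of the chosen destination, or -1 if none has tokens. */
static int pick_destination(enum distribution_policy policy,
                            const int *tokens, int ndest, int *rr_next)
{
    int i, best = -1;

    if (policy == ROUND_ROBIN) {
        /* Resume from the last position, skipping destinations without tokens. */
        for (i = 0; i < ndest; i++) {
            int cand = (*rr_next + i) % ndest;
            if (tokens[cand] > 0) {
                *rr_next = (cand + 1) % ndest;
                return cand;
            }
        }
        return -1;
    }

    /* MOST_TOKENS_FIRST: prefer the destination with the most spare tokens. */
    for (i = 0; i < ndest; i++)
        if (tokens[i] > 0 && (best < 0 || tokens[i] > tokens[best]))
            best = i;
    return best;
}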
3.2.4.7 Many-to-One

In the many-to-one case, we are gathering data from many processes and sending it to one process (see Figure 3.16, label C). This case is also simple, as the intermediate buffer proxy gathers the data from all sources and forwards it on to the destination.

3.2.4.8 Implementation Details

Flow-controlled channels are created for each channel specified in the CPA, whether they are used or not.

Both sides need to open the channel before data can flow through it. We make the assumption that if one side opens a channel, the other side eventually opens the channel also. This is analogous to MPI's send and receive, where if one side sends a message, it is assumed that the other side eventually posts a receive. However, with the introduction of the open and close API, it is possible to decide whether to allocate the resources needed to support the channel, such as buffer space.

Figure 3.16: The types of channels we consider. A) One-to-one channels with a single source and single destination, B) one-to-many channels where there is one source but many destinations, C) many-to-one channels where there are many sources sending to a single destination.

This also allows many-to-one and one-to-many channels to indicate which sources and destinations are available to send and receive data. Especially in the one-to-many case, it is important to know which receivers are available to send data to, and which receivers not to send to because they have not yet agreed to receive any data. In other words, many-to-one and one-to-many channels have an agreement that once the channel is opened, there must be at least one sender and one receiver at each end at any time. Senders and receivers can unilaterally withdraw by notifying the channel that they will not send or receive data until further notice, as long as they are not the last active sender or receiver. Likewise, senders and receivers may join the channel at any time as long as it remains open under the terms above. When the last sender closes the channel, the channel is shut down and only active receivers are notified. The channel is then permanently closed and cannot be opened again. When the last receiver wants to close the channel, it must seek agreement from the sender and terminate after the sender shuts down the channel.

To keep this efficient, there exists a proxy process (see Figure 3.17) that acts as the intermediate buffer. This intermediate buffer listens for the open and close messages of the sender and receiver, and also receives and buffers incoming messages, sending them to receivers when tokens are available.

Figure 3.17: Flow-controlled channel communication for a one-to-one channel.

Tokens provided by receivers are also organized by this proxy process. The proxy is also required in order to support the blocking nature of the send request: when a synchronous send is sent to the proxy and the proxy buffer is full, the sender blocks until the proxy accepts the data. The buffer in the proxy allows the sender to do other work while messages are sent to the receiver when possible.
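The sender-to-proxy pattern behind this (two standard-mode sends followed by a synchronous send, as described in Section 3.2.4.1) can be sketched as follows; the proxy rank, tag, integer payload, and batch size of three are illustrative assumptions of ours, not part of the (CM)2PI API.

#include <mpi.h>

#define CHAN_TAG 7   /* illustrative tag */

/* Two standard sends followed by one synchronous send.  If the proxy has
 * stopped posting receives because its buffer is full, the MPI_Ssend blocks,
 * which in FG-MPI simply yields the core to other co-located MPI processes. */
static void channel_send_batch(int *msgs, int proxy_rank, MPI_Comm comm)
{
    MPI_Send(&msgs[0], 1, MPI_INT, proxy_rank, CHAN_TAG, comm);
    MPI_Send(&msgs[1], 1, MPI_INT, proxy_rank, CHAN_TAG, comm);
    /* The synchronous send provides the back-pressure: it completes only
     * once the proxy has posted a matching receive. */
    MPI_Ssend(&msgs[2], 1, MPI_INT, proxy_rank, CHAN_TAG, comm);
}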
Without this proxy, it is difficult for the sender to decide when to receive token messages while carrying on its other tasks. If we instead bypassed the buffer proxy and blocked on token updates when the tokens are depleted, then the only buffer available to us is the non-guaranteed buffering provided by the middleware. This proxy process is generated for each OS process whenever channels are created in the architecture specification. As a result, for all the MPI processes sharing a single processing core there exists only one shared proxy process. The intermediate proxy process is co-located with at least one side of the channel in order to minimize the extra communication caused by having an intermediate process.

The most efficient placement of the proxy is on the single-process end of a one-to-many or many-to-one channel, in order to reduce communication overhead by leveraging FG-MPI's use of shared memory for co-located processes (see Figure 3.18 and Figure 3.19).

Figure 3.18: Flow-controlled channel communication for a one-to-many channel. The proxy is co-located with the source to reduce the effect of the communication between the source and the proxy. (The double-ended arrows in the figure are a compressed representation of the communication.)

Figure 3.19: Flow-controlled channel communication for a many-to-one channel. The proxy is co-located with the destination. Data must travel from the proxy to the single destination in this channel construct, so the goal is to minimize the communication overhead of sending data from the proxy to the destination.

3.2.4.9 Channel Termination

Channel termination requires both sides to agree. For correctness, the channel must be emptied before termination. This implies that at some point the sender stops sending to give the receiver a chance to empty the channel, after which the channel can be closed. There are two cases to the termination problem.

a) If the sender terminates the channel first (see Figure 3.20), the termination message is sent with the same tag, source, and destination, so by MPI's message-ordering semantics it reaches the receiver after all the previous data has been received. Once the sender sends the termination, the channel is closed from the perspective of the sender, as the receiver cannot demand more data. When the receiver receives the termination message, the channel is also closed on its side, because the sender promises not to send more data.

Figure 3.20: The typical messaging between source, destination, and proxy when the sender initiates the termination of the channel.

b) The second case is if the receiver closes the connection (see Figure 3.21). It is a matter of policy whether the receiver can force the source to stop sending or merely request that it stop. In either case, a termination request message is sent to the sender and, depending on policy, the sender then carries out the same procedure as when a sender terminates the connection. As a result, when the receiver sends the termination request, the channel continues to receive data from the sender as long as the sender's termination message has not reached the proxy. To avoid non-blocking senders, termination requests should not be sent directly to the source. With FG-MPI, the issue is solved by having yet another co-located process for handling the termination request. This process is co-located with the sender and can take advantage of the shared address space. When this process receives the termination request, it only needs to set a flag in local memory, and the sender checks this flag before sending any messages. This extra process incurs almost no extra resources, as it is always blocked on receiving the termination request.

Figure 3.21: The typical messaging between source, destination, and proxy when the receiver requires the termination of the channel. Notice that the receiver can only request and not demand termination. It is still up to the sender to terminate the channel.
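A minimal sketch of the flag-based arrangement described above is shown below. The tag, the flag variable, and the structure of the two routines are illustrative assumptions; in the implementation the control process is a separate co-located MPI process that shares the sender's address space.

#include <mpi.h>

#define TERM_TAG 99   /* illustrative tag for termination requests */

static volatile int termination_requested = 0;   /* shared with the sender */

/* Body of the co-located control process: it stays blocked until a
 * termination request arrives, then simply sets the shared flag. */
static void control_proxy(MPI_Comm comm)
{
    int dummy;
    MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TERM_TAG, comm,
             MPI_STATUS_IGNORE);
    termination_requested = 1;
}

/* Sender side: consult the flag before sending another message. */
static int sender_may_send(void)
{
    return !termination_requested;
}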
3.2.4.10 Other Considerations

There is an opportunity to optimize this protocol when sending messages between co-located processes. In FG-MPI, all message sends between co-located processes are synchronous anyway, so there is no need to buffer any of the messages; in other words, flow control is not needed for messages between co-located processes. To optimize for this, messages are sent directly from sender to receiver, bypassing the proxy (see Figure 3.22). The sender's termination messages also travel directly to the receiver. As well, the receiver's termination request is sent directly to the sender's control proxy rather than through the intermediate proxy. This only works in the one-to-one case, because in one-to-many and many-to-one channels the intermediate proxy also load balances messages in addition to its buffering task. Furthermore, we are able to utilize FG-MPI's zero-copy API in order to eliminate an extra buffer copy when sending the data.

Figure 3.22: To optimize for performance, communication between co-located processes can be simplified, as the communication is always synchronous and flow control is not an issue.

We did not implement many-to-many flow-controlled channels. MPI collectives using communicators are more suitable for this type of communication, and we do not want to re-create collectives, since MPI already provides them with optimized communication and message buffering.

The design above describes the messaging protocol to support flow-controlled channels.
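To summarize the protocol in code form, the following is a deliberately simplified, polling-style sketch of a one-to-one proxy loop. The tags, buffer size, integer payload, and the use of -1 as a termination marker are our own illustrative choices; the real proxy blocks rather than polls, and the open/close handshake is omitted here.

#include <mpi.h>

#define DATA_TAG  1
#define TOKEN_TAG 2
#define BUFCAP    16

static void proxy_loop(int src, int dst, MPI_Comm comm)
{
    int buf[BUFCAP], head = 0, count = 0, tokens = 0, done = 0, flag;
    MPI_Status st;

    while (!done || count > 0) {
        /* Accept new data only while there is buffer space; not posting the
         * receive is what eventually blocks the sender's synchronous send. */
        if (!done && count < BUFCAP) {
            MPI_Iprobe(src, DATA_TAG, comm, &flag, &st);
            if (flag) {
                int msg;
                MPI_Recv(&msg, 1, MPI_INT, src, DATA_TAG, comm, &st);
                if (msg == -1) done = 1;                  /* termination marker */
                else buf[(head + count++) % BUFCAP] = msg;
            }
        }
        /* Token updates from the receiver are always consumed. */
        MPI_Iprobe(dst, TOKEN_TAG, comm, &flag, &st);
        if (flag) {
            int t;
            MPI_Recv(&t, 1, MPI_INT, dst, TOKEN_TAG, comm, &st);
            tokens += t;
        }
        /* Forward buffered messages while the receiver still has tokens. */
        while (count > 0 && tokens > 0) {
            MPI_Send(&buf[head], 1, MPI_INT, dst, DATA_TAG, comm);
            head = (head + 1) % BUFCAP;
            count--;
            tokens--;
        }
    }
    /* Pass the termination on to the receiver. */
    {
        int term = -1;
        MPI_Send(&term, 1, MPI_INT, dst, DATA_TAG, comm);
    }
}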
Alternative designs include changing the policy depending on the needs of the program. We have implemented one policy, but others could be created to change the behaviour of: how messages are load balanced; whether the sender can keep sending messages once a termination request is received; whether channels can be reopened after they are closed; and whether individual senders or receivers may terminate without the consensus of the others when they are not the last sender/receiver in the channel.

The API does not completely solve the receiver buffer-exhaustion problem but rather mitigates it, because the flow control is only present within each channel. If a receiver has too many incoming channels, its buffers could still be exhausted even if each channel sends a single message. With FG-MPI, this issue is exacerbated, as more co-located MPI processes means less memory is available to each MPI process. Changes to the MPI middleware are necessary to make this solution completely safe.

Overall, the design of the proxy above includes the following points:

a) To limit the number of buffered messages in a single channel to the sum of the size of the proxy's buffer and the number of tokens provided by the receiver.

b) To disallow messages from travelling through the channel when its state is CHANNEL_CLOSED. The sender can demand the closing of a channel, while the receiver can only suggest it.

c) To be a reusable design for all three cases: one-to-one, one-to-many, and many-to-one.

d) To solve the issue of scheduling when to receive control messages. Merging the tasks of the proxy into the sender creates a scheduling issue for the developer: the only time the library's code runs in the user's code is when the user invokes one of the library's API calls, but it is not efficient to receive all control messages only when a library API is invoked.

e) To ensure that no other non-channel-related messages with the same tag, source, and destination interfere with the operation of the channel, because channel messages are addressed to the proxy.

3.3 Encapsulation of Compositions

Good software design calls for high cohesion and low coupling between pieces of code. Encapsulation normally means hiding the internal workings of a black box: how functions interact with each other and how things are represented inside the black box is unknown to the outside, and the only way the outside can invoke the black box is through the interfaces it provides. In a parallel program, the internals are MPI processes (in FG-MPI, these are functions). Encapsulation in this case means hiding what these processes are as well as how they communicate. To achieve this we use ports and channels.

Ports are declared for each MPI process so that distinct communication end points are known. The process itself only needs to know the port's name, regardless of how the process is composed or what its full name is, because the channels created between ports provide the communication medium. When composed, all internal ports are hidden.

If any ports need to be available external to the composition, an exposed port must be declared and the internal port attached to it. To the external environment of the composition, the composition has the same parts as a process: a name and a set of ports. This encapsulation can even be multi-layered. For example, a system using a data structure is a composition of the system and the data structure, but that composition can then be further composed with an application which treats the previous composition as a single entity.
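Concretely, a process is coded only against its local port names. The sketch below uses the generated channel structure of Listing 3.1; how the structure instance reaches the process (here, as a parameter) and the integer payload are assumptions made for the example.

#include <mpi.h>

/* chaninfo_t and chan_primeelement_t are the generated types of Listing 3.1.
 * The process refers only to its local ports "prev" and "next"; it never
 * mentions ranks, so it is unaffected by how the composition is placed. */
static void prime_element(const chan_primeelement_t *chan)
{
    int value;

    /* Receive from whatever is attached to our "prev" port ... */
    MPI_Recv(&value, 1, MPI_INT, chan->prev.rank[0], chan->prev.tag,
             chan->prev.communicator, MPI_STATUS_IGNORE);

    /* ... and forward it on the "next" port. */
    MPI_Send(&value, 1, MPI_INT, chan->next.rank[0], chan->next.tag,
             chan->next.communicator);
}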
How the channels are created through these compositions may be thought of as threading a string through a series of holes (the ports); at the end points we tie the string off to complete the channel (see Figure 3.23). The process itself has no interest in the intermediate holes (ports) the string passes through, only in the end points. Implementation-wise, the intermediate ports no longer exist after a channel is defined. Overall, this defines how we address messages, but, as in any other MPI program, the type of the data itself must be agreed on separately from this specification.

Figure 3.23: Creation of channels through composition. The white circles represent ports, while the grey circles represent intermediate ports explicitly exposed by a composition.

3.4 Composition of Programs

The ability to easily compose programs together to create a functioning system is important in exploiting the modularity of our programs. In a sequential program, one can include a header file of another program and start using it by communicating through an API. For a parallel program it is not as simple: we need to account for the communication by identifying the correct targets as well as the placement of the corresponding processes. Creating two instances of a data structure is no longer simply a matter of having two variables; rather, we need to duplicate the processes that provide those functions. Traditionally, one would have to handle the different communication patterns in complex and unintuitive ways, embedding conditionals in the code so that a process communicates with the correct instance of its neighbour (see Figure 3.24). One of our goals is to simplify this by providing an API for identifying the correct process for communication.

Figure 3.24: Assume there are three process types, A, B, and C, which form a chain (A). If we want to duplicate the chain, the new chain must take a new set of rank numbers, where (B) and (C) are possibilities. Traditional methods of communication using ranks require the developer to consider and handle these possibilities.

To support modularity, each structure (or CPA) is given a name. Each can take a set of unbound variables that need to be initialized. The name of the structure creates a scope for the variables, so that variable declarations for each structure do not conflict when structures are composed together. Maintaining modularity, it is possible to include structures from other CPA specifications (see Listing 3.2).

Listing 3.2: Including another structure

# Composition of helloWorld and primesieve
include prime.composition
include hello.composition

Each "include" in each file is read by (CM)2PI. Each structure is initialized by name. Once a structure has been initialized, its information (compositions and process declarations) is combined with that of the other structures that are input into (CM)2PI.
Currently, the tool looks for include files in the directory the specification was in, as well as in other directories where it has already read a specification file. If this is insufficient, there is a -i flag to specify additional locations in which to look for include files.

Each process is given a name and each further composition is given a name. This creates a tree of names from which we can trace out the full name of a particular process. For example, PrimeSieve.PrimeStart means there is a process called PrimeStart inside the composition PrimeSieve. To support composition of components, different structures can reuse any compositions created by another structure which they include. For example, to compose two functioning parallel programs into one single parallel program, we would simply create a new CPA which includes the two CPAs of the respective parallel programs. This example does not include any communication between the two components, but it does show that we would not need to modify any code or rank numbers to achieve the composition, because the tool does it for us.

If we were to make the two components communicate, there are two ways to achieve this: one is through the discovery API (see Section 3.2.1) and the other is through channels (see Section 3.2.3). If the communication is through the discovery API, then the corresponding code simply has to make sure the communication calls target the correct processes by discovering them through their names. If the communication is through channels, then channels can be created between the ports of the two compositions while remaining oblivious to which process inside each composition uses the port.
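To give a feel for the discovery route, the sketch below resolves a neighbour by name and index instead of by rank arithmetic. The lookup call is a hypothetical stand-in: the actual function names and signatures are those of the Process Discovery API in Appendix B.1.

#include <mpi.h>

/* Hypothetical discovery call standing in for the Process Discovery API of
 * Appendix B.1: returns the world rank of the process with the given
 * hierarchical name and index. */
int cmcmpi_rank_of(const char *name, int index);

static void send_to_named_neighbour(int value)
{
    /* Address the process by composition name and index; no rank numbers are
     * hard-coded, so the code survives re-mapping and re-composition. */
    int dest = cmcmpi_rank_of("PrimeSieve.PrimeStart", 0);
    MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
}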
In some cases, it may be required to create two copies of the same structure. In this case, to preserve modularity, one can create the structure with all compositions hidden and then, in a separate structure, create a composition that uses the compositions of the first structure (see Figure 3.25). In order to pack each program together, it is possible to identify which instance of the structure we wish to pack into a container by adding the prefix of the composition (see Listing 3.3). For example, if there is a hidden composition called LinkedList and we create two instances of it by making the compositions List1.LinkedList and List2.LinkedList, we can reuse the original containers for the LinkedList and still identify which instance is to be mapped to which container. If the original container was called LinkedListContainer, we can achieve this by specifying the prefixes List1:LinkedListContainer and List2:LinkedListContainer.

Figure 3.25: A prefix can be used to identify the processes of a particular composition.

Listing 3.3: Duplicating a structure

# Assume that the LinkedList composition is defined in a structure defined in linkedlist.cpa
include linkedlist.cpa

structure linkedlistx2 () {
  || List1 = ( LinkedList )
  || List2 = ( LinkedList )
}

initialize linkedlistx2 {}

containers {
  boxes ( List1:linkedlistspec[8] || List2:linkedlistspec[8] )
}

Chapter 4

Placement

4.1 Combination of Concurrency and Parallelism

The composition specifies the system and its correctness with respect to process communication and composition. There still remains the placement of the computation on the machine; while satisfying the resource constraints, we can also optimize for performance. The container specification describes the available resources onto which the tool places the previously initialized MPI processes. It defines two parts: the container specifications themselves, and the initialization of those specifications into boxes (OS processes).

A container specification defines how many of a particular process can be placed into an OS process made with that specification. These specifications can also be composed into other container specifications. Each specification can also set an nfg limit to bound the number of co-located processes allowed in the container, so that fewer MPI processes contend for the same processing power. A number of OS processes (or boxes) are then specified using the container specifications defined. Any number of these boxes can be created, and each one in the end translates to an OS process.

When the number of each process to be placed in an OS process is specifically defined, it is easy to allocate. However, when the number of processes to allocate is a wild card, or the number of a particular process that can be placed into an OS process is a wild card, there is more flexibility (see Figure 4.1).

Figure 4.1: The use of a wildcard in the architecture initialization gives freedom to place as many processes as needed, such as the Master process. The use of a wildcard in the container specification allows a fixed number of MPI processes (as specified in the architecture) to be placed evenly on the resources available.

However, with this flexibility comes the problem of allocating the MPI processes to the OS processes intelligently. What is considered a smart technique is subjective and dependent on the problem; without any domain knowledge, the reasonable option is to balance out the number of MPI processes on each OS process. Guidance can be given by limiting the number of MPI processes in a container specification. Hence, in the end, we have a set of constraints: the number of each type of MPI process in the system, the number and types of MPI processes that can appear in an OS process, and the maximum number of MPI processes.

To quickly calculate a placement, we used a linear programming approach, as the constraints fit it well. Lpsolve [2] is the linear programming library used. Depending on the linear programming library, when given just the constraints, the allocation of MPI processes to OS processes may not be even (i.e., the first OS process is stacked with MPI processes while the rest are empty). To mitigate this problem, a target number of MPI processes is calculated for each OS process and the linear optimization minimizes the difference from the target numbers.
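As a sketch of how these constraints can be expressed (the exact variable encoding and objective used by the tool with lpsolve may differ), let x_{ij} be the number of MPI processes of type i placed in box j, N_i the required number of processes of type i, u_{ij} the per-type limit of box j's container specification, nfg_j its co-location limit, and T_j its target load:

\begin{align*}
\text{minimize}\quad & \textstyle\sum_{j} d_j \\
\text{subject to}\quad & \textstyle\sum_{j} x_{ij} = N_i && \text{every process of type } i \text{ is placed,} \\
& \textstyle\sum_{i} x_{ij} \le \mathit{nfg}_j && \text{box } j \text{ respects its co-location limit,} \\
& x_{ij} \le u_{ij} && \text{box } j \text{ accepts at most } u_{ij} \text{ processes of type } i, \\
& d_j \ge \textstyle\sum_{i} x_{ij} - T_j, \quad d_j \ge T_j - \textstyle\sum_{i} x_{ij} && \text{deviation from the target load,} \\
& x_{ij} \ge 0, \; d_j \ge 0.
\end{align*}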
Oftentimes there are simple yet interesting patterns of MPI process allocation that yield better performance. For example, in a prime sieve problem where each process is a sieve element, it makes sense to co-locate fewer MPI processes at the beginning of the sieve, because more numbers are forwarded through the start of the sieve than through the rest. A simple loop mechanism is available to support patterns like these (see Figure 4.2). A range can be specified for a loop when creating boxes from a container specification, and the variable being looped over can be used in an expression inside the container specification.

container {
  containerspec sievespec = < SieveElement[ 100 * i ] >
  R = [1..8]
  boxes ( for i in R { sievespec } )
}

Figure 4.2: Loops can be used to create a simple pattern in the number of processes placed into each container. The range R is reusable, the boxes line loops over the range, and the loop variable i is used inside the container specification.

Usually, when assigning MPI processes to container specifications, an MPI process is named by the CPA's process name. However, there are times where the same process is used in multiple compositions, and when allocating them to container specifications we need to clearly define the composition to which the process belongs. In this case, a prefix can be specified on the process name before a colon, and only processes that match the composition prefix are allocated in that specification (see Figure 4.3). A prefix is used because it is common for a whole composition to be duplicated under a different higher-level composition name; in that case, just specifying the higher-level composition name is sufficient, as opposed to using the entire name.

structure twodatastructures () {
  DataElement = inports <…> outports <…> ( )
  ||- DataChain = ( DataElement[numelements] || … )
  ||  Application1 = ( DataChain || … )
  ||  Application2 = ( DataChain || … )
}

containers {
  containerspec datachainspec = < DataChain[*] >
  boxes ( Application1:datachainspec || Application2:datachainspec )
}

Figure 4.3: The prefix of a composition can be specified in the container specification to remove ambiguity of process placement.

4.2 Scaling to Larger or Smaller Resources

Software is developed to be runnable on systems of varying size and performance. For SPMD programs, it is easy to scale up the number of OS processes by simply adjusting the nfg and n values on mpiexec. Instead of a single point of extension as in SPMD, MPMD has many, and scaling is much more difficult because a mapping must specify the MPI rank for each process. Consider, for example, a program requiring that every OS process contain a master process. In this case, changing the number of co-located MPI processes in an OS process or changing the number of OS processes causes a change in this mapping.
It is not reasonable to expect the mapping to be rewritten every time the machine or the mapping changes: writing it manually is error-prone and tedious, and writing an automatic mapping generator every time is also not efficient.

(CM)2PI provides a generalized solution to this problem by separating the decisions about the program size (the number of MPI processes) from the specification of the target system's resource size (the number of OS processes to spawn). The program size is decided by changing the variables in the CPA-V, while the resource size is specified in the container specification (see Figure 4.4). After any change to these specifications, running the tool automatically generates a new mapping based on the new program size and resource constraints. Through the use of wild cards and + in the CPA-V and the container specification, it is possible either to resize the program to fit the available resources or to fit the same program into a different amount of resources. If the wild card is used in the CPA-V, the program can resize based on the available resources. If the wild card is used in the container specification, the goal is to accommodate the same program size in smaller or larger amounts of available resources.

Figure 4.4: Separation of concerns in the specification allows concise decisions to be changed. The tool automatically fits the program into the resources available and generates the corresponding mapping. (The panels contrast plain FG-MPI with FG-MPI plus (CM)2PI when changing the size of the program and the size of the resources.) For large programs with many processes, changing the mapper and configuration manually is error-prone and grueling.

4.3 Heterogeneous Computation Load

Today's clusters may not be homogeneous in performance due to upgrades and purchases of CPUs with different performance. With MPI, it was difficult to make sure faster cores do more work than the slower cores. This resulted in delays and bottlenecks caused by the slower cores. With FG-MPI, it is possible to co-locate many MPI processes together, and it is then easy to adapt the number of MPI processes to the amount of available compute power, memory, and other resource constraints.

By time-slicing between more work, it is possible to keep a faster core busy while messages are blocked waiting to be retrieved from the slower cores. In MPI, if we want more work for a particular OS process, we need to manually time-slice the code, because we cannot create more OS processes than there are cores: that would oversubscribe the cores and leave the scheduling of those processes up to the OS scheduler. With FG-MPI this becomes easier, as each MPI process is an independent process, yet many of these processes can be co-located to share a core. However, as with the placement of MPI processes onto the available resources, it is hard to modify the mapping by hand for every different cluster where the performance changes.
With (CM)2PI, you can sim-ply specify more or less of a particular MPI process to be placed in an OSprocess. Furthermore, it is possible to set a limitation to the number ofMPI processes that can be put in an OS process. Just setting this valueto be lower for slower cores and higher for fast cores provides better staticallocation of resources.4.4 Glue FunctionsWe want to open up certain places in the (CM)2PI code to give programmersa chance to insert code or to tailor certain functionality. There are 7 placeswhere code can be inserted: the main function’s initialization, the mappingfunction’s initialization, the mapping function’s termination, the process-ing of function parameters, the communicator permutation, the cleanup ofparameters, and the cleanup of the main function.These glue functions can be set to run conditional on certain UNIXenvironment variables (genv) being set. Genv variables can be set in thefunction specification or at run time using as flags -genv FLAG value. Anexample of glue function injection is specified as a section in the function524.4. Glue Functionsspecification (see Listing C.16). For detailed notation see Appendix A.2.5.4.4.1 A: Main InitThe maininit() function has a callback location in the main() functionbefore the call to FGmpiexec(). This function cannot make any calls to theMPI or FG-MPI API.The following is the function prototype:void maininit( int argc, char** argv );4.4.2 B: Map InitThe mapinit() function has a callback location in the per OS process map-ping function. This function is called before FG-MPI maps functions toranks and can make calls to FG-MPI’s mapping API.The following is the function prototype:void mapinit( int argc, char** argv );4.4.3 C: Map FinalizeThe mapfinalize() function has a callback location in the per OS processmapping function. This function is called after FG-MPI maps functions toranks and can make calls to FG-MPI’s mapping API.The following is the function prototype:void mapfinalize( int argc, char** argv );4.4.4 D: Get ParametersThe getparams() function has a callback location right before the executionof the user’s co-routine (non-preemptive threads).The default function is to get the command-delimited parameters fromthe command line. The alternate is to programmatically determine theparameters to each function. This can be done in the wrapper functionwhere the argc and argv are determined. The user can provide functions toprovide custom parameters to the process. The default parameters is the534.4. Glue Functionsfunction’s name along with an parameters specified in the CPA specification.This information is provided as the argv input to the function. The updatedargv is to be pointed to by new argv.The following is the function prototype:void getparams( int argc, char *argv[], int offset,int* new argc, char*** new argv );4.4.5 E: Communicator PermutationThe permutecommunicator() function has a callback location inside theCMCMPI Init() function. This is the code for permuting MPI processeswithin the given communicator. The new communicator with the processesordered after permutation is to be pointed to by new. Permuting the pro-cesses allow for optimizing the position of processes for performance (seeSection 4.5).The following is the function prototype:void permutecommunicator(MPI Comm original, const char* name,MPI Comm* new)4.4.6 F: Free ParametersThe freeparams() function has a callback location after the execution of theuser’s co-routine. 
It is used to clean up any data structures created by theuser in the getparams glue function. The new argv from get params is aparameter to this function.The following is the function prototype:void freeparams(char *new argv[]);4.4.7 G: Main CleanupThe maincleanup() function has a callback location after the execution ofFgmpiexec(). It is used to clean up any data structures created by the userin the maininit() glue function. This function has no access to any MPIor FG-MPI API.The following is the function prototype:544.4. Glue Functionsvoid maincleanup( int argc, char** argv );4.4.8 Glue Code Use CasesBelow are some common use cases supported by the use of glue functions.1. Multiple mapper customizations (e.g. Prime sieve example). Theprogrammer can create a custom order of how processes are to com-municate to each other. A custom order is generated before individualMPI processes start execution. To deliver the information to each pro-cess, the programmer can invoke MPI processes with a custom set ofparameters with the ordering information.Required Input: GEnv, world size, nfg (possibly)Uses: mapinit, getparams, freeparams2. Custom parameters. The programmer can create a function to pro-gramatically provid parameters to each MPI process.Uses: getparams, freeparams3. Global data structure (e.g. Skiplist freenode information). The pro-grammer can initialize some global data structure before the executionof the program. It is then possible to clean up the global data structureat the end of the program’s execution.Uses: maininit, maincleanup4. Comma-delimited parameters from command line. The programmercan get parameters from the command line through the special comma-delimited format (i.e. function name,arg1,arg2,...). This is providedas the default functions default get params which the programmercan wrap around to manipulate the parameters before invoking theMPI processes.Uses: default get params, default free params, getparams, freeparams554.5. Permutation of Processes4.5 Permutation of ProcessesProcesses as specified in the final specification can be permuted in orderto test the performance of different configurations of the processes and howthey communicate. Rather than reconfigure the communications or channelsbetween processes, we permute the positioning of the processes into a virtualorder and then apply any channels to the new ordering (see Figure 4.5).The permutation can be done using the MPI’s communicator translationfunctions. Each permutation is to be performed on the lowest level commu-nicator for a particular type of process. Any process of the same type basedon the hierarchy (not the process type or function name) can be reorderedamongst themselves. This is done through a call back function where theprogrammer provides a function that takes in a communicator and the hi-erarchical name of the communicator and the function transforms it into anew communicator.The permutation is then flattened through the different levels of com-municators from which that process has access. This ensures the orderingto be consistent no matter if the program is using the communicators ashierarchical communicators or using them with channels. For a code sketchof how this is performed see Appendix D.4.5.1 Communicators Permutation Callback FunctionA simple call back function is to be provided by the user as a glue code in thefunction specification. 
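As an illustration only (the reversal policy is our own example, not a required behaviour), such a callback can be written with a single MPI_Comm_split call:

#include <mpi.h>

/* Example permutation callback: reverse the order of the processes in the
 * given communicator.  Any key assignment that defines the desired virtual
 * order would work equally well. */
void comm_permute(MPI_Comm original, const char *name, MPI_Comm *newcomm)
{
    int rank, size;
    (void)name;                      /* the hierarchical name is unused here */

    MPI_Comm_rank(original, &rank);
    MPI_Comm_size(original, &size);

    /* Same colour for everyone; the key determines the new rank order. */
    MPI_Comm_split(original, 0, size - 1 - rank, newcomm);
}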
The original order is provided in the original MPI communicator, and the pointer new must point to the new communicator created by the function. The prototype is:

• void comm_permute(MPI_Comm original, const char* name, MPI_Comm* new);

Figure 4.5: The permutation of processes using communicators. The dark blue circles represent processes and the white circles represent a rank naming at that communicator. Processes are permuted (green arrows) at the lowest level and the permutation propagates back up. Any channel information (red arrows) is based on the virtual ordering in the new communicators.

Chapter 5

Evaluation

We will look at the different ways to use (CM)2PI to improve the performance of a parallel program. Then we will discuss the different overheads that result from using the tool.

5.1 Integration with Optimization Tools

Having the ability to change the process mapping is useful, but having the mapping explored and discovered automatically can expose certain unknown mappings, or at least find them faster than a human can.

General parameter-search optimization tools work by adjusting a set of parameters, each with a valid range, then executing the program with those parameters and checking a "goodness" value. The common parameters that are adjusted when looking for a good mapping are the number of MPI processes, the nfg parameter, and the number of OS processes to spawn. The number of MPI processes can be modified in the CPAV specification, while the other two options are adjusted in the container specification.

Depending on the amount of flexibility to be explored, the container specification needs to be equivalently detailed in order to provide a sufficient number of parameters for the optimizer to adjust. To create parameters for the optimizer to tweak the number of processes on each core, we need to create separate container specifications. However, if there is a distinct pattern that is known to be efficient, then the container specification can be simplified to reduce the number of parameters to search over. As with any other optimization problem, a larger parameter space increases the time required for the optimizer to find an efficient set of parameters.

The other issue is the goodness value needed to evaluate a mapping. In general, the value that matters most is the total time required to run the program from start to end. However, a parallel program's goodness is more than just the total time required; it also involves the amount of idle processing and the amount of communication overhead. With extra information the optimizer can find the optimal point more quickly, but most optimizers require a single goodness value, so further research is required to determine how to calculate this goodness value from the evaluation techniques available. Other software, such as Tau [31], can provide more insight into how well a program runs and the different amounts of overhead that result from the current mapping.

5.1.1 Proof of Concept

The optimizer SMAC [16] was used to optimize a prime sieve program specified using (CM)2PI. The OS processes were binned into four types of container specifications. The optimizer was given a range for the number of sieve elements to put into each of these containers and was then run to minimize the total run time of finding 1000 primes⁵.
The resultwas that it was beneficial to put fewer MPI processes near the front of theprime sieve and more at the end. This is because the front of the primesieve is more busy as fewer elements get through to the rest of the sieve. Itthen makes sense to put fewer of these busy processes together and to putmore of the less busy processes together.An important use of an optimzation tool is the ability to automaticallyfit a parallel program into the available resources. This could be thought ofbeing similar to a ./configure command but rather than working on depen-dencies we adjust the mapping to fit better on the resources available whena program is first run on a new system. This ability to auto scale a parallelprogram facilities the deployment stage.5The number of primes was arbitrarily chosen.595.2. Permutation of Processes5.2 Permutation of ProcessesTraditionally, permuting processes requires the handling on all the commu-nication points or what ranks these processes need to communicate to. Mostcommonly, a look-up table is created in order to determine the communica-tion ranks after the permutation of a process. Often the permutation tableand its associated communication translations become confusing.With the addition of channels it is now possible to specify the commu-nication between processes yet be able to move them around to optimizefor performance. Processes can be permuted by creating a new communica-tor with a different ordering of the member processes. The communicatoris part of MPI’s implementation is is easy to understand as a permutationfrom one communicator to the next. FG-MPI has optimized the storage ofcommunicator information to being stored once per OS process. Permut-ing the processes allow for a balance of having more processing cores activewhile trading off for communication overhead.5.3 Adapting to Resource SizeMost programs don’t only run on the computer or cluster it was developedon. Eventually, it needs to run on larger or smaller system. There are twoaspects of a parallel program that can be adjusted to increase performancewhen running on a new system: the number of MPI processes, and thenumber of OS processes.In the original FG-MPI mapper, the mapper maps ranks to functions.For SPMD programs, this was easy to create simple patterns that workregardless of the number of ranks. However, for MPMD programs, theremay be requirements to the types of processes that must exist on an OSprocess and the mapper is not as general. If it had a pattern that couldrepeat as more OS processes with the same nfg is added, it was still notable to adapt when nfg changes. Changing this mapping manually is errorprone and is not viable to be recreated every time a system is redeployed.However, with the CPAV and container specifications in place, it is simple605.4. Integration of Separately Developed MPI Programsto change the number of MPI processes and the number of OS processesbecause given the restrictions, the tool will regenerate a new mapping.5.4 Integration of Separately Developed MPIProgramsIn software development, it is rare that all the code of a software system iswritten by a single developer and therefore it is important that integrationof code developed by different developers are easy to combine into one usablesystem. The connection points need to be clearly defined and how the restof the system works is insignificant. Using our composition method, theseparately developed components will have two different sets of specificationfiles for CPA, CPAV, containers, and functions. 
We then need to create a new CPA that includes all the files from before, and then create a new composition that contains the two components. Ideally, the communication entry points between the two compositions can be connected through ports and channels. Then a CPAV is created that initializes this CPA. This is the minimal amount of work required to get a composition of two separately developed programs to work together.

5.5 Performance Impact Evaluation

We will look at the overhead associated with using (CM)2PI. It falls into two main categories: start-up overhead and run-time overhead. Parallelism is used to improve performance, so it is reasonable to assume that programs that are made parallel will run for a sizable amount of time. The start-up overhead is eclipsed as the run time increases.

5.5.1 Process Mapping Storage Requirement

The FG-MPI mapper requires that a mapper function be provided where, given a rank number, a function pointer is returned such that a co-routine can be started to run that function. To minimize the storage requirement, this information is stored on three levels.

1) One level describes the co-located processes inside an OS process. To store this information, a loop around a statement returning a function pointer is used to indicate the number of that function to spawn as co-located processes.

2) The next level indicates the number of these OS processes to spawn; a loop around the previous level is used to indicate this.

3) The third level is a collection of the previous two levels to indicate the variety of OS processes to be spawned.

In other words, the first level indicates the nfg, the second level indicates the N, and the third level is a collection of those to form a complete configuration. The space complexity is of the same order as the MPI config file, where each line indicates the n and nfg to launch for a particular set of functions.

5.5.2 Process Discovery

Process discovery is finding a mapping between process names, ranks, OS process IDs, and node IDs. The mapping between process names and ranks uses the same information as that required by the FG-MPI mapper to map functions to ranks. The OS process ID is also discoverable from the information in the process mapping. The node ID is discovered through communication among the OS processes. Other than the node IDs (stored as a single integer per OS process) there is no extra information stored for this purpose. The information is extracted when it is requested through the API. However, this information is only required once at the beginning of the program, so it can be seen as a start-up overhead.

The node ID information is obtained by performing a single round of ring communication in which each OS process declares its lowest process ID number. Since the process IDs assigned by the tool form a sequential and contiguous range, having the lowest process ID number of every node is sufficient to determine which node each of the remaining process IDs belongs to. The discovery of node information is performed only once at the beginning of execution as part of the call to CMCMPI_Init().

5.5.3 Hierarchical Communicators

The hierarchical communicators are created using the MPI_Comm_split() function. This action is only performed once at the start of execution as part of CMCMPI_Init(). Any communicator permutations provided through the user's glue code are also run at this point.
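To illustrate the kind of work performed at this point, the following sketch derives a two-level hierarchy of communicators with plain MPI calls. It is a generic example, not the code that (CM)2PI generates; the colour values stand in for the name-prefix levels and are assumptions made for the sketch.

#include <mpi.h>

/* Generic sketch: build a communicator hierarchy by repeated splitting. */
int main(int argc, char *argv[])
{
    int rank;
    MPI_Comm level1, level2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Level 1: split MPI_COMM_WORLD into two top-level compositions. */
    int level1_colour = (rank < 4) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, level1_colour, rank, &level1);

    /* Level 2: split each level-1 communicator further; sibling
     * communicators can split concurrently, since only the members of
     * level1 participate in this call. */
    int level2_colour = rank % 2;
    MPI_Comm_split(level1, level2_colour, rank, &level2);

    /* ... collective communication on level1/level2 as needed ... */

    MPI_Comm_free(&level2);
    MPI_Comm_free(&level1);
    MPI_Finalize();
    return 0;
}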
5.5.4 Channels Information

The code that is generated for channel information has the same space complexity as the CPA specification. That means that when ranges are used in the specification, they are translated into corresponding loops in the generated code. Any intermediate ports used in the CPA specification are factored out. In order for this information to be easily accessible to MPI processes, it is expanded into structures for each MPI process. However, only the information relevant to the MPI processes within the OS process is expanded. This expansion is performed as part of the call to CMCMPI_Init().

5.5.5 CMCMPI_Init

CMCMPI_Init(), as described before, includes the initialization of the node information, communicators, and channels. All of this work is performed only once at start-up. This does not affect the run-time performance of the user's computation. Given a sufficiently large problem, this start-up overhead is minimized in proportion to the overall run time. The call to CMCMPI_Init() and the corresponding cleanup API CMCMPI_Finalize() are optional. If these functions are not executed then the features they initialize will not be available, but the extra start-up overhead is eliminated.

5.5.6 Flow-Controlled Channels

The flow-controlled channel APIs are the only functions that add an operational overhead when used. The send and receive functions are executed during messaging. There is the overhead of having a proxy in the middle as well as a few extra messages (token updates, synchronizations). From the source to the proxy, there is a synchronization overhead due to occasionally synchronized sends that block the sender if the proxy is full. There is also an occasional token update message from the receiver back to the proxy to indicate how many messages can be sent to it.

These two control messages can, however, be tuned to be sent more or less frequently. In comparison, the traditional send and receive model uses a single direct message from source to destination. However, the traditional send and receive model has a higher potential to exhaust receive buffers due to memory constraints and thus cause a failure. Flow control adds some overhead to avoid this problem and to prevent messages from flooding a receiver.
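To illustrate why the occasional synchronization matters, the following sketch shows a common MPI flow-control idiom: every windowth message is sent synchronously so that the sender cannot run arbitrarily far ahead of the receiver. This is a generic pattern, not the library's actual proxy implementation, and the window size is an arbitrary choice for the example.

#include <mpi.h>

/* Generic flow-control idiom (not (CM)2PI's proxy code): bound how far a
 * sender can run ahead by making every "window"-th send synchronous. */
void send_stream(int *data, int count, int dest, int tag, MPI_Comm comm)
{
    const int window = 16;   /* arbitrary window size for the example */
    int i;

    for (i = 0; i < count; i++) {
        if ((i + 1) % window == 0) {
            /* MPI_Ssend returns only once the matching receive has
             * started, so the sender blocks until the receiver catches
             * up, bounding the number of buffered messages. */
            MPI_Ssend(&data[i], 1, MPI_INT, dest, tag, comm);
        } else {
            MPI_Send(&data[i], 1, MPI_INT, dest, tag, comm);
        }
    }
}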
An example of this is the prime sieve problem, where each MPI process is a sieve element. As long as the generator does not receive a message that the required number of primes has been found, it keeps pumping out numbers. This flow, if uncontrolled, floods the receivers downstream. Furthermore, when the terminate message is finally received by the sender, the termination message needs to wait for all the previous messages to clear before traveling down the sieve to terminate it. That means that by the time the last prime is found, many useless numbers have been pumped down the chain, using up compute cycles and delaying the movement of the termination message.

On UBC's cluster Cyclops, the communication-intensive prime sieve problem was run on 16 compute nodes, each with 8 cores. One version used the library's communication channels, while the other was pure FG-MPI without the use of the proposed library in any way. After three runs (at least three runs were used to smooth out anomalies), each calculating 38400 primes, the version using flow-controlled channels took on average 109 seconds, while the non-flow-controlled version took 70 seconds (36% less time when using regular send and receive). However, the actual time spent running the generator was 62 seconds using regular sends and receives, while the flow-controlled channels consumed 64 seconds, resulting in only a 3% penalty in performance. The extra run time (less than 45 seconds) was spent doing other one-time start-up initialization, such as starting up proxy processes in addition to the 38400 MPI processes and setting up their communication channels. For a program that runs for a long time, the start-up cost may not be a significant performance loss.

The OSU latency benchmark [33] was run; for inter-node communication on the Cyclops cluster, 4-byte messages had a latency of 15.7 microseconds for FG-MPI, while the channels version had a latency of 26.3 microseconds (a 68% increase). For 4 KB messages, the regular FG-MPI latency was 62.5 microseconds, while the channels version also had a latency of 62.5 microseconds (a 0% increase).

Chapter 6
Conclusion

In this thesis, we have proposed a four-stage approach to specifying parallel software architecture and placement. This allows a separation of concerns, with separate specifications for architecture interaction, software size, resource capacity, and function. The architecture specification borrows ideas from component-based software design, while the syntax is based on FSP. We have also implemented (CM)2PI, which converts the specifications into boilerplate that can be compiled with the user's code to produce a runnable system.

The four-stage approach includes the Communication Process Architecture (CPA) for specifying the architecture design and communication interactions, the Communication Process Architecture Variables (CPAV) for parameterizing the architecture, the container specification for defining the available resources, and the function specification for assigning functions to the processes defined in the architecture. This decoupling of decisions allows the user to make distinct changes to the system without having to make compatible changes to both the execution configuration and the mapper. By externalizing MPI ranks, the developer does not have to edit their program.

We looked at three different types of communication techniques to introduce channel communication and make it easier to compose systems independently of MPI rank. The aim of these techniques was to make the program code depend solely on the CPA architecture specification and be fully compatible regardless of the decisions made in the other stages (especially with regard to system size and process placement).

The first communication technique we looked at was process discovery using the names of the processes as defined in the architecture. This service is analogous to a "white pages" service. The lookup is done at the beginning of execution and does not incur an extra per-message cost, as the communication is still determined by the user.

The second communication technique we considered was the use of hierarchical communicators. They are initialized at the start of execution and their naming is also based on the architecture design. These are more suited to collective communication.

The third communication technique we examined was channels.
This method was the most concise, as the code only needs to know the local port names defined in the architecture for that process. The actual channel is specified between ports in the architecture. Regardless of how a process is composed, the only information it needs in order to communicate is the names of its ports.

To tackle the common problems of flow control and termination, we also proposed flow-controlled channels, which build on channel communication by providing a communication API that handles flow control and termination. This method incurs a per-message overhead, but the overhead is mostly in start-up, so its effect is amortized over time. Flow-controlled channels were included as a way to simplify the handling of common communication issues in MPI. Even though this method does not guarantee that the receiver will not run out of buffer space (because there may be too many senders), it reduces the chance of that happening, especially as a result of a single channel.

With the decoupled specifications, the parameters for system size and resource size are distinct and are suitable for integration with other optimization tools. By using optimization tools to tune parameters, it is possible to automatically fit parallel systems to their resources wherever they are deployed. This makes parallel programs more versatile and scalable.

6.1 Future Work

The current CPA architecture design allows compositions and simple repetitions to create commonly used patterns. However, some of the more complex but well-studied grid networks, like hypercubes and toruses, were not explored in this research. It may be interesting to see what kind of changes or support would be needed in order to specify those topologies. The goal is still to keep the specification as general as possible, because there could be endless numbers of architecture designs, and complex architectures should intuitively require a more complex specification on the user's part.

Currently, the integration with optimization tools requires that the search space be compiled ahead of time. In a way, this is efficient: if a search with random starts is used, the search space does not have to be recompiled again. However, in a single search, a lot of the search space might not be used, and pre-compiling a large space creates a large executable and takes extra time. A better approach would be to recreate the tool as a run-time tool developed in C. Then, during execution, the specification is read and the appropriate mapping is created dynamically and run.

If mappings can be created dynamically, it may also be possible to perform dynamic load balancing. This way, it is possible to make fine adjustments as the code runs. This was not implemented in our research because it would then be difficult to verify and inspect the resulting mappings if no code is generated. Currently, the generated code is intended to be understandable by users and to be similar to what a user might have written as boilerplate. Furthermore, our focus was on finding static placements for system software, which will then use all of its run-time compute power on the task and not on the placement. However, more flexibility would result if the tool were run-time based.

The issue of the receiver buffer's finite space is partially mitigated by the flow-controlled channels, so that a single channel should not overwhelm a single target. However, if there were many channels sending to one receiver, even receiving just one message from each channel may be sufficient to overwhelm the receiver's buffer.
With FG-MPI, this issue is exacerbated as moreco-located MPI processes means there is less memory available to each MPIprocess. MPI program correctness issues are interesting future research.While integration with optimizations tools require a single goodnessvalue, the overall run time of a program might not always be the best indi-cation of performance. Considering the underlying issues are idle-processing686.1. Future Workand communication overhead, further research could investigate the calcu-lation of a better value to optimize on. This is especially important forprograms that run continuously rather than terminate.69Bibliography[1] The Go programming language. http://golang.org/. Accessed Au-gust 25, 2014.[2] Mixed integer linear programming solver lpsolve. http://sourceforge.net/projects/lpsolve/. Accessed August 25, 2014.[3] MPI: A message passing interface. In Supercomputing ’93. Proceedings,pages 878–883, Nov 1993.[4] Joe Armstrong. The development of Erlang. In ACM SIGPLAN No-tices, volume 32, pages 196–203. ACM, 1997.[5] Greg Burns and Raja Daoud. Robust message delivery with guaranteedresources. In Proceedings, MPIDC’95, May 1995.[6] Ivica Crnkovic and Magnus Larsson. Component-based softwareengineering-new paradigm of software development. Invited talk andreport, MIPRO, pages 523–524, 2001.[7] Robert H Dennard, Fritz H Gaensslen, V Leo Rideout, Ernest Bas-sous, and Andre R LeBlanc. Design of ion-implanted MOSFET’s withvery small physical dimensions. IEEE Journal of Solid-State Circuits,9(5):256–268, 1974.[8] Narayan Desai, Ewing Lusk, and Rick Bradshaw. A composition envi-ronment for MPI programs. International Journal of High PerformanceComputing Applications, 21(2):166–173, 2007.[9] The Rust Project Developers. The Rust programming language. http://www.rust-lang.org/. Accessed October 31, 2014.70Bibliography[10] Ian Foster. Designing and Building Parallel Programs: Concepts andTools for Parallel Software Engineering. Addison-Wesley LongmanPublishing Co., Inc., Boston, MA, USA, 1995.[11] David Goodell, William Gropp, Xin Zhao, and Rajeev Thakur. Scalablememory use in MPI: a case study with MPICH2. In Recent Advancesin the Message Passing Interface, pages 140–149. Springer, 2011.[12] Joseph D Gradecki and Jim Cole. Mastering Apache Velocity. JohnWiley & Sons, 2003.[13] MIT CSAIL Supertech Research Group. The Cilk project. http://supertech.csail.mit.edu/cilk/. Accessed October 31, 2014.[14] James Hanlon, Simon J. Hollis, and David May. Scalable data ab-stractions for distributed parallel computations. CoRR, abs/1210.1157,2012.[15] C. A. R. Hoare. Communicating sequential processes. Commun. ACM,21(8):666–677, August 1978.[16] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Parallel algorithm con-figuration. In Proc. of LION-6, pages 55–70, 2012.[17] F. Hutter, H. H. Hoos, K. Leyton-Brown, and K. P. Murphy. Time-bounded sequential parameter optimization. In Proc. of LION-4, 2010.To appear.[18] Costin Iancu, Steven Hofmeyr, Filip Blagojevic, and Yili Zheng. Over-subscription on multicore processors. In Parallel & Distributed Pro-cessing (IPDPS), 2010 IEEE International Symposium on, pages 1–11.IEEE, 2010.[19] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fet-terly. Dryad: distributed data-parallel programs from sequential build-ing blocks. In ACM SIGOPS Operating Systems Review, volume 41,pages 59–72. ACM, 2007.71Bibliography[20] Humaira Kamal, Seyed M. Mirtaheri, and Alan Wagner. Scalability ofcommunicators and groups in MPI. 
In Proceedings of the 19th ACMInternational Symposium on High Performance Distributed Computing,HPDC ’10, pages 264–275, New York, NY, USA, 2010. ACM.[21] Humaira Kamal and Alan Wagner. A Signpost on the Road toExascale. Available from http://www.westgrid.ca/westgrid_news/2013-01-14/ubc_researchers_use_westgrid_explore_exascale_computing.[22] Humaira Kamal and Alan Wagner. FG-MPI: Fine-grain MPI for mul-ticore and clusters. In 11th IEEE Intl. Workshop on Parallel and Dis-tributed Scientific and Engineering Computing (PDSEC) held in con-junction with IPDPS-24, pages 1–8, April 2010.[23] Humaira Kamal and Alan Wagner. Added concurrency to improveMPI performance on multicore. In Parallel Processing (ICPP), 201241st International Conference on, pages 229 –238, Sept. 2012.[24] Humaira Kamal and Alan Wagner. An integrated runtime scheduler forMPI. In Jesper Larsson Trff, Siegfried Benkner, and Jack J. Dongarra,editors, Recent Advances in the Message Passing Interface, volume 7490of Lecture Notes in Computer Science, pages 173–182. Springer BerlinHeidelberg, 2012.[25] Humaira Kamal and Alan S. Wagner. An integrated fine-grain runtimesystem for MPI. Computing, pages 1–17, 2013.[26] Argonne National Laboratory. MPICH high-performance portableMPI. http://www.mpich.org/. Accessed October 31, 2014.[27] Jeff Magee, Naranker Dulay, Susan Eisenbach, and Jeff Kramer. Spec-ifying distributed software architectures. In Software EngineeringESEC’95, pages 137–153. Springer, 1995.[28] Jeff Magee and Jeff Kramer. Concurrency - state models and Javaprograms. Wiley, 1999.72[29] Timothy Mattson, Beverly Sanders, and Berna Massingill. Patternsfor Parallel Programming. Addison-Wesley Professional, first edition,2004.[30] Terence J. Parr and Russell W. Quong. Antlr: A predicated-ll (k) parsergenerator. Software: Practice and Experience, 25(7):789–810, 1995.[31] Sameer S Shende and Allen D Malony. The TAU parallel performancesystem. International Journal of High Performance Computing Appli-cations, 20(2):287–311, 2006.[32] Jeff Squyres. Unsung heros: MPI run time en-vironments. http://blogs.cisco.com/performance/unsung-heros-mpi-run-time-environments/. Accessed October 7,2014.[33] The Ohio State University. MVAPICH::Benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks/. Accessed October 9,2014.[34] Judicael A Zounmevo and Ahmad Afsahi. An efficient MPI messagequeue mechanism for large-scale jobs. In Proceedings of the 2012 IEEE18th International Conference on Parallel and Distributed Systems,pages 464–471. IEEE Computer Society, 2012.73Appendix ASpecification NotationA.1 Composition NotationA.1.1 MPI Process DeclarationSimilar to a function prototype, each MPI process is specified with param-eters but they are dynamic input and output parameters (as ports for MPIcommunication). Each MPI process can also have static parameters (asparameters to function) provided at time of function invocation.Consider the following: The MPI process P has two input ports IN 1and IN 2 as well as two output ports OUT 1 and OUT 2 along with twostatic parameters of type PARAM TYPE 1 and PARAM TYPE 2:P = inports outports (PARAM_TYPE_1, PARAM_TYPE_2)A.1.2 Concurrent Composition of MPI ProcessesTo group a set of MPI processes to share a single communicator, a com-position is created. The multiplicity of each process can be specified as aequation (which can include variables - to be specified). 
The default multiplicity is 1.

In the example below, the concurrent execution of MPI process P (NUM_P instances) and S is represented by the composition Q:

|| Q = ( P[NUM_P] || S )

A.1.3 Concurrent Composition of Compositions

To group a set of compositions to share a single communicator, a new composition can be created. The multiplicity of each composition is specified as an equation (which can include variables to be specified). The default multiplicity is 1.

In the example below, the concurrent execution of MPI process P, NUM_Q instances of composition Q, and composition R is represented by the new composition T:

|| T = ( P || Q[NUM_Q] || R )

A.1.4 Intermediate Composition

Compositions can be regular compositions (used for instantiation) or intermediate compositions (to be further composed inside another intermediate or regular composition). The minus sign (-) is used to indicate an intermediate composition.

In the example below, the composition Q is an intermediate composition while T is a regular composition.

||- Q = ( P || S )
|| T = ( P || Q || R )

A.1.5 Communication Channels

Compositions of MPI processes or of other compositions can create named directional communication channels. An optional tag for the channel can be specified. The process name can be prefixed to the port to clarify which process the port belongs to. If only one of the processes possesses this port then the name is optional.

In the example below, the composition Q contains a concurrent execution of P and S where a channel is created from port NEXT of process P to port PREV of process S with a tag MSGTAG (a variable to be specified). Also, a channel is created from port TERMINATE of process P to the only process that contains the port END. When using flow-controlled channels, a buffersize value can also be set for the buffer proxy's buffer size. The / symbol represents "rename" in [28]. We have adapted that symbol to represent the declaration of channels.

|| Q = ( P || S )
/ ( P.NEXT -> S.PREV tag MSGTAG buffersize 10, P.TERMINATE -> END)

A.1.6 Port Hiding

All unconnected ports (ports for which no channels were created) are by default not exposed beyond the current composition. To expose a port outside of the composition, an intermediate port must be specified. The @ symbol represents "expose".

In the example below, the port S.NEXT is connected to the intermediate port EXT_NEXT, which is available to further compositions of Q:

|| Q = ( P || S )
/ ( S.NEXT -> EXT_NEXT )
@ ( EXT_NEXT )

A.1.7 Patterns for Communication Port Declaration

Loops can be used to assign channels to ports in a simple pattern. An equation can be used to identify the port of the nth process of that type.

In the example below, a range R is created from 0 to END inclusive. This range can be reused throughout the rest of the specification. The composition Q contains NUM_R instances of process R, and a channel is created from the port NEXT of the previous process to the port PREV of the next process in a chain-like fashion.

R = [0.. END]
|| Q = ( R[NUM_R] )
/ (for i in R { NEXT[i] -> PREV[i+1]} )

A.1.8 Forming a Structure

The MPI process declarations and compositions are to be specified as the architecture structure.
Any uninitialized variables must be listed after the structure name so that they can be declared later.

A structure named STRUCTURE_NAME with variables NUM_P and PARAM_TYPE_1 is wrapped as follows:

structure STRUCTURE_NAME (NUM_P , PARAM_TYPE_1)
{
P = inport <> outport <> (PARAM_TYPE_1)
...
|| Q = ( P[NUM_P] )
}

A.1.9 Variable Initialization

Any uninitialized variables from a structure must be initialized. A variable can be a simple equation of another variable.

In the example below, the variable VAR_1 is initialized to the value 10 and the variable VAR_2 is initialized to the value of VAR_1 + 5.

VAR_1 = 10
VAR_2 = VAR_1 + 5

A.1.10 Initializing All Variables

All variables that are uninitialized in a structure need to be provided in an initialization. An initialization contains one or more declarations of a variable's value. The structure to be initialized is specified in the initialization specification. The declaration of extra variables that are not used by the structure is acceptable.

In the example below, we have an initialization of the structure STRUCTURE_NAME that declares the values of variables VAR_1 and VAR_2.

initialize STRUCTURE_NAME
{
VAR_1 = 10
VAR_2 = VAR_1 + 5
}

A.2 Placement Notation

The following is a description of the notation used for specifications related to program placement:

A.2.1 Container Specification

A container specification states how many of each MPI process can be in an OS process. The MULTIPLICITY of each MPI process may be a solvable equation (using operators - + * /), a wild card (*) indicating zero or more, or a plus (+) indicating one or more. The maximum combined multiplicity is indicated by the optional MAX_MULTIPLICITY value. An optional composition prefix COMP_PREFIX can be specified.

containerspec CONTAINER_NAME = < COMP_PREFIX.PROC_NAME_1[MULTIPLICITY_1], PROC_NAME_2[MULTIPLICITY_2], ... > / MAX_MULTIPLICITY

A.2.2 Container Allocation to OS Processes

The specification of the number of OS processes to allocate to each container specification is of the form:

boxes ( CONTAINER_SPEC_1[MULTIPLICITY_1] || CONTAINER_SPEC_2[MULTIPLICITY_2] || ...)

A.2.3 Forming a Container Specification

If an equation is used for the multiplicity, a simple loop pattern can also be used. The above two pieces of container information are wrapped as follows:

containers {
R = [0.. END]
containerspec CONTAINER_NAME = < PROC_NAME_1[LOOPVAR*MULTIPLICITY_1], PROC_NAME_2[MULTIPLICITY_2], ... > / MAX_MULTIPLICITY
...
boxes ( for LOOPVAR in R { CONTAINER_SPEC_1 } || CONTAINER_SPEC_2[MULTIPLICITY_2] || ...)
}

A.2.4 Mapping Functions to MPI Processes

To specify the C function name (with the standard main function signature int PFUNCTION(int argc, char* argv[])) and the corresponding header file, the form is as follows:

functionmapping
{
P = { "PHEADER.h", "PFUNCTION" }
...
}

A.2.5 Glue Specification

Glue functions (callback functions) can be specified. All parts are individually optional. The header file that has the function's prototype is to be provided as well as the name of the function. See Section 4.4 for function signature requirements.

glue {
gluemapping {
genvcondition = { "ENV_CONDITION", "CONDITION_VALUE" }
mapinit = { "HEADER_1", "function_1" }
mapfinalize = { "HEADER_2", "function_2" }
maininit = { "HEADER_3", "function_3" }
maincleanup = { "HEADER_4", "function_4" }
getparams = { "HEADER_5", "function_5" }
freeparams = { "HEADER_6", "function_6" }
permutecommunicator = { "HEADER_7", "function_7" }
}
}

A.2.6 Environment Variables

Global environment variables to be passed to the program can be specified as follows:

genv {
GENV_1 = GENVVALUE
...
}

A.2.7 Combining Specifications

Different specification files can be passed to the tool individually or imported by other specification files.

include SPEC_1.containers
include SPEC_1.functions
...

Appendix B
API

B.1 Process Discovery API

The following is the API for discovering MPI processes that are run using the generated mapping. Each API call returns 0 on success and -1 on error.

• int CMCMPI_get_info_by_rank(int rank, int* opid, const char** name);
Gets the OS process ID and the name of the given rank.

• int CMCMPI_get_ranks(int opid, char* name, int** ranks, int* size);
Gets the ranks with the given OS process ID and name.

• int CMCMPI_get_all_ranks(char* name, int** ranks, int* size);
Gets the ranks with the given name.

• int CMCMPI_get_collocated_ranks(char* name, int** ranks, int* size);
Gets the ranks with the given name in the same OS process as the caller.

• int CMCMPI_Get_numnodes(int* size);
Gets the total number of nodes the program is executing on.

• int CMCMPI_Get_nodeid(int* id);
Gets the current process's node ID. Node IDs range from 0 to size-1.

• int CMCMPI_Get_node_pids(int nodeid, int** pids, int* size);
Gets the OS process IDs of the processes in the specified node.

• int CMCMPI_Get_local_node_pids(int** pids, int* size);
Gets the OS process IDs of the processes on the current node.
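As a usage sketch of the process discovery API, a process could look up every rank running under a given architecture name and send a value to each of them. The process name "Worker", the payload, and the assumption that the caller does not need to free the returned ranks array are all illustrative, not prescribed by the API above.

#include <stdio.h>
#include <mpi.h>
#include "cmcmpi.h"

/* Usage sketch only: "Worker" is a made-up process name from a CPA. */
void notify_workers(int value)
{
    int *ranks = NULL;
    int size = 0;
    int i;

    if (CMCMPI_get_all_ranks("Worker", &ranks, &size) != 0) {
        fprintf(stderr, "no processes named Worker found\n");
        return;
    }
    for (i = 0; i < size; i++) {
        /* The discovered ranks are ordinary MPI ranks in MPI_COMM_WORLD. */
        MPI_Send(&value, 1, MPI_INT, ranks[i], 0, MPI_COMM_WORLD);
    }
}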
B.2 Communicators API

A simple API is available for the user's code to obtain the instantiated communicators. Each API call returns 0 on success and -1 on error.

• int CMCMPI_Init();
Initializes the communicators (creates them) and other process information. To create communicators efficiently, they are created in parallel where possible. Since this is a hierarchical pattern, once a child communicator is created, further splitting into other communicators does not require the cooperation of the sibling communicators. This means that all the siblings can split in parallel. If one process calls this function, all processes must call this function. The call creates a structure containing information for all the collocated processes in the current OS process, including, for each process, its communicators and a pointer to its name (const char*).

• int CMCMPI_Finalize();
Frees any structures created to store communicators and other process information. The communicators themselves are also freed.

• int CMCMPI_Comm_get(int levels, MPI_Comm* comm);
Gets the communicator for the particular level. Level 0 is MPI_COMM_WORLD. Communicators are then created for each level of the naming prefix. For example, consider a process named App.Data.Info. At level 1, a communicator contains all processes that share the prefix App. At level 2, a communicator contains all processes that share the prefix App.Data. At level 3, a communicator contains all processes that share the prefix App.Data.Info.

• int CMCMPI_Comm_get_by_name(char* name, MPI_Comm* comm);
Gets the communicator for the particular naming prefix. A blank name gets MPI_COMM_WORLD.

• int CMCMPI_name(const char** name);
Gets the current process's name. The name is based on the composition specified in the CPA.

• int CMCMPI_ospid(int* ospid);
Gets the current process's OS process ID.

B.3 Channels API

A simple API is available for the user's code to obtain the instantiated channel information that was specified in the CPA.

• int CMCMPI_get_chan_procname(int id, chan_procname_t** channel);
Gets the channel information struct. The values inside the struct are available to the program. For example, to get the next and previous process of a prime sieve process, see Listing B.1.

Listing B.1: Example usage of channel structs
chan_primeelement_t* chan_info = get_chan_primeelement(rank);
int prevproc = chan_info->prev.rank;
int prevtag = chan_info->prev.tag;
int nextproc = chan_info->next.rank;
int nexttag = chan_info->next.tag;

B.4 Flow-Controlled Channels API

In the following, we give the API for the flow-controlled channel library.

• Channel Status
The return codes for the API calls below are as follows:
CHANNEL_NEW: Indicates that the channel has not been opened.
CHANNEL_OPEN: Indicates that the channel has been opened and is ready for sending (for the outbound end) and receiving (for the inbound end).
CHANNEL_CLOSING: Indicates that the receiver has requested that the channel be closed. The channel is still effectively open at this status.
CHANNEL_CLOSED: Indicates that the channel is now closed and that no further messages can be sent (for the outbound end) or received (for the inbound end).

• int channel_open_send(CMCMPI_Chaninfo_t* chan);
Opens an outbound channel. Returns the channel status. If the status is CHANNEL_OPEN then messages can be sent to the channel. If the status is CHANNEL_CLOSED then the channel is closed and no messages can be sent to it.

• int channel_open_recv(CMCMPI_Chaninfo_t* chan, int messagesize);
Opens an inbound channel. Returns the channel status. If the status is CHANNEL_OPEN then messages can be received from the channel.

• int channel_send(void* buf, int count, MPI_Datatype type, CMCMPI_Chaninfo_t* chan);
Sends the data in buf to the channel. Returns the channel status.

• int channel_recv(void* buf, int count, MPI_Datatype type, CMCMPI_Chaninfo_t* chan);
Receives data from the channel into buf. Returns the channel status. If the status is CHANNEL_CLOSED then buf is invalid and the channel is now closed (all future receives will have status CHANNEL_CLOSED). The reason for piggybacking the channel status on the data is to exploit MPI's message ordering guarantee as well as to reduce extra messages carrying only status information. By using the same source, destination, and tag to send data and status information, we can ensure that the status and data messages arrive in order. That means that if the last message sent is a CHANNEL_CLOSED status, then all previous data is guaranteed to arrive at the receiver before this status message.

• int channel_close(CMCMPI_Chaninfo_t* chan);
If a sender invokes this function, then the channel is closed to this sender and no further messages can be sent. If a receiver invokes this function, then the channel remains open but the channel status changes to CHANNEL_CLOSING for the sender. It is then up to the sender to invoke this function to close the channel.

• int channel_status(CMCMPI_Chaninfo_t* chan);
Returns the current channel status of the channel.
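To complement the receive-side example in Listing C.17, the following sketch shows how a producer might drive an outbound flow-controlled channel until the receiver asks for termination. The generated struct name CMCMPI_Chan_primestart_t and the lookup function CMCMPI_Get_chan_primestart are assumed by analogy with Listing C.17; the actual names depend on the CPA being used.

#include <mpi.h>
#include "cmcmpi.h"

/* Sketch of the sending end of a flow-controlled channel. */
void generate_numbers(int rank)
{
    CMCMPI_Chan_primestart_t* chan_info;   /* assumed generated type */
    int number = 2;

    CMCMPI_Get_chan_primestart(rank, &chan_info);

    if (channel_open_send(&(chan_info->next)) != CHANNEL_OPEN) {
        return;   /* receiver already closed the channel */
    }

    while (1) {
        int status = channel_send(&number, 1, MPI_INT, &(chan_info->next));
        if (status == CHANNEL_CLOSING || status == CHANNEL_CLOSED) {
            /* Receiver requested termination; acknowledge by closing. */
            channel_close(&(chan_info->next));
            break;
        }
        number++;
    }
}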
Appendix C
Examples

In this appendix, we discuss how to use (CM)2PI through three examples:

• A hello world program shows how to make an MPI program run using FG-MPI with the help of (CM)2PI.
• A prime sieve program shows how to create ports and use channels.
• A farm program shows how to compose farm programs, to look at more complex usages of channels.

C.1 Flow of Execution

Figure C.1: The order in which the tool is run from compilation to execution of the program.

All the specifications are input into the (CM)2PI tool. The typical execution scenario (see Figure C.1) involves compiling the output of the tool into a cmcmpi.a object file and a config file. Then the user compiles their own code and links it with the cmcmpi.a object file to create an executable. A shell script containing the mpiexec command is generated by the mapping's Makefile.

C.1.1 Compiling the Generated Code

The tool's output is compiled by calling the generated Makefile and providing it with the final executable name through the APP variable. The following is how the generated Makefile should be called:

make APP=execname

C.2 HelloWorld Example

The first example shows the minimal basics required to go from specifying a program in MPI to FG-MPI to using the (CM)2PI tool. This basic example does not expose many of the advantages of using the (CM)2PI tool.

C.2.1 MPI

The MPI version of the program is shown in Listing C.1.

Listing C.1: helloworld.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello I am MPI process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compiled using:
mpicc helloworld.c -o hello

Executed to run (for example, on 4 cores) using:
mpiexec -n 4 ./hello

C.2.2 FG-MPI

To make the program work with FG-MPI, we basically need to add a mapping boilerplate (see Listing C.2), include fgmpi.h, and modify the name of the main function (see Listing C.3). The main function's name needs to be changed because the main function in the boilerplate needs to start up FG-MPI.

Listing C.2: boilerplate.c
#include <stdio.h>
#include <mpi.h>
#include <fgmpi.h>

/* Forward declarations */
int mymain(int argc, char **argv);

/* mapping function */
FG_ProcessPtr_t my_map_function(int argc, char** argv, int rank)
{
    if ( (rank == MAP_INIT_ACTION) || (rank == MAP_FINALIZE_ACTION) ) return (NULL);
    else return (&mymain);
}

FG_MapPtr_t lookupfunc( int argc, char** argv, char* str)
{
    return (&my_map_function);
}

/* MAIN FUNCTION */
int main( int argc, char *argv[] )
{
    FGmpiexec(&argc, &argv, &lookupfunc);
    return 0;
}

Listing C.3: helloworld.c
#include <stdio.h>
#include <mpi.h>
#include <fgmpi.h>

int mymain(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello I am MPI process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compiled using:
mpicc -c boilerplate.c -o boilerplate.o
mpicc -c helloworld.c -o helloworld.o
mpicc boilerplate.o helloworld.o -o helloworld

Executed to run (for example, on 8 cores with 80 MPI processes in total) using:
mpiexec -n 8 -nfg 10 ./helloworld

C.2.3 FG-MPI with (CM)2PI

With (CM)2PI, the boilerplate is no longer needed. Instead, we split the specification into four separate files: cpa (see Listing C.4), cpav (see Listing C.5), containers (see Listing C.6), and functions (see Listing C.7). The program code (see Listing C.8 and Listing C.9) needs to include cmcmpi.h and optionally call CMCMPI_Init() and CMCMPI_Finalize() if their features are required.

Listing C.4: cpa/hello.cpa
include hello.cpav
include hello.container
include hello.functions
structure helloworld (numHello) {
HelloWorld = inports <> outports <> ()
|| HW = ( HelloWorld[numHello] )
}

Listing C.5: cpa/hello.cpav
initialize helloworld {
numHello = 10
}

Listing C.6: cpa/hello.container
containers {
containerspec helloWorldContainer = < HelloWorld [*] >
boxes ( helloWorldContainer [8] )
}

Listing C.7: cpa/hello.functions
functionmapping {
HelloWorld = { "hello.h", "mymain" }
}

Listing C.8: helloworld.c
#include <stdio.h>
#include <mpi.h>
#include <fgmpi.h>
#include "cmcmpi.h"
#include "hello.h"

int mymain(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    CMCMPI_Init();      // OPTIONAL: If communicators/channels/node information not used
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello I am MPI process %d of %d\n", rank, size);
    CMCMPI_Finalize();  // OPTIONAL: If communicators/channels/node information not used
    MPI_Finalize();
    return 0;
}

Listing C.9: hello.h
/* Processes */
int mymain(int argc, char **argv);

Compiled using:
../release/fgmapper.sh cpa/hello.cpa
(cd mapping; make APP=helloWorld)
mpicc -c -Imapping helloworld.c -o helloworld.o
mpicc helloworld.o mapping/cmcmpi.a -o helloWorld

Executed to run (for example, on 8 cores with 80 MPI processes in total) using:
./run-mapping-helloWorld-0.sh

C.3 PrimeSieve Example

The prime sieve consists of three types of processes: the generator, the sieve elements, and the last element (see Figure C.2).

The generator's job is basically to pump out incrementing numbers into the sieve. It has one output port that it sends numbers out of.

Figure C.2: The generator needs to send out numbers. The sieve elements receive numbers from the front of the sieve and pass numbers on. The last element needs to terminate the program when it receives a prime. The control channel is used for optimization of the termination process.

Each sieve element takes the first number it receives as a prime number and passes on any future numbers it receives that its number does not divide. This MPI process needs an input port as well as an output port.

The last element is basically the last prime number and has the job of terminating the program.
This MPI process needs an input port for receiving numbers only.

The basic ports described above are the bare minimum required for this program to work. Termination requests on the channels start from the last element and propagate to the start, which then terminates the channels from start to end. When all channels are closed, the program is terminated. A more efficient way is to provide a direct shortcut channel (a control channel) from the generator to the last element. This way, the termination request from the last element can be sent directly to the generator, and the resulting terminations only need to propagate from start to end.

Once these MPI processes are defined, it is time to compose them together to make them interact (see Listing C.10). In any prime sieve, a single generator and a single last element are required. The sieve should, however, be resizable and parameterized by a size variable. This can be done by specifying a multiplicity for the number of sieve elements. To hide this information from outside the sieve, we make an inner composition of the sieve elements so that, in the big overview, we see one generator, a sieve section, and a last element. In order to allow these to communicate, channels need to be created. In order for the sieve section to communicate with the outside, intermediate ports are created as "holes" for channels to attach through without exposing the details of the internal communication (see Figure C.3).

Figure C.3: A description of the architecture would indicate a variable-sized sieve element section with 1 generator and 1 last element.

Listing C.10: prime.cpa
structure primesieve (numsieve) {
PrimeStart = inports <> outports <next, control> ()
PrimeElement = inports <prev> outports <next> ()
PrimeEnd = inports <prev, kill> outports <> ()
R = [0.. numsieve-2]
||- PrimeSieveChain = ( PrimeElement[numsieve] )
/ ( for i in R { next[i] -> prev[i+1]},
receive -> prev[0],
next[numsieve-1] -> send)
@ (receive, send)
|| PrimeSieve = ( PrimeStart || PrimeSieveChain || PrimeEnd )
/ ( PrimeStart.next -> PrimeSieveChain.receive,
PrimeSieveChain.send -> PrimeEnd.prev,
PrimeStart.control -> PrimeEnd.kill)
}

The number of primes is parameterized by a variable which is placed inside the CPAV specification (see Listing C.11).

Listing C.11: prime.cpav
initialize primesieve {
numprimes = 5000
numsieve = numprimes - 2
}

Then the placement of the processes is specified. There is a lot of flexibility here. For example, it is possible to place the generator on the first core and the last element on the last core (see Listing C.12).

Listing C.12: prime.container
containers {
containerspec primeSieveStart = < PrimeStart [1], primeSieve >
containerspec primeSieve = < PrimeElement [*] >
containerspec primeSieveEnd = < primeSieve, PrimeEnd [1] >
boxes ( primeSieveStart [1] || primeSieve [6] || primeSieveEnd [1] )
}

It is also possible to run everything on a single core (see Listing C.13). This is good for checking for errors, as everything is run concurrently instead of in parallel.
Listing C.13: prime.container
containers {
containerspec primespecial = < PrimeStart [1], PrimeElement [*], PrimeEnd [1] >
boxes ( primespecial [1] )
}

It is also possible to let the system decide completely and only say that there are 8 cores (see Listing C.14).

Listing C.14: prime.container
containers {
containerspec primespecial = < PrimeStart [*], PrimeElement [*], PrimeEnd [*] >
boxes ( primespecial [8] )
}

Or even limit the nfg on the earlier part of the program (see Listing C.15). This has proven to make a big performance impact on the prime sieve problem when the number of primes is large.

Listing C.15: prime.container
containers {
containerspec primeSieveStart = < PrimeStart [1], primeSieve > / 10
containerspec primeSieve = < PrimeElement [*] >
containerspec primeSieveEnd = < primeSieve, PrimeEnd [1] >
boxes ( primeSieveStart [1] || primeSieve [6] || primeSieveEnd [1] )
}

Finally, the functions that need to be attached to each MPI process are assigned in the function specification (see Listing C.16). Glue code is assigned depending on the value of the GENV variable MAPPING.

Listing C.16: prime.functions
functionmapping {
PrimeStart = { "primeSieve.h", "generator" }
PrimeElement = { "primeSieve.h", "sieveElement" }
PrimeEnd = { "primeSieve.h", "lastElement" }
}
glue {
gluemapping {
genvcondition = { "MAPPING", "sequential" }
mapinit = { "mapping-custom.h", "init_seq" }
}
gluemapping {
genvcondition = { "MAPPING", "strides" }
mapinit = { "mapping-custom.h", "init_strides" }
}
gluemapping {
genvcondition = { "MAPPING", "random" }
mapinit = { "mapping-custom.h", "init_random" }
}
gluemapping {
maincleanup = { "mapping-custom.h", "my_free_params" }
permutecommunicators = { "mapping-custom.h", "custom_comm_permute" }
}
}
genv {
GROUP = 25
SEED = 6
CUTS = 20
}

The channels can be used in two ways: one is through the information-based option, the other is through flow-controlled channels (see Listing C.17).

Listing C.17: primesieve.c Section showing the usage of communication channels for the sieve element.
int sieveElement(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    CMCMPI_Init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Channel information
    CMCMPI_Chan_primeelement_t* chan_info;
    CMCMPI_Get_chan_primeelement(rank, &chan_info);

    // Opening the channels
    channel_open_recv(&(chan_info->prev), sizeof(int));
    channel_open_send(&(chan_info->next));

    uint32_t num, num2, myprime = 0;
    while ( TRUE )
    {
        if (channel_recv(&num, 1, MPI_INT, &(chan_info->prev)) == CHANNEL_CLOSED)
        {
            // No more incoming data, close outbound channel
            channel_close(&(chan_info->next));
            break;
        }
        if ( myprime == 0 )
        {
            myprime = num;
            printf("%u, ", myprime);
        }
        else if ( num % myprime )
        {
            /* Not divisible by this prime */
            channel_send(&num, 1, MPI_INT, &(chan_info->next));
        }
        else { ; }
    }
    CMCMPI_Finalize();
    MPI_Finalize();
    return 0;
}

C.4 Farm Example

The farm example involves a source and a sink, while workers perform the tasks. We will discuss the three architecture compositions possible for a farm.

A single farm, where jobs go from source to workers to sink, is shown in Figure C.4 as A) and is equivalent to that specified in Listing C.18 as FarmAppA.

A more complex example involves chaining the farms together such that the output of one travels to the next. This is shown in Figure C.4 as B) and Listing C.18 as FarmAppB.
This type of farm might be used to create a pipeline-like situation where processed data is reprocessed by another method.

A third example involves splitting up tasks among workers, which then further split up the tasks among more workers. Effectively, farms are nested together. This is shown in Figure C.4 as C) and Listing C.18 as FarmAppC. This type of problem resembles a multiple-level decomposition problem.

Figure C.4: Three patterns of composition. A farm can be by itself A), chained B), and nested C).

Listing C.18: farm.cpa
structure farm (numworkers, nestnum) {
Source = inports <input> outports <sendjobs> ()
Worker = inports <jobs> outports <result> ()
Sink = inports <getresults> outports <output> ()
R = [0.. numworkers-1]
K = [0.. nestnum-1]
||- Workers = ( Worker[numworkers] )
/ ( for i in R { alljobs -> jobs[i]},
for i in R { result[i] -> allresults})
@ (alljobs, allresults)
||- FarmP = ( Source || Workers || Sink )
/ (allinput -> Source.input,
Source.sendjobs -> alljobs,
allresults -> getresults,
Source.sendjobs -> getresults,
output -> alloutput)
@ (allinput, alloutput)
||- FarmP2 = ( Source || FarmP[nestnum] || Sink )
/ (for i in K { Source.sendjobs -> FarmP.allinput[i] },
for i in K { FarmP.alloutput[i] -> Sink.getresults })
|| FarmAppA = ( FarmP )
|| FarmAppB = ( FarmP [2] )
/ ( FarmP.alloutput [0] -> FarmP.allinput [1])
|| FarmAppC = ( FarmP2 )
}

Appendix D
Code sketch - Communicator Permutation

Listing D.1 shows how the permutation callback function is called and how the permutation is propagated.

Listing D.1: Permutation of communicators
MPI_Comm original_communicator = communicators[max_index - 1];
MPI_Comm new_communicator;
permute_comm_function(original_communicator, &new_communicator);  /* callback function */
communicators[max_index - 1] = new_communicator;
MPI_Comm prev_old_comm = original_communicator;
for (level = lowest_level - 1; level >= 0; --level)
{
    MPI_Comm prev_comm = communicators[level + 1];
    MPI_Comm current_comm = communicators[level];
    MPI_Comm new_comm;
    int newcommrank, prevnewrank;
    MPI_Comm_rank(prev_comm, &prevnewrank);
    MPIX_Comm_translate_ranks(prev_old_comm, 1, &prevnewrank, current_comm, &newcommrank);
    MPI_Comm_split(current_comm, 0, newcommrank, &new_comm);
    communicators[level] = new_comm;
    MPI_Comm_free(&prev_old_comm);
    prev_old_comm = current_comm;
}
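As a companion to this sketch, the following is a minimal example of the kind of permutation callback a user might supply through the glue specification (see Section 4.5.1). The reversal policy is arbitrary and chosen only for illustration; the last parameter is named newcomm instead of new purely for clarity.

#include <mpi.h>

/* Example comm_permute glue callback: builds a new communicator whose
 * rank order is the reverse of the original. The policy is illustrative. */
void comm_permute(MPI_Comm original, const char* name, MPI_Comm* newcomm)
{
    int rank, size;
    (void)name;  /* the composition name could be used to select a policy */

    MPI_Comm_rank(original, &rank);
    MPI_Comm_size(original, &size);

    /* The key argument determines the rank ordering in the new communicator. */
    MPI_Comm_split(original, 0, size - 1 - rank, newcomm);
}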