UBC Theses and Dissertations

Scalability of communicators in MPI Mir Taheri, Seyed M. 2011


Full Text

Scalability of Communicators in MPI

by Seyed M. Mir Taheri
B.T. Information Technology, Universiti Teknologi PETRONAS, 2008

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia (Vancouver)
March 2011
© Seyed M. Mir Taheri 2011

Abstract

This thesis offers a novel framework for representing groups and communicators in Message Passing Interface (MPI) middleware. MPI is a widely used paradigm in a cluster environment that supports communication between the nodes. In our framework, we have implemented and evaluated scalable techniques for groups and communicators in MPI. We have tested this framework using FG-MPI, a fine-grain version of MPI that scales to millions of MPI processes.

Groups in MPI are the primary means for creating communicators. A group map is the underlying structure that stores the processes participating in the communication. We introduce a framework for concise representations of the group map. This framework is based on the observation that a map can be decomposed into a set and a permutation. This decomposition allows us to use a compact set representation for the cases where a specific mapping is not required (i.e., lists in monotonically increasing order). In other cases, the representation adds a permutation as well. A variety of set compression techniques have also been used. Furthermore, the framework is open to the integration of new representations. One advantage of such decomposition is the ability to implicitly represent a set with set representations such as Binary Decision Diagrams (BDDs). BDDs and similar representations are well suited to the types of operations used in the construction of communicators. In addition to set representations for unordered maps, we incorporated the Wavelet Tree on Runs library, a third-party, state-of-the-art succinct data structure designed to represent permutations.
As a final addition, we have also included general compression techniques in the framework such as BWT (i.e., bzip2). This last step allows some degree of compression in memory-constrained environments where there is no discernible pattern in the group structure.

We have investigated time and space trade-offs among the representations to develop strategies available to the framework. The strategies tune the framework based on the user's requirements on time and space. The first strategy optimizes the framework to be fast and is called the time strategy. The second strategy optimizes the framework with regard to space and is called the space strategy. The final strategy is a hybrid of both and tries to strike a reasonable trade-off between time and space. We call this the hybrid strategy. These strategies let the framework accommodate a wider range of applications and users.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Chapter Breakdown
2 Background
  2.1 Communicator Creation
  2.2 Map Representation
  2.3 Groups
3 Framework
  3.1 Framework Requirements
  3.2 Framework's API
  3.3 Framework's Architecture
  3.4 Framework Implementation
  3.5 Ordered vs. Unordered Maps
  3.6 Map as a Set
  3.7 Implementation of Groups
  3.8 Strategies
    3.8.1 The Space Strategy
    3.8.2 The Time Strategy
    3.8.3 The Hybrid Strategy
4 Discussion of Compression Techniques
  4.1 Set Representations
    4.1.1 Range and Stride Representations
    4.1.2 Binary Decision Diagrams
      An Example of BDD Find
      Million Node Matrix
      Million Node Cube
    4.1.3 Bitmap Representation
    4.1.4 Delta (or Gap) Encoding
  4.2 Permutation Representations
    4.2.1 Wavelet Tree Representation
      Million Node Matrix Transpose
  4.3 List Representations
    4.3.1 Optimized Canonical Representation
    4.3.2 Burrows-Wheeler Transformation Based Compression
      Million Nodes Random Graph
5 Experiments on Single Machine
  5.1 Experiment's Environment
  5.2 Experiment Results
    5.2.1 Class A Experiments
    5.2.2 Class B Experiments
    5.2.3 Class C Experiments
    5.2.4 Class D Experiments
    5.2.5 Class E Experiments
  5.3 Discussion
6 Experiments in FG-MPI
  6.1 Test Environment
  6.2 Experiments
  6.3 Required Time to Create Communicator
  6.4 Required Time for Point-to-Point Communication
  6.5 Required Time for Collective Gather Operation
  6.6 Saving Space
7 Conclusion
Bibliography

List of Tables

2.1 Communicator functions in MPI
3.1 Framework user level API
3.2 Framework structure
3.3 BDD operations that can be used to implement MPI group routines
6.1 Description of the communicators used in applications
6.2 Required Time to Create Communicator
6.3 Required Time for Point-to-Point Communication
6.4 Required Time for Collective Gather Operation

List of Figures

3.1 Set and Permutation Framework: an AND-OR diagram of the representations used for storing the mapping from group rank to world rank. Rectangles with rounded edges are properties and the boxes are the data structures. The branches are OR unless specifically marked as AND. (An older version of this figure appeared in [21].)
4.1 The Binary Decision Diagram for the communicators with only even (right) and odd (left) members
4.2 Finding a rank in BDD paths
4.3 The Binary Decision Diagram representing a row and a column in a matrix architecture
4.4 (a) A 2^20-node 3D mesh with a 2D sub-mesh defined by varying dimensions X and Y, for a fixed value of Z. (b) The BDD representation of the sub-mesh
4.5 Bitmap lookup algorithm in the presence of the popcnt assembly instruction
4.6 Comparison of the cost of storing a permutation using WTR with that of the current implementation. The horizontal axis represents the number of rows
4.7 Comparison of the cost of storing a random graph communicator using the BWT-compressed array-based implementation, the current implementation, and the information-theoretic lower bound. The horizontal axis represents the number of communicators
5.1 Experiment class A, Sequential lookup times
5.2 Experiment class A, Random lookup times
5.3 Experiment class A, Space function based on lookup times
5.4 Experiment class A, Strategies lookup times
5.5 Experiment class B, Sequential lookup times
5.6 Experiment class B, Random lookup times
5.7 Experiment class B, Space function based on lookup times
5.8 Experiment class B, Strategies lookup times
5.9 Experiment class C, Sequential lookup times
5.10 Experiment class C, Random lookup times
5.11 Experiment class C, Space function based on lookup times
5.12 Experiment class C, Strategies lookup times
5.13 Experiment class D, Sequential (top) and random (bottom) lookup times
5.14 Experiment class D, Space function based on lookup times
5.15 Experiment class D, Strategies lookup times
5.16 Experiment class E, Sequential lookup times
5.17 Experiment class E, Random lookup times
5.18 Experiment class E, Space function based on lookup times
5.19 Experiment class E, Strategies lookup times
6.1 Required Time to Create Communicator
6.2 Required Time for Point-to-Point Communication
6.3 Required Time for Collective Gather Operation

Acknowledgements

This dissertation would not have been possible without the assistance of several individuals who lent their expertise and their guidance to the preparation and completion of this study. First and foremost, I would like to thank my supervisor, Professor Alan Wagner, whose ongoing encouragement from the initial proposal to the final draft of this thesis enabled me to develop an understanding of the subject matter. I would like to express my sincere gratitude to Professor Will Evans for his very useful advice on theoretical matters. I especially want to thank Dr. Kirsten C. Uszkalo for her guidance and help. Her perpetual energy and enthusiasm in research have motivated all her advisees, including me. I would like to thank Jeremy Barbay and Carlos Bedregal for generously sharing the Wavelet Tree on Runs library, and Humaira Kamal for her ongoing support and willingness to share FG-MPI. Finally, as more debts are accrued in a project like this than can be listed, I would also like to thank the many others whose support ensured this project could reach fruition.

Chapter 1

Introduction

With its high capacity, ease of portability, and quick performance, Message Passing Interface (MPI) is the dominant model for parallel programming in high performance computing. MPI is a standardized specification for a communication library based on message passing. It provides functionalities such as communication channels between local and remote processes, virtual topology, and synchronization. Communication in MPI takes the form of point-to-point communication and collective communication. While point-to-point communication specifies the communication between a specific sender and a receiver, collective communication specifies communication among a group of processes. An MPI message consists of an array of items such as integers, characters, etc.
These data types, known as MPI data types, are predefined in MPI. Arbitrary data types can also be added to MPI via serialization. MPI is supported by a group of vendors, research laboratories, and universities including IBM, Intel, TMC, Argonne National Lab, UC Santa Barbara, and Syracuse University [22]. Open source implementations of MPI as well as vendor-supplied implementations exist. Some of the widely used open source implementations of MPI include MPICH [29] and OpenMPI [28]; some of the vendor-supplied implementations of MPI include HP-MPI [16], IBM's MPI [18], and SGI's MPI [34]. As MPI is a language-independent specification, Fortran, C, and C++ application programming interfaces (APIs) are available. Furthermore, any language that can connect to these languages through an interface (such as Java, Python, or C#) can also take advantage of the MPI library.

There exists a wide range of libraries to support different MPI applications. For instance, some of the numerical software libraries written for MPI include PETSc (for solving nonlinear equations), ScaLAPACK (for solving linear systems of equations), PLAPACK (for linear algebra calculations), and DOUG (for solving differential equations). Other examples of MPI applications include MIT Photonic-Bands (MPB) (for computing photonic band structures), QCDMPI (for simulating quantum chromodynamics), CTSim (for simulating tomography), and MPI-PovRAY (a ray tracing application). Furthermore, commercial MPI applications such as the Parallel Engineering and Scientific Subroutine Library from IBM (Parallel ESSL) are available to users [21].

MPI acts as middleware, hiding the communication infrastructure from the user's point of view across collections of clustered computers. A computer cluster operates as a single computation unit designed to process complex problems across a network of computers linked via high-speed Local Area Network (LAN) technologies, such as InfiniBand or Ethernet.
The power of a cluster is measured by the number of cores in it (i.e., the cluster's size). The availability of a higher number of processing units, and the increase in the core-per-CPU ratio, has led to larger, more powerful clusters. There are, for instance, currently several machines on the TOP500 list [38] with more than 200,000 cores. Processors with 8 cores capable of running 16 jobs simultaneously are already available for commodity servers. Intel has implemented a 48-core processor prototype and is working on the development of processors with over 100 cores [19]. This rapidity and scale of cluster growth needs to be balanced by increased scalability; where there are now clusters with hundreds of thousands of MPI processes, there will soon be machines with more than one million processors.

To keep pace with this growth, scholars need to develop new methods to address MPI's scalability, particularly the scalability of its communicators. Communicators are groups of MPI processes used to identify the source and the destination of each communication. As an integrated component of MPI, they are needed in every single communication and they define the scope of communication. MPI groups and communicators provide support for MPI libraries [26]. Each process inside a communicator is represented by a rank. Information about ranks is stored in a list [7]. The space required to represent a communicator in the list is linear with respect to the number of participating MPI processes in that communicator. Since each member of a communicator stores the representation individually, the overall space required to represent communicators is O(n²) with respect to the size of the communicator (n). Because of this O(n²) space requirement, the current implementation simply cannot scale up for large clusters. Up to now, communicators have been one of the sources of the scalability challenges associated with the MPI middleware in very large clusters.
When running MPI middleware on a super-cluster one might note processing space bottlenecks, a problem which may be caused by insufficient memory space allocated to the middleware's processing requirements. BlueGene, for example, is one of the largest clusters available, with hundreds of thousands of nodes. Each of its nodes has only 512 Megabytes of memory available to it [7], and nearly half this space goes to the operating system kernel; the MPI middleware is left with a mere 250 Megabytes of space. As such, the memory footprint must be reduced to optimize the performance of MPI middleware programs.

This thesis contributes to the question of how to improve MPI scalability. It focuses on techniques for reducing communicators' demand for space in large clusters, designed and deployed under Professor Alan Wagner's supervision at the Network Security and Systems lab at the University of British Columbia (2009/2010). While such reductions may cause an increase in the time consumption of MPI communicators, the proposed framework addresses both time and space consumption of communicators by offering different time/space trade-offs. The framework outlined herein is also compatible with existing MPI implementations and capable of recognizing that different programs and installations may have unique time/space demands. As such, it features built-in compression and representation techniques and remains open to the implementation of additional strategies. Moreover, it is sufficiently quick to avoid undermining overall communication performance. Future implementations of this approach promise additional compression techniques to support a wider range of applications, hierarchical representation of communicators to further save on space, and estimation techniques to reduce the creation time of communicators, to accommodate communicator scalability in rapidly growing super-cluster environments.
1.1 Chapter Breakdown

Following this introduction, this thesis breaks down into chapters designed to summarize the theory behind, the creation of, and the implementation of the communicator with time/space trade-offs. Chapter 2 provides a detailed summary of the implementation of communicators and the underlying map in MPICH. It explores how MPI provides functions to create and delete communicators and to retrieve information about their properties, and looks at how maps are currently implemented as an array of end-point objects. It also explores the implications of Chaarawi and Gabriel's [7] proposed representation of the map, which decouples the communicator from the group manipulation functions. The chapter then concludes with a brief introduction to groups.

Chapter 3 describes the proposed framework structure, its API, and the strategies available to it. It suggests that a key element of the proposed framework is the decomposition of the communicator into a set and a permutation. Inside every communicator there is a list of the MPI process ranks that are the members of that particular communicator (the map). The map contains a list of discrete numbers ranging from 0 to the number of all MPI processes in the system (the size of MPI_COMM_WORLD1) minus one. The map can represent both ordered and unordered communicators. Typically, however, communicators are ordered. In such cases, space is saved by decomposing the map into a set and a permutation. The chapter outlines three strategies (the time, the space, and the hybrid strategy) which are applied on top of the representation layer in the framework to provide additional flexibility and to optimize the framework in time, in space, or in both. The time strategy optimizes the lookup time and increases the speed of a running MPI application.
The space strategy reduces the space required to represent the communicator, shrinking the working space of a running MPI application, usually at the cost of a higher lookup time. The hybrid strategy, the adopted default strategy, makes a trade-off between time and space concerns by simultaneously optimizing both dimensions to increase speed within a low working space.

Chapter 4 provides an account of the implementation of the compression and representation techniques. It explores how compression techniques that anticipate likely patterns, in both the set and the permutation, were implemented and integrated to maximize map compression. Existing compression techniques, such as BWT compression and Gap encoding, as well as succinct data structures such as Bitmaps, were also explored. A number of compression innovations, such as implicit representation of the set (based on Binary Decision Diagrams, BDDs) and the use of the popcnt assembly instruction to implement Bitmaps, are offered. The framework is flexible due to the availability of different representations; it is open to accommodating additional techniques.

Chapter 5 evaluates the framework and presents the space and time results of creating communicators with different sizes and properties outside MPI. It presents a preliminary evaluation of the performance of this framework, looking at how experiments were run to compare the performance of the representations and the corresponding time/space trade-offs.

Chapter 6 further explores how the system was evaluated by running applications inside MPI that, without the proposed framework, are simply infeasible. To further investigate the overhead of the framework in an active MPI application, the framework was integrated into an MPICH2 implementation called Fine-Grain MPI or FG-MPI 2.
It would be impossible to understand the 1A communicator that consists of all MPI processes in a system with default ordering is called MPI COMM WORLD and is predefined in all MPI applications by the middleware 2FG-MPI is developed by Humaira Kamal under supervision of Alan Wagner. FG-MPI decouples the notion of MPI process from that of the operating system. FG-MPI can execute thousands of full-fledged MPI processes on a single core; here “process”, FG-MPI is defined as a process called a “proclet” while the term process refers to either an operating system’s process or an MPICH process. 4 full limitations of a proposed framework outside of MPI and testing the framework inside an MPI is important proof that there is no hidden overhead associated with the framework. The use of FG-MPI is advantageous because it makes it possible to run MPI programs with hundreds and thousands of processes without requiring the corresponding number of cores. Although FG-MPI does not require running the application on a machine with the corresponding number of cores, the underlying mechanisms are the same. This thesis will conclude by summarizing the scholarly benefits of introducing a framework to create time/space trade-offs in MPI communicators. While there are a number of compression techniques included in the framework, along with three different strategies, the framework is light and flexible (it allows further compression techniques and strategies, with minimal changes required to the code). This flexibility renders this contribution a foundational aspect of future solutions to the problem of MPI scalability. In addition, it appears that most of MPI programs do not exploit the MPI feature that allows the library to define its own communicator. This absence of communication-based MPI programming might change in the future as MPI systems scale. The lower overhead achieved by using the proposed framework could encourage such change and lead to more efficient applications. 
∗∗∗

This project illustrates the ease with which this framework can be integrated into MPI to reduce the memory footprint. Designed to hide the internal details of the communicator representation from MPI, it can be integrated into different implementations of MPI, such as MPICH and OpenMPI, saving considerable memory without unduly increasing the processing time needed across large clusters.

Chapter 2

Background

Communicators are an essential aspect of Message Passing Interface (MPI). The communicator structure, MPID_Comm, is an opaque object1 that every communication routine in MPI takes as a parameter. Communicators support the development of higher-level libraries [12] and act as the primary tool for exchanging messages between processes. Communicators divide communication into disjoint contexts: a message sent in one context can only be received by a routine with a communicator in the same context. Furthermore, libraries can create their own communication contexts by creating new communicators, allowing the running program to interleave calls to different libraries safely: a message sent in one context cannot be mistakenly received by an unintended routine in another context.

This chapter reviews the creation of communicators, the map structure inside communicators, and groups in MPI, to suggest the benefit of considering the proposed framework and its inherent time/space trade-off in representing the map structure inside communicators.

2.1 Communicator Creation

MPI supports functionalities such as creation, duplication, release, naming, retrieval, and comparison of communicators. Table 2.1 briefly describes the most frequently used MPI communicator manipulation functions [10]. One can create communicators using groups and MPI_Comm_split. Groups in MPI define and manipulate the list of processes used to create a communicator. Once a communicator is created, the user may safely release the involved group.
Whereas group operations are local, communicator operations are global; it is possible to create a group locally that does not need to communicate with other processes. Creating a similar communicator, however, requires communicating the underlying structure between the participating processes. Section 2.3 briefly explains groups in MPI. One may also use MPI_Comm_split to create communicators without the intermediate group structure, minimizing space consumption. This is a better choice for small systems, while persevering with MPI_Comm_split benefits larger systems.

1 An opaque object is an object that is not accessible directly by the user. Users can interact with such an object only through a handle; they are unable to determine the size and other properties of the object.

Table 2.1: Communicator functions in MPI.

MPI_Comm_create (comm, group, newcomm): Creates a new communicator from a group
MPI_Comm_dup (comm, newcomm): Duplicates an existing communicator
MPI_Comm_split (comm, color, key, newcomm): Partitions comm into disjoint subgroups
MPI_Comm_group (comm, group): Returns in group a handle to the group of comm
MPI_Comm_free (comm): Marks a communicator for deallocation
MPI_Comm_compare (comm1, comm2, result): Compares two communicators
MPI_Comm_get_name (comm, comm_name, resultlen): Returns the last name associated with the communicator
MPI_Comm_rank (comm, rank): Returns in rank the rank of the calling process in comm
MPI_Comm_remote_group (comm, group): Returns in group the remote group corresponding to comm
MPI_Comm_remote_size (comm, size): Returns the number of processes in the remote group of comm
MPI_Comm_set_name (comm, comm_name): Associates a name with a communicator
MPI_Comm_size (comm, size): Returns in size the number of processes in comm
MPI_Comm_test_inter (comm, flag): Determines if a communicator is an inter-communicator or an intra-communicator
The following section offers elaborations on communicators and groups and on representations of the communicator's map structure.

2.2 Map Representation

To scale MPI on large clusters one must remove duplication by sharing, and reduce the memory consumed by communicators using compression. FG-MPI follows the sharing step and reduces duplication effectively. It does not, however, address the reduction of the memory consumed by the map structure inside communicators. A storage framework can reduce this memory consumption. Such a framework uses an efficient and compact representation, depending on specific properties of the underlying map.2

Popular MPI libraries such as MPICH and OpenMPI store the map as an array of integers. MPICH2, for instance, stores the information about the destinations in a communicator as an array of pointers. These pointers map group ranks to a communication end-point object called an MPID_VCR. Another array stores the mapping between the local rank and the world rank in a group. This information is stored as an array of objects called MPID_Group_pmap. The communication end-point object contains information used by Nemesis3 to encapsulate an MPI message to be delivered to local or remote processes.

With an array-based implementation, a map of size O(N) requires O(N²) space in total asymptotically. Given that MPI programs may contain communicators of various sizes, the amount of space consumed by maps can become excessive. As one attempts to scale up the number of MPI processes to one million, planning an efficient use of space becomes a challenge. One possible solution can be found in Chaarawi and Gabriel's [7] investigation of a general representation of the map structure. They propose that groups be stored based on their differences from the super group they are derived from. This allows new groups to be efficiently stored as a sequence of updates to their members. Although the same compression technique could be beneficially applied to represent communicators, it would be beneficial only in cases that do not have a large number of differences.

2 The framework can be integrated into FG-MPI as an external library for representing the map. Reduction of the map's memory consumption is precisely the focus of this thesis.
3 Nemesis is one of the MPI communication channels designed to support scalability. It takes advantage of recent technologies such as shared-memory and multi-network systems [6]. Nemesis is particularly optimized for running an MPI application on multi-core CPUs.
Although the same compression technique can be beneficially applied to represent communicators, it would be beneficial only in cases that do not have a large number 2The framework can be integrated into FG-MPI as an external library for representing the map. Reduction of the map memory consumption is precisely the focus of this thesis 3Nemesis is one of the MPI communication channels designed to support scalability. It takes advantage of recent technologies such as shared-memory and multi-network systems [6]. Nemesis is particularly optimized when running an MPI application on multi-core CPUs. 8 of differentiations. It is also not evident from [7] that the approach scales to large clusters. Using an implicit representation provides an alternative method of representing the map structure in MPI. Implicit representations require a constant (and preferably small) memory size per map; explicit representations consume space linearly and normally with a larger constant factor in respect to the size of the map. Due to their characteristics, an implicit representa- tion of maps tends to reduce the memory overhead of a MPI application. None of the current group operations store the map in an implicit form. The non-scalable characteristics of explicit representation render groups unattractive for large clusters. Nothing in MPI specifications im- poses the necessity of an explicit representation. Hence Lusk et. al [2] suggest that introducing implicit representation of groups is an important step forwards. Although this thesis focuses on representing communicators, rather than representing groups, both groups and communicators use the map structure; all the techniques introduced here can represent both. 2.3 Groups MPI provides an opaque structure for groups. This structure offers routines to create groups and operations to add and remove group members. The purpose of groups is to provide input to MPI Comm create, which is a method of creating communicators. 
Processes can extract the underlying group from a communicator with MPI_Comm_group and then use a sequence of group operations to construct a new group. This new group in turn can be used to create a new communicator with MPI_Comm_create. One could use colors and keys with MPI_Comm_split as well to create a communicator, but MPI_Comm_create has certain advantages. One such advantage arises from the fact that groups are defined locally. As such, group information does not need to be distributed globally. This advantage, however, comes at the cost of a number of duplications as the group structure is recreated.

There are three issues associated with group scalability. First, unnecessarily large objects are created because group operators do not manage group objects by modifying their inputs and instead return a new object with each call. Second, array parameters (in operations such as MPI_Group_incl and MPI_Group_excl) can grow large [2], and duplicate operations performed by each of the target group members create system-wide memory usage spikes that continue until the communicator is created and the group objects are freed. Finally, the order of the parameters passed to the operations can result in an unordered group from two ordered groups. Theoretically, where ordering matters (e.g. unordered collections or maps) the structure requires more storage than in the cases where ordering does not matter (i.e. sets). These three issues relate to the MPI standard and not its implementation.4 Group management operations are on the list of items to be reviewed by the MPI Forum for MPI 3.0.

∗∗∗

This thesis focuses on reducing the size of the map structure in communicators while keeping the communication overhead low. Chapter 3 further explains the framework architecture and the different implicit and explicit techniques it may use to compress and represent the map structure.

Footnote 4: The MPI standard requires communicators to preserve the ordering.
In the current standard the ordering matters in point-to-point communication.

Chapter 3 Framework

Users have different requirements with respect to consumption of time and space, requirements defined by the system infrastructure and by the level of communication-intensiveness or space-intensiveness of an application. The open framework proposed in this chapter allows for representation of the map structure inside communicators and for the software to accommodate new compression techniques where they might be advantageous.1 To accomplish this, the framework chooses a compression technique to represent the map based on its properties and according to the strategy specified by the user (i.e. the Space, Time or Hybrid strategy). Its flexibility allows compact representation of the map (the level of achieved compression and the amount of lookup time may vary across cases). The framework invokes each representation and monitors the outcome. Based on the outcome, and the strategy chosen by the user, the framework then chooses a representation, which can also be a composition of a set and a permutation or a general representation.

This chapter will review the requirements for the framework to suggest the benefits of the proposed architecture. It will then review the framework API, its architecture, and the implementation of some of its components, such as the strategies layer.

3.1 Framework Requirements

The framework is implemented as an independent module that hides the internals of the various representations of the map from the MPI middleware. To perform the required operations on the map, MPI needs the minimum functionalities of SELECT, RANK, INSERT, and DELETE; any library that supports these operations may replace the map structure inside MPI.

While MPI offers a wide range of functionalities on groups and communicators, all these operations manipulate the map. The map structure inside MPICH, an array called lrank_to_lpid, maps the group rank to the world rank.2 The world rank constructs an array of MPID_VCR objects. Each MPID_VCR object represents the routing information for a rank in the communicator, and is later used for communication between the processes.3 Other implementations of the map can follow the same procedure and construct an array of MPID_VCR objects. As a result, the array-based implementation of the map is not necessarily a part of the MPI middleware.

Any compact representation library that replaces the map must also support fast lookup and fast iteration. Fast lookup ensures the performance of MPI_Send is not affected by the choice of library, a critical factor since every sent message invokes a lookup operation. Fast iterators are also necessary because the majority of group operations rely on iteration over the map. These additional properties further ensure that replacing the array-based implementation of the map with the framework does not weaken the overall performance of MPI.

Footnote 1: Although few MPI programs currently use communicators, new frameworks must remain open to communicator employment. Such openness, in addition to the variety of techniques available to the framework, enables it to make appropriate trade-offs.
3.2 Framework's API

Table 3.1: Framework user-level API.

MPI Invocation Method     | Category            | Function          | Brief Description
--------------------------|---------------------|-------------------|------------------------------------------
MPI Communicator Split    | Creation/Release    | CMCreate          | Creates and initializes a map object
                          |                     | CMWorldCreate     | Creates a special group for world
                          |                     | CMKill            | Destroys and releases the resources of a map
MPI Communicator Creation | Insertion/Iteration | CMInsert          | Inserts a new member into a map
                          |                     | CMBlockInsert     | Makes a map from an array
                          |                     | CMRangeInsert     | Inserts a range into a map
                          |                     | CMInteratorCreate | Creates the iterator
                          |                     | CMInteratorFree   | Releases the iterator object
                          |                     | CMInteratorNext   | Returns the next rank
                          |                     | CMInteratorRewind | Resets the iterator
MPI Send/Receive          | SELECT/RANK         | CMFind            | Performs a lookup on a map
                          |                     | CMBlockFind       | Performs all lookups on a map
                          |                     | CMRangeFind       | Performs multiple lookups on a map
                          |                     | CMFind            | Finds the position of a rank in the map
Installation Configuration| Strategy selection  | CMSetStrategy     | Sets the strategy of the framework

The framework provides two kinds of APIs, namely the MPI-Framework API and the Framework-Compression Scheme API. MPI-Framework functions are used by the MPI middleware to create and destroy the map and to perform operations such as SELECT and RANK on the map. Framework-Compression Scheme functions, on the other hand, are used by the framework to operate on the compressed map structure. The MPI library has access only to MPI-Framework functions and is oblivious to Framework-Compression Scheme functions.

MPI-Framework functions (described in Table 3.1) are divided into the categories of Creation/Release, Insertion/Iteration, SELECT/RANK and Strategy selection functions. In addition to the minimum MPI library requirements, other functionalities such as block insert are also available. These additional functionalities are useful in optimizing the middleware's interaction with the framework.

Footnote 2: The world rank of a process is its rank in MPI_COMM_WORLD.
Footnote 3: In contrast to other implementations, FG-MPI does not duplicate MPID_VCR objects. It instead shares them [20].
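As an illustration of the user-level API, here is a hedged sketch of a few functions from Table 3.1 over a plain list; the backing store stands in for a real compressed representation, and only a subset of the calls is shown.

```python
# Hedged sketch of the Table 3.1 user-level API over an uncompressed
# backing store. A real module would keep a compressed structure behind
# the same calls.

class CMap:
    def __init__(self):
        self.ranks = []          # stand-in for a compressed representation

def CMCreate():
    return CMap()                # Creation/Release: new, empty map

def CMInsert(m, world_rank):
    m.ranks.append(world_rank)   # Insertion: add one member

def CMBlockInsert(m, array):
    m.ranks.extend(array)        # Insertion: build a map from an array

def CMFind(m, group_rank):
    return m.ranks[group_rank]   # SELECT: group rank -> world rank

def CMKill(m):
    m.ranks = None               # Release the map's resources

m = CMCreate()
CMBlockInsert(m, [3, 1, 4, 5])
```

Because the middleware sees only this interface, the representation behind `CMap` can be swapped without touching MPI itself.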
Block insert, for instance, helps during the procedure of creating a group. Block insert obviates the necessity of the framework having to insert members one by one and to decompress and compress the map at each step.

Internally, the framework uses Framework-Compression Scheme functions to interact with the map. This layer includes a series of encoders and decoders. Each encoder/decoder represents one compression technique. The framework follows the strategy specified by the user to decide which representation of the map would be best. Once the representation is chosen, the respective encoder function stores the map in a compressed format: the corresponding decoder function restores the decompressed map, and the lookup function of the intended module performs the lookup operation. There is one encoder, one decoder, and one lookup function per representation; any new representation must support these functionalities.

3.3 Framework's Architecture

As described previously, a map can be represented either as a set and a permutation or as a list. Techniques such as Ranges/Strides, Binary Decision Diagrams, Gap Encoding, and Bitmaps are used to represent sets. Wavelet Tree on Runs is used to represent the permutation. Finally, Arrays, Optimized Arrays and BWT-compressed arrays are the techniques used to represent maps in the list format, i.e. without dissecting the map into a set and a permutation block. Figure 3.1 shows the main components in the framework.

[Figure 3.1 diagram: a Composite Communicator Map (Group Rank to World Rank) is either an Ordered Map (Array, Optimized Array, or Compressed via the Burrows-Wheeler Transform) or an Unordered Set AND a Permutation (Wavelet Tree); the set may be a Bitmap, Gap Encoding, or an implicit representation (Binary Decision Diagrams or Range/Stride).]

Figure 3.1: Set and Permutation Framework: an AND-OR diagram of the representations used for storing the mapping from group rank to world rank. Rectangles with rounded edges are properties and the boxes are the data structures.
The branches are OR unless specifically marked as AND. (An older version of this figure appeared in [21].)

3.4 Framework Implementation

One of the guiding principles in the structural design of our framework has been to ensure its extensibility and its openness. As a result, integrating new modules into this framework is a straightforward process. The framework is made of four classes of functions (see Table 3.2). Adding a new module to the framework requires adding a few functions to the Representation class. These functions must be able to perform creation, destruction, modification, and lookup operations on the map. Once a new module is added to the representation layer, all other layers can use it and no further change is necessary in the rest of the framework.

Table 3.2: Framework structure.

Function Type   | Description
----------------|--------------------------------------------------------------------------
Framework       | API used by the MPI middleware to use the framework
Strategy        | High-level functions to choose the module
Representations | Functions that create, remove, and modify a group and perform lookup on it using a module
Internal        | Internal functions that the other classes use

3.5 Ordered vs. Unordered Maps

The framework distinguishes between ordered and unordered maps. For a process i, let gid_A(i) denote the map rank of i with respect to the communicator A, and let wid(i) be the world rank of i. For the communicator A, the default ordering, or the ordered map, is the map where for all i and j in A, gid_A(i) < gid_A(j) if and only if wid(i) < wid(j). In the literature, the default map is called a set or an ordered set. Maps that do not satisfy this property are called unordered sets (also known as a list or a dictionary) [3]. In MPI the monotonically increasing order is the default ordering, for in many applications it is sufficient to specify the map without ordering the mapping. In fact, there are interesting trade-offs arising from the categorization of maps into those with and without the default ordering.
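The ordered/unordered distinction, and the decomposition of an unordered map into an ordered set plus a permutation, can be sketched as follows; this is an illustration of the idea, not the framework's implementation, and the helper names are invented.

```python
# A map uses the default ordering iff its group-rank-to-world-rank list
# is strictly increasing; otherwise it decomposes into an ordered set
# plus a permutation pi, with wid(g) = SELECT(set, pi[g]).

def is_ordered(wids):
    return all(a < b for a, b in zip(wids, wids[1:]))

def decompose(wids):
    ordered = sorted(wids)
    pi = [ordered.index(w) for w in wids]   # permutation over positions
    return ordered, pi

def find(ordered, pi, gid):
    return ordered[pi[gid]]                 # SELECT(pi(gid))

assert is_ordered([2, 5, 9])                # bare set representation suffices
S, pi = decompose([9, 2, 5])                # unordered: needs set + permutation
```

In the framework the sorted list would be one of the compact set representations and `pi` would be stored by Wavelet Tree on Runs.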
When the mapping is a set, then given gid(i) we need to return the gid(i)th smallest wid(i) in the set. In sets, this operation is called SELECT and its inverse is called RANK. There are various succinct data structures in the literature that, at the cost of space, and within the same asymptotic bounds, achieve a small SELECT time (constant, logarithmic, etc.). A monotonically increasing map can be stored as a set with elements from the world ranks (W), where SELECT(W, gid(i)) = wid(i) selects the gid(i)th smallest element in the collection. The basic operation necessary for MPI is, for every process i in A, FIND(gid(i)), which is used to return the world rank of any process belonging to the communicator.

There are cases, however, when the user may specify a non-default ordering for the mapping. For these maps the framework uses a composite structure that represents the map as a combination of an ordered set along with a permutation (π) of the map elements, i.e., SELECT(π(gid(i))). A succinct representation for permutations (i.e. Wavelet Tree with Runs [3]) stores the permutation. This results in a composite structure comprised of a permutation along with one of the previously described set representations. If the composite structure is not sufficiently compact, one can use general representation techniques such as a Burrows-Wheeler Transform (BWT) [35] based compression scheme or the Optimized Array representation.

Unordered maps can be employed to optimize the performance of the application. Because of the underlying physical characteristics of the machine, MPI_COMM_WORLD is not always the most efficient communicator; it is sometimes necessary to create a communicator including all the processes in MPI_COMM_WORLD, ordered differently. Such a strategy can increase locality and decrease communication overhead [24; 39; 1]. Non-default mappings of processes have also been used in the implementation of libraries.
For example, in [9] an implementation of the Basic Linear Algebra Communication Subprograms (BLACS) is introduced by creating a new communicator with permuted ranks. These ranks are derived from the column-major ordering of the communicator matrix and produced by transposing the row-major ordering of the communicator matrix. The proposed solution stores a matrix of varying dimensions M and N in a succinct manner while achieving good performance. A permutation library has been integrated into this framework to support users' occasional need for unordered communicators.4

3.6 Map as a Set

As a set, it is no longer possible to define a re-mapping of the ranks in a group. However, for those cases where a re-mapping is needed it can be accomplished by the following operation:

MPI_Comm_split(comm, 0, newrank, *newcomm)

This routine produces a new communicator that efficiently maps the current rank of the process in comm to newrank by leveraging the existing set and requiring only that the permutation be added if necessary. By keeping reference counts so that both comm and newcomm can share the set, one can also avoid copying the set. This optimization of MPI_Comm_split can be supported by adding an MPI_PERMUTE constant to the MPI standard for use as the colour parameter, to signal that the new communicator is a permutation of the existing set.

The existing code that does the re-mapping with group operations needs to be modified. MPI_Comm_group now returns a set rather than a map, so code that depends on a map being returned needs to be modified. MPI_Group_translate_ranks cannot be used to translate the rank of a process in one map to that of another for unordered maps. This could be supported by adding MPI_Comm_translate_ranks and MPI_Comm_compare. The use of MPI_IDENT for MPI_Group_compare would need to be deprecated since all groups use the default ordering.

Footnote 4: Further details about choosing the proper representation given a particular strategy are provided in Section 3.8.
Routines MPI_Group_translate_ranks and MPI_Group_compare are still useful for obtaining ranks and comparing the underlying sets. In terms of a composite representation, the translation of gid_A(i) in communicator A to gid_B(i) in communicator B is

gid_B(i) = π_B^(-1)(RANK_B(SELECT_A(π_A(gid_A(i)))))

since SELECT_A(π_A(gid_A(i))) = SELECT_B(π_B(gid_B(i))), where RANK and SELECT are inverses. For a world of size N this operation requires O(log N) queries for RANK.5 Although the log N queries make translation relatively costly, the operation is local, not often required, and can be cached. Finally, bulk translations can save time by using iterators.

3.7 Implementation of Groups

The contiguous and mostly ordered nature of groups suggests using sets instead of maps. Group management routines in MPI are set-like operations, and set representations support these operations well. This was our original motivation for investigating set representations such as Binary Decision Diagrams (BDDs) and Bitmaps. The BDD package BuDDy has efficient implementations for the union, intersection, and difference of two sets. The BDD operations have the advantage that, for simple Boolean operations, the size of the result is proportional to the sizes of the inputs. The BDD operations share the underlying BDD graph structure, and as such, BuDDy can avoid the problem of returning new objects for each group operation. Table 3.3 lists the group operations with their corresponding BDD operations.

Table 3.3: BDD operations that can be used to implement MPI group routines.

MPI Group Operation | Corresponding BDD Operation
--------------------|----------------------------
Union(A,B)          | BDD_or(A,B)
Intersection(A,B)   | BDD_and(A,B)
Difference(A,B)     | BDD_and(A, BDD_not(B))

Bitmaps are also efficient set representations for group management operations. The union of two sets can be calculated by performing an OR on the two masks; intersection is merely an AND of the two masks; and difference is a combination of a NOT and an AND of the two masks. The availability of corresponding assembly instructions in a 64-bit environment lets the middleware perform such operations in a matter of microseconds for a million-process system.

Chaarawi and Gabriel [7] have investigated storing a group based on its differences from the group it is derived from. The resulting group can be efficiently stored as a sequence of updates about group members. The BuDDy package has the same capability and maintains a common graph to store a set of Boolean expressions, which achieves the same goal as [7] in compactly storing the result of a sequence of group operations. Unlike [7], we do not couple the representation of groups with that of communicators. To further compress the representation, we extract the path information from the BDD during the creation of communicators.

The approach in [7] is tailored to storing communicators with Ranges/Strides. In storing a communicator with a Range/Stride, the size of the BDD graph depends on the values of S1 and S2 (the interval and the offset of the stride). This size can grow large for some strides. In such cases, using Bitmap or Gap encoding may be a better strategy for creating representations. The stride size can be measured in linear time, independently of the group operations that created it. Once it is determined that a group is a range, it can be stored efficiently using the Range/Stride representation. We avoid coupling these representations with group operations because groups, as currently defined in the MPI standard, have problematic scalability issues.

Footnote 5: Performing the RANK operation in O(log N) queries is a rather expensive implementation. The next chapter introduces a structure that performs this operation in almost constant time.
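The bitmap group operations described above can be sketched with Python integers as bit masks; on 64-bit hardware the same idea costs one machine instruction per 64 ranks. This is an illustration, not the framework's bitmap module, and the helper names are invented.

```python
# Bitmap group operations: bit w of the mask is set iff world rank w is
# a member of the group.

def to_mask(ranks):
    m = 0
    for r in ranks:
        m |= 1 << r
    return m

def to_ranks(mask):
    return [w for w in range(mask.bit_length()) if mask >> w & 1]

A, B = to_mask([0, 1, 2]), to_mask([1, 2, 3])
union        = A | B          # MPI_Group_union
intersection = A & B          # MPI_Group_intersection
difference   = A & ~B         # MPI_Group_difference (AND with NOT)
```

Note that the bitmap results are inherently in the default ordering, which is exactly the set-like behaviour argued for in this section.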
Although group operations are all set-like operations, they do not always follow set-like principles with respect to the ordering of the underlying group. For example, given groups A = [0,1,2] and B = [1,2,3], MPI_Group_union(A,B,&C) results in C = [0,1,2,3], while MPI_Group_union(B,A,&C) results in C = [1,2,3,0]. The latter forms the union but also produces a non-default ordering of the map. The group management operations thus add a non-intuitive ordering side-effect leading to a non-default ordering of the map. For ordered sets, by contrast, more compact representations could be used. If set-like group operations are to become the standard type of operators, we can support them with BDDs and Bitmaps.

3.8 Strategies

An MPI program can be bottlenecked at the points of memory allocation, computational power, or network bandwidth. The framework does not interact with the network layer directly, and as such, it has no effect on the network bottleneck. It can, however, exert a positive or a negative influence on memory and CPU bottlenecks by making different trade-offs in terms of time and space consumption. The strategies available in this framework allow the user to affect the trade-offs chosen by the system and, hence, potentially alleviate the bottlenecks.

Different strategies are designed to address different bottlenecks. The Space strategy is useful for memory-intensive applications where space tends to be the bottleneck. The Time strategy, on the other hand, is useful for CPU-intensive applications where computational power is often the bottleneck. Finally, the Hybrid strategy is available to those users with no certain knowledge of the bottlenecks of their applications, or in cases where time and space bottlenecks concur. It ensures that the optimization of time does not significantly undermine that of space and vice versa. As a result, this strategy reduces the chance of removing one bottleneck at the cost of creating another.
The Space, Time and Hybrid strategies are further explained in Sections 3.8.1, 3.8.2 and 3.8.3, respectively.

3.8.1 The Space Strategy

The Space strategy is a straightforward strategy set by the user. It evaluates the space required to represent the underlying map using every module in the framework. It releases all allocated objects at each step to avoid unnecessary consumption of working space, and after evaluating all the modules, it chooses the smallest one and proceeds to create it.

It is not always necessary to construct all the possible representations to calculate their respective space demands. Knowing its size, the framework can calculate the space required to store the map with arrays. Knowing the smallest and the largest items in an ordered set, it can also calculate the space requirement for the Bitmap representation. Furthermore, the space required to store the map with Optimized Arrays and Delta encoding can be calculated by finding the maximum rank and the maximum gap, respectively. These parameters can be determined by traversing the map once. Such a traversal can also determine whether or not the map is a Range/Stride. The strategy layer may take advantage of this information and compute the required memory without building all the possible representations.

Modules such as the Burrows-Wheeler Transform, Binary Decision Diagrams, and Wavelet Tree on Runs pose challenges with respect to calculating space demands. The framework currently constructs the maps represented by these modules in order to find their space consumption. Therefore, the development of estimation techniques that can calculate the required representation space without constructing it is an important future contribution. Such techniques can improve the performance of the framework and reduce the time needed to construct communicators. Performing such calculations is most challenging for BDDs, as constructing the smallest BDD graph is an NP-hard problem.
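The single-traversal estimates described above can be sketched as follows; the byte counts are illustrative stand-ins, not the framework's actual accounting, and the map is assumed to be sorted.

```python
# One pass over a sorted map yields enough statistics (size, min, max,
# max gap) to price several representations without building them.

def space_estimates(wids, int_bytes=4):
    size = len(wids)
    lo, hi = min(wids), max(wids)
    max_gap = max((b - a for a, b in zip(wids, wids[1:])), default=1)
    return {
        "array":  size * int_bytes,                 # one integer per member
        "bitmap": (hi - lo + 1 + 7) // 8,           # one bit per rank in [lo, hi]
        "gap":    size * ((max_gap.bit_length() + 7) // 8),  # fixed-width gaps
    }

est = space_estimates(list(range(0, 1000, 2)))      # 500 even world ranks
```

For this dense even-rank map the bitmap estimate wins, which is the kind of decision the Space strategy can make without constructing any representation.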
It is easier to do such estimation for BWT by calculating the entropy of the map. The only downside of such an approach is that an estimation function is not defined for the other SELECT/RANK dictionaries. This would render the framework less open and less extensible.

3.8.2 The Time Strategy

Finding the right representation under the Time strategy is more challenging than under the Space strategy; the exhaustive approach used by the Space strategy might be impractical. To avoid such lengthy operations, this strategy samples 100 random lookup times from each map and calculates the estimated mean lookup time for each map. Based on the obtained information, it then chooses the best representation.

Strategies that estimate the time required to perform the lookup operation, without actually creating the structures and performing the lookups, may be beneficial. The results explored in Chapter 5 show that certain patterns exist in time consumption, patterns that can be used to derive such estimation techniques. Some of the observed patterns include:

• A Range/Stride lookup completes in a few nanoseconds due to its simple and straightforward formula.
• The Optimized Array representation is a few nanoseconds slower than the Canonical Array representation.
• Gap encoding generally outperforms Bitmap in lookup time unless the map is highly dense.
• Gap encoding and Bitmap seem to outperform BDD in time consumption unless there is a very simple and robust BDD graph that can represent the underlying map.
• BDD lookup time is closely related to its size.
• The BWT representation is not a suitable module for the Time strategy when most of the lookups are random.
• With iterative lookup, the BWT representation can compete with WTR.

One can expand this list and find a heuristic method of estimating comparative lookup times among the modules. These estimates can be used to find the best representation according to the Time strategy.
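The sampling step of the Time strategy can be sketched as below; the representation names and lookup callables are illustrative, and actual timings will vary between runs.

```python
# Time 100 random lookups against each candidate representation and keep
# the one with the smallest estimated mean lookup time.

import random
import time

def mean_lookup_time(lookup, size, samples=100):
    ranks = [random.randrange(size) for _ in range(samples)]
    start = time.perf_counter()
    for g in ranks:
        lookup(g)
    return (time.perf_counter() - start) / samples

def pick_fastest(reps, size):
    return min(reps, key=lambda name: mean_lookup_time(reps[name], size))

wids = list(range(0, 200, 2))
reps = {
    "array":  lambda g: wids[g],     # direct indexing
    "stride": lambda g: g * 2,       # closed-form Range/Stride lookup
}
best = pick_fastest(reps, len(wids))
```

Because the timings are measured rather than modelled, this approach stays open to new modules, at the cost of constructing each candidate first.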
The abovementioned potential improvements come at the expense of the framework's openness. As with the Space strategy, a downside of the heuristic approaches is that for every new compression technique added to the framework, corresponding heuristics must be added too.

3.8.3 The Hybrid Strategy

The Hybrid strategy attempts to optimize time consumption and space consumption simultaneously. In certain situations, such as when BDDs or Bitmaps are used, there exists a correlation between the space consumption and the lookup time. The Hybrid strategy cannot rely on either the space or the time dimension alone, since one cannot assume the existence of such a correlation. Instead, it evaluates time and space consumption in every case.

The Hybrid strategy finds the best representation of the map through a weighted sum of space and time consumption in logarithmic space. Based on the outcome of the algorithms explored in the Time and Space strategies, the Hybrid strategy starts by allocating scores to each module in both dimensions in logarithmic space. It then calculates the weighted sum of the two scores by introducing an Alpha factor (α). This factor can be viewed as the bias towards either time or space, and it can be adjusted by the user depending on his or her preferences.6 Once α is set, the framework uses the following formula to calculate the respective weight of each module:

W = α × T + (1 − α) × S

where T represents the logarithmic weight of the Time strategy and S represents the logarithmic weight of the Space strategy. The newly derived weights (i.e. W) are then used to choose the best representation according to this strategy.

Footnote 6: Based on experiments, 0.25 appears to be an appropriate Alpha factor for a normal application, and it is set as the default value in our proposed framework.

∗∗∗

The decomposition of the map into a set and a permutation is beneficial both in the compression aspects of the map and in the implementation aspects of the group operations in MPI.
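The Hybrid scoring can be sketched as follows; the sample time and space figures are made up for illustration, and the logarithm base is an assumption (any fixed base gives the same ranking).

```python
# Combine per-module time and space measurements in log space with
# W = alpha * T + (1 - alpha) * S, then pick the module with the
# smallest W.

import math

def hybrid_weight(time_ns, space_bytes, alpha=0.25):
    # alpha = 0.25 is the default bias, slightly favouring space
    T, S = math.log2(time_ns), math.log2(space_bytes)
    return alpha * T + (1 - alpha) * S

def pick(modules, alpha=0.25):
    return min(modules, key=lambda m: hybrid_weight(*modules[m], alpha))

modules = {                       # (mean lookup ns, bytes) -- illustrative
    "range/stride": (4, 12),
    "bitmap":       (20, 125_000),
    "array":        (8, 4_000_000),
}
```

Working in log space keeps the two scores on comparable scales, so a module that is orders of magnitude worse in one dimension cannot win on the other alone.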
Chapter 4 further explains the different representations used in the framework and the auxiliary structures integrated into them to facilitate the lookup operation.

Chapter 4 Discussion of Compression Techniques

The design of the framework could greatly benefit from the availability of published use-cases. However, in the absence of such information,1 scholars rely on general patterns that arise in MPI application usage to provide a sketch of usage patterns. The proposed framework has representations implemented that take advantage of such patterns based on the group functions and MPI libraries. Following on the previous chapter's explanation of how the open portfolio of available representations allows additional representations and compression techniques to be implemented, this chapter will cover the representations inside the framework, representations common to MPI middleware and MPI applications. It will elaborate on the implementation of the compression techniques used inside the proposed framework and the additional techniques incorporated to enhance the SELECT operation.

The framework consists of four classes of functions: the Framework class, the Strategy class, the Representation class and the Internal class. The Representation class includes modules in three categories: set representation, permutation representation, and list representation. A set representation can represent a communicator only if that communicator consists of items in a monotonically increasing order. A permutation must be added to a set representation if the map is permuted (i.e. it is not in monotonically increasing order). Alternatively, one may use a list representation that stores both the set and the permutation. Details about each representation are given below:

• Set Representation: This category includes the Range/Stride representation (Section 4.1.1), Binary Decision Diagrams (Section 4.1.2), Bitmaps (Section 4.1.3) and Gap encoding (Section 4.1.4).
• Permutation Representation: This category includes Wavelet Tree with Runs (Section 4.2.1).
• List Representation: This category includes Optimized Arrays (Section 4.3.1) and the Burrows-Wheeler Transform (Section 4.3.2).

Communicator storage would benefit from being hierarchical. Currently, the MPICH2 and FG-MPI implementations store a communicator relative to the outermost communicator, MPI_COMM_WORLD. In a hierarchical representation of communicators, if communicator A is derived from MPI_COMM_WORLD, and communicator B is derived from communicator A, communicator B will not have any references to MPI_COMM_WORLD. In such a case, routing information can be obtained recursively from communicator B up to MPI_COMM_WORLD. In this approach the intermediate communicators tend to be smaller than MPI_COMM_WORLD. A smaller super-set means a smaller amount of memory is required to represent the subset. Hierarchical storage, therefore, requires less space than the current approach does. However, it requires an additional mechanism to ensure that freeing a communicator will not break its child communicators, and it may increase access time.

4.1 Set Representations

4.1.1 Range and Stride Representations

A simple structure for maps can be constructed through a range or a stride. A stride refers to a collection of monotonically increasing or decreasing ranks with a uniform distance between consecutive items. A range is a stride with a distance of 1 between consecutive items. A Range/Stride representation can arise from the use of MPI_Comm_create where the group parameter is the result of certain group management operations such as MPI_Group_range_incl and MPI_Group_range_excl.

Footnote 1: This is currently the case, except for simple Range/Stride structures, on which case studies have been published [7].
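A stride as just described can be captured by a start, a member count, and a step, matching the (S2, size, S1) formulation given below; the following hedged sketch shows constant-space SELECT and membership testing with invented helper names.

```python
# Range/Stride representation: constant space, and SELECT is a single
# multiply-add.

def stride_select(S2, size, S1, gid):
    assert 0 <= gid < size
    return gid * S1 + S2            # wid(i) = gid_A(i) * S1 + S2

def stride_member(S2, size, S1, wid):
    q, r = divmod(wid - S2, S1)
    return r == 0 and 0 <= q < size

# The even world ranks {4, 6, 8, ..., 22} as the triple (4, 10, 2):
assert stride_select(4, 10, 2, 3) == 10
```

A range is simply the special case S1 = 1.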
A simple range or stride can be represented by the following function:

wid(i) = gid_A(i) × S1 + S2

The function maps gid_A(i), ranging from 0 to the communicator size minus one, to the associated wid(i) value. In the formula, S1 is the stride and S2 is the offset from zero. The formulation can be conveniently denoted by the triple

(S2, size, S1)

where S2 is the smallest world rank in the communicator, size is the number of members in the communicator, and S1 is the gap between consecutive ranks. In general, the range representation is the ideal representation for storing a communicator, since it requires constant space and constant time to perform SELECT.

4.1.2 Binary Decision Diagrams

As mentioned, for many explicit representations the space requirements may become too large, even when the map is stored close to its information-theoretic lower bound. Implicit representations, however, benefit from specific patterns in the data to achieve compact representations. Sets, for instance, can be represented as a Boolean function in a Binary Decision Diagram (BDD) [5]. BDDs are frequently used for practical purposes such as circuit design, and they can scale up to millions of nodes.

Given their characteristics, BDDs seem to be an appropriate representation when certain patterns emerge in the data (or map) due to the underlying structure of group algorithms. Commonly used programming patterns, such as matrices or meshes, can be represented by BDDs efficiently. Similarly, Cartesian topologies, such as grids or meshes, are useful in parallel algorithms. Such algorithms can often be simplified by the use of collective communication among the appropriate subsets of nodes. For example, in a 2D mesh it is common to create a communicator for each row and each column. In this case, assuming a standard row- or column-major ordering of the nodes, each communicator is a range, which can be represented in constant time and constant space.
Other sub-meshes of the mesh in 2D (as well as in higher dimensions) are collections of ranges and cannot be represented by a single range as defined in our framework. The BDD representation, however, may be able to take advantage of the existing patterns in order to compactly store the map information for such communicators.

To implement the BDD module we extracted a subset of the BDD routines from the BuDDy package by Jørn Lind-Nielsen [25] and integrated its data structures and operations into our framework. We then construct the map as a Boolean function that returns true (1) if the input is a member of the map and false (0) otherwise. The input used to create this function is the set of binary representations of wid(i) for all the members in the map, where each bit is a binary variable. Since in our case the map is static, once the BDD is created we extract the paths that produce a true result and discard the graph structure.

Figure 4.1 shows an example of a small BDD for a communicator representing even and odd items in MPI_COMM_WORLD.

[Figure 4.1: The Binary Decision Diagrams for the communicators with only even (right) and odd (left) members.]

The representation is optimized by discarding the BDD graph structure and keeping the paths that lead to true. Paths can be represented as a sequence of nodes where each node is either 0, 1, or X (i.e., "do not care"); therefore 2 bits are required to store the state of a node. Since the length of each path is ⌈lg(N)⌉, where N is the size of the world, representing p paths requires at most 2p⌈lg(N)⌉ bits.

The algorithm for the SELECT operation finds the gid_A(i)-th smallest fully specified path in the ordered collection of paths extracted from the BDD. The algorithm proceeds iteratively from the first node in the set of all paths down to node ⌈lg(N)⌉ − 1 (i.e., the last node).
In each step, it uses the current rank (initially set to gid_A(i)) to determine whether the wid(i) lies in a path with bit 0 or bit 1.² Depending on the value of gid_A(i), we either discard the paths with a 0 or those with a 1. We then appropriately adjust the value of gid_A(i) and repeat the operation for the next node on the remaining paths. The algorithm returns the gid_A(i)-th smallest wid(i) in at most O(p × lg N) steps. The number of paths left at each step of the algorithm depends on the size of the BDD as well as on the number of Xs encountered, which can result in long SELECT times. We avoid this situation by limiting the working memory size of the BDD and the total number of paths; a BDD is created only when these limits have not been surpassed.

Procedure 1 shows the SELECT operation on a BDD structure. This operation can be efficiently computed. In a BDD structure every path to true represents one or more wid(s). The total number of members in a path is equal to 2^|X|, where |X| represents the total number of Xs in that path. Note that wid(s) represented in different paths do not overlap (i.e., no two paths can represent the same wid). This property allows us to perform a SELECT operation by constructing the target wid bit by bit in O(|p| × lg(n)) steps, as follows.

To find the most significant digit of the target wid, calculate the total number of wids with 0 at the current position among the remaining paths (all of the paths are considered remaining when the algorithm starts). The total number of Xs (i.e., |X|) in every line is computable

² Do-not-care nodes (i.e., nodes with value X) are split half and half between 0 and 1.
Procedure 1 BDD_Find(path[], size, aux[], gid)
 1: Find the biggest i where aux[i] < gid
 2: first_set_bit ← i
 3: for each path_i do
 4:   if there is a set bit before first_set_bit then
 5:     remove path_i from the search space
 6:   end if
 7: end for
 8: for each path_i do
 9:   in path_i, set all X bits before first_set_bit to 0
10:   X_i ← total number of X bits in path_i
11: end for
12: for (j = 0; j < first_set_bit; j++) do
13:   S ← 0
14:   for each path_i do
15:     S ← S + the total number of wids with the j-th bit set to 0 in path_i
16:   end for
17:   if gid ≤ S then
18:     remove all path_i with the j-th bit set to 1 from the search space
19:     in all path_i with the j-th bit set to X, set the j-th bit to 0
20:   else
21:     remove all path_i with the j-th bit set to 0
22:     in all path_i, if the j-th bit = X, set the j-th bit to 1
23:     gid ← gid − S
24:   end if
25: end for
26: return the remaining path_i

in O(|p| × lg(n)) steps. Once |X| for every line is calculated, the total number of wids with 0 at the current position can be calculated in O(|p|) steps, for one has to go through every path and check the corresponding bit (Procedure 1, lines 14-16). During this bit-by-bit evaluation, if the encountered bit is 0, add 2^|X| to the total number of items with a 0 at this position; if it is an X, add 2^(|X|-1) to this total. In both cases |X| represents the total number of remaining Xs in the path under evaluation. If the gid of the target wid does not exceed this total, conclude that the current bit is a 0 in the target wid; otherwise it is 1. If the bit is recognized as 0, flag all paths that assign 1 to this bit; such paths must be put aside in the remaining steps (Procedure 1, line 18). If the bit is 1, however, flag all the paths that assign 0 to this bit (Procedure 1, line 21), and decrease gid by the total number of wids thus removed, i.e., those in paths with a 0 and half of those in paths with an X at this bit position (Procedure 1, line 23).
Every time an X is encountered in a path, reduce |X| by one in that particular path. Continue this algorithm until the last bit of the target wid is determined; once this bit is determined, the algorithm concludes.

To further enhance the algorithm, we store an auxiliary structure at creation time. This auxiliary structure contains an array of integers with lg(n) members. Every element A_i, where 0 ≤ i < lg(n), contains the total number of wids that are less than 2^i. Such a structure allows us to find the most significant set bit of the target wid in O(lg lg(n)) steps. Knowing the position of the first 1 in the wid, we can prune all paths with set bit(s) occurring before this position and continue with the abovementioned algorithm.

An Example of BDD_Find

For example, consider finding the 13th smallest item in the following set:

{125, 109, 93, 77, 61, 57, 53, 49, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16}

To keep this example simple, the auxiliary structure has been omitted and the algorithm proceeds without it. For illustration purposes, the algorithm does its computations on an extra row, rather than on the original BDD paths as in Procedure 1. The abovementioned set can be represented with the following BDD paths:

[0, 0, 1, 0, X, X, 0, 1]
[0, 0, 0, 1, X, X, X, X]
[0, 1, X, X, 1, 1, 0, 1]

Figure 4.2 shows the steps taken by BDD_Find to find the 13th smallest item in the set. Figure 4.2.(a) represents the BDD paths before starting the algorithm: the top three rows represent the BDD paths, and the bottom row represents the currently computed rank of the 13th smallest item in the set. Figure 4.2.(b) shows the first step taken by BDD_Find. The first bit of each path is examined; since they are all 0s, the first bit is set to 0 in the computed rank. In the next step (Figure 4.2.(c)), BDD_Find is to decide whether the second bit will be 0 or 1. There are two Xs in the first row, four Xs in the second row, and two Xs in the third row.
Thus there are 20 wids (2^2 + 2^4) with the second bit set to 0, and 4 wids (2^2) with this bit set to 1. Since BDD_Find is looking for the 13th smallest item, it sets the second bit of the computed rank to 0 and discards the third row. In Figure 4.2.(d), BDD_Find is to decide the value of the third bit. In this step there are 4 wids represented by the first row with the third bit set to 1, and 16 wids represented by the second row with the third bit set to 0. Again the targeted item (i.e., item 13) is less than the total number of wids with the third bit set to 0, so BDD_Find sets the third bit to 0 and discards the first row. Since the only remaining path, i.e., the second path, has this bit set to 1, finding the value of the fourth bit is not necessary (Figure 4.2.(e)). The rest of the bits in the second path are set to X, so the remaining bits of the 13th smallest item can simply be computed as the binary representation of 13 − 1, or [1,1,0,0] (Figure 4.2.(f)). Finding the value of the last bit concludes the BDD_Find operation, and [0,0,0,1,1,1,0,0], or 28, is returned as the 13th smallest rank in the set.

[Figure 4.2: Finding a rank in BDD paths.]

In the next section, some examples of MPI structures using BDDs will be presented.

Million Node Matrix

This section provides two examples of a matrix architecture with a world of size over a million. (Figure 4.3 illustrates a 1,048,576-node matrix architecture with the ranks in MPI_COMM_WORLD shown on each of the nodes.) The upper BDD represents all the members of the 1023rd row in the matrix; the lower BDD represents all the members of the second column. Both examples show that every row and every column can be represented with 10 nodes in a BDD, and that all the rows and all the columns in a matrix architecture with a million nodes can be represented with just 2,048 paths, or 5,120 Bytes. The same topology in MPICH consumes 2 × 1024 × 1024 × sizeof(int) bytes, or 8 MBytes on a machine with 4-byte integers. This suggests that BDDs can save almost a factor of 1,638 in space consumption compared to current implementations of communicators in MPICH.

Million Node Cube

Figure 4.4 shows an example of a range of ranges represented by a BDD. Here, the BDD represents a 2D sub-mesh of a 3D mesh with over one million nodes. The BDD shown in Figure 4.4.(b) is all that is required to store the sub-mesh. In this example, X[j] denotes the j-th bit in the binary representation of wid(i). The 2D sub-mesh is defined as a range of ranges and, as Figure 4.4.(b) shows, the size of the resulting BDD is relatively small. Only 5 bytes are needed to store the paths for this sub-mesh of size 2^13 derived from a mesh of size 2^20. In other words, such a sub-mesh in a world of size about 1,000,000 can be represented in five bytes. This concise representation is significant because it corresponds to 2^7 × 2^6 nodes.

4.1.3 Bitmap Representation

Bitmaps, structures widely used to store sets, are particularly effective tools for the representation of dense groups. The Bitmap representation uses a bit-array to store the map of ordered groups: we set the i-th bit to 1 whenever world rank i is in the group and to 0 otherwise. The number of bits required by the bit-array is equal to the size of MPI_COMM_WORLD; this size is independent of the size of the group. One simple heuristic approach is to store the bit-array starting from the first set bit up to the last set bit, so that wid(largest) − wid(smallest) + 1 bits are needed to store the array. A disadvantage of using Bitmap is that the SELECT operation consumes time linear in the size of the outer communicator (i.e., MPI_COMM_WORLD in our case), albeit with a small constant factor.
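A minimal sketch of the bit-array representation and its linear-scan SELECT illustrates this cost; the class and names are illustrative, not the thesis code:

```python
class Bitmap:
    """Illustrative bit-array set representation: bit i is set iff
    world rank i is a member of the group."""

    def __init__(self, world_size, members):
        self.world_size = world_size
        self.bits = bytearray((world_size + 7) // 8)
        for wid in members:
            self.bits[wid >> 3] |= 1 << (wid & 7)

    def select(self, gid):
        """Return the gid-th (0-based) set bit by a linear scan,
        i.e. time proportional to the world size, not the group size."""
        seen = 0
        for wid in range(self.world_size):
            if (self.bits[wid >> 3] >> (wid & 7)) & 1:
                if seen == gid:
                    return wid
                seen += 1
        raise IndexError(gid)

bm = Bitmap(32, [3, 4, 10, 31])
```

Note that the scan length is governed by the world size, which motivates the auxiliary prefix-sum structure described next.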
To alleviate this situation, we store the cumulative number of set bits in blocks of 1,000 items in an auxiliary structure. A binary search is then performed on the auxiliary structure to determine the block containing the gid(i)-th set bit. Within a block, a linear search is used to find the targeted bit. This auxiliary structure leads to (asymptotically) logarithmic lookup times. The overhead of such a structure is ⌈(last − first + 1)/1,000⌉ integers, where first and last are the positions of the first and last set bits: a small amount in comparison to the size of the Bitmap.

[Figure 4.3: The Binary Decision Diagrams representing a row and a column in a matrix architecture.]

[Figure 4.4: (a) A 2^20-node 3D mesh with a 2D sub-mesh defined by varying dimensions X and Y, for a fixed value of Z. (b) The BDD representation of the sub-mesh.]

Whenever possible, we use the popcnt assembly instruction to count the number of set bits in a 32- or 64-bit register.³ As Figure 4.5 demonstrates, in the presence of the popcnt assembly instruction the Bitmap lookup algorithm first finds the right block of size 64 bits. It then narrows the search down to 32, 16, and 8 bits, each reduction taking place with one assembly call, and finally finds the right rank in a lookup table. With this technique we are able to perform the SELECT operation in less than 80 nanoseconds on an Intel® Core™ i7 2.67 GHz workstation.

[Figure 4.5: The Bitmap lookup algorithm in the presence of the popcnt assembly instruction.]

There is an extensive literature on compression techniques for Bitmaps. In addition, there are a variety of Bitmap representations that may be used to further improve lookup performance [36]. The framework can be extended to include other such representation schemes.

³ The popcnt instruction counts the number of set bits in a 32- or 64-bit register. It is part of Intel's Streaming SIMD Extensions versions SSE4.2 and SSE4a implemented in the i7, and is also part of AMD's Advanced Bit Manipulation instructions in the "Barcelona" processor.

4.1.4 Delta (or Gap) Encoding

The representation of sparse sets and matrices has been one of the active areas of research in Computer Science [4; 32; 37; 11]. While Bitmaps are nearly ideal structures for storing dense sets, one of the most effective tools for dealing with sparse sets is Delta Encoding. Delta Encoding, or Gap Encoding,⁴ relies on storing the gaps between sequential items. In sparse sets, the largest gap tends to be smaller than the largest item in the set; hence storing a set using gaps tends to save space compared to storing the individual items in an array.

There are two ways to store gaps. One can store all the gaps in the same number of bits, in which case the required number of bits per member is ⌈lg₂ max(gap)⌉. This scheme saves a significant amount of space when storing uniformly distributed bitmaps.⁵ Non-uniformly distributed bitmaps, on the other hand, tend to have large gaps, and ⌈lg₂ max(gap)⌉ tends to get close to ⌈lg₂ max(member)⌉. This renders Gap encoding less efficient.
To address such situations, alternative methods of non-uniform gap storage can be used, in which the number of bits required to store each item is also stored: the encoding consists of two layers, where the first layer stores the number of bits per member and the second layer stores each member. This idea can be employed recursively, by storing the number of bits required to store every item of the first layer, and so on. Yet, while this structure is optimal with respect to space consumption, it can be very time-consuming for the lookup operations required in an MPI middleware. Another alternative is to use Huffman coding to store the first layer. Representations produced by Huffman coding and Gap encoding are efficient and lie close to the information-theoretic lower bound.

As appealing as it seems, storing gaps comes with certain penalties. Performing a lookup operation on a gap-encoded set is no longer a constant-time operation; instead, it is linear with respect to the size of the set. An auxiliary structure, such as the one described above for Bitmap, can be used to alleviate this problem by storing the prefix sums of the gaps. During the lookup operation, we perform a binary search to find the right block in the gap array and then look into that block in order to extract the targeted member. To further enhance the lookup operation, we store gaps uniformly with ⌈lg₂ max(gap)⌉ bits per member.

⁴ The term Delta Encoding generally refers to storing the minimum amount of information required to construct an object from an earlier version of it. Since we use gaps to construct the next member knowing the previous one, the term Gap Encoding is also used to describe the technique.

⁵ Uniformly distributed bitmaps are sets in which, regardless of position, bits are set with probability d (the bit density) [8].
This technique, in conjunction with the abovementioned auxiliary structure, brings the lookup time down to almost 40 nanoseconds using an Intel Core™ i7 2.67 GHz CPU.

4.2 Permutation Representations

4.2.1 Wavelet Tree Representation⁶

To store unordered group maps, we take the set representation and add a permutation (π). This permutation maps the default ordering of the set to the target ordering of the gid_A(i). A wide range of algorithms is available to encode permutations; one example is an algorithm based on the entropy of the permutation. Barbay and Navarro [3] have studied several techniques to compress a permutation and have come up with three strategies, namely Wavelet Tree on Runs (WTR), Stricter Runs, and Shuffled Sequences.

Wavelet Tree on Runs, founded on the reverse of the merge-sort algorithm, takes advantage of ordered sub-sequences in π. It performs π(i) and π⁻¹(i) in O(1 + lg ρ) time using n⌈lg ρ⌉(1 + o(1)) + O(lg(n)) bits, where ρ represents the number of runs, i.e., the maximal monotonically increasing subsections of the permutation [3]. The amount of storage needed by this technique depends on the group size instead of the world size. We use this succinct data structure to represent permutations. The WTR library compresses the permutation and provides fast computation of π(i) and π⁻¹(i). The algorithm performs a Hu-Tucker algorithm [17] on the runs and encodes the result in a binary tree; a further reduction in the time required to calculate π(i) and π⁻¹(i) is obtained by re-balancing the tree.

WTR is optimized to compress permutations with a small number of runs. This property is particularly interesting because few MPI group manipulation functions increase the number of runs. Since groups are derived from MPI_COMM_WORLD, which is monotonically increasing, and most of the group management functions keep the ordering of the originating groups, permutations with a small number of runs are more probable.
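The decomposition that opens this section, a set plus a permutation π, composes at SELECT time as wid(gid) = set_select(π(gid)). A sketch of this composition, with a plain sorted array standing in for the set representation and an explicit array standing in for the WTR-compressed π; all names are hypothetical, not the thesis implementation:

```python
class UnorderedMap:
    """Illustrative set-plus-permutation decomposition of a group map."""

    def __init__(self, wids):
        # The "set" part: members in their canonical (sorted) order.
        self.sorted_wids = sorted(wids)
        # The permutation pi: group rank -> position in the sorted set.
        position = {w: i for i, w in enumerate(self.sorted_wids)}
        self.pi = [position[w] for w in wids]

    def select(self, gid):
        """SELECT composes the two layers: wid(gid) = set_select(pi(gid))."""
        return self.sorted_wids[self.pi[gid]]

# An unordered map over world ranks {10, 20, 30, 40, 50, 60}:
m = UnorderedMap([10, 40, 20, 50, 30, 60])
```

In the framework, the sorted array would be any of the set representations of Section 4.1 and π would be stored by WTR; this sketch only shows how the two pieces fit together.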
The Stricter Runs and Shuffled Sequences techniques support more sophisticated structures and, by virtue of this, have a larger potential memory overhead. As there is likely less need for sophisticated structures (and the memory cost inherent in them), WTR is a more appropriate choice for a typical MPI application. Finally, it is important to note that before creating the WTR representation we first calculate the number of runs and only use WTR when the number of runs is less than a preset limit. To avoid an explosion in the working memory, the algorithm does not proceed if the number of runs exceeds this threshold.

⁶ The permutation library deployed in this section has been developed by Carlos Bedregal under the supervision of Jérémy Barbay and Gonzalo Navarro. The code is used here with the permission of the owners.

Million Node Matrix Transpose

In Section 4.1.2 we considered constructing a matrix architecture with a million nodes. MPI libraries such as PSPASES use matrix manipulation libraries such as BLAS (Basic Linear Algebra Subroutines). Among other things, BLAS stores matrix transposes [9; 23]. In this section we illustrate storing the transpose of several matrices in a communicator, with varying row and column sizes.

Figure 4.6 shows the storage cost of a permuted MPI_COMM_WORLD in a matrix architecture transformed from row-major to column-major order, and compares the storage size with that of the current implementation. As Figure 4.6 shows, the size of the structure increases slowly, meaning that it can support a large number of rows at a small memory cost.

4.3 List Representations

4.3.1 Optimized Canonical Representation

The Array or Canonical representation of the map in MPICH2 consumes sizeof(int) bytes per member. Depending on the architecture, an int might be 32 or 64 bits.
Although this representation can support over a billion nodes, the largest supercomputers currently available are smaller than a million nodes and theoretically require fewer than 20 bits per member. Hence dedicating 32 bits per member is not an efficient use of memory at this point. The Optimized Arrays or Optimized Canonical representation requires ⌈lg₂ n⌉ bits per member, where n is the size of the world. This representation has a constant lookup time, with a small constant factor, generally very close to that of the (non-optimized) Canonical representation. Since it stores items explicitly, storing the permutation in a separate structure is not necessary.

The Optimized Canonical representation's lookup time is competitive with array lookup time. The only extra overhead in this technique is that of shifting and masking the extra bits, which can be completed in a few nanoseconds (3 to 4 nanoseconds using an Intel Core™ i7 2.67 GHz CPU). This representation tends to perform well in storing randomly permuted sets. Finally, while the Optimized Canonical representation saves space on a machine with fewer than 2 billion nodes, theoretically there is no limit imposed on the number of members.

[Figure 4.6: The cost of storing the permutation using WTR, compared with the current implementation. The horizontal axis represents the number of rows.]

4.3.2 Burrows-Wheeler Transformation Based Compression

It is often difficult to find a discernible pattern in a randomly, or semi-randomly, permuted map. The final alternative for reducing the amount of storage in such situations is to compress the array of wid(i)s in the map.
Our framework uses BZ2_bzCompress and BZ2_bzDecompress from the bzip2 library, which use the Burrows-Wheeler transform (BWT) [35] and Huffman coding [33]. Even though BWT was originally designed to compress text, certain bit patterns in numerical data (such as ranges and sequences) increase the chance of recurring characters. On such numerical data sets, therefore, BWT-based compression techniques perform reasonably well.

The block-compression nature of BWT lets us perform the SELECT operation efficiently. Blocks of 1,000 wid(i)s are compressed, and the SELECT operation is implemented by simply decompressing the ⌈gid_A(i)/1,000⌉-th block. As a general rule of thumb, increasing the block size potentially improves the compression ratio [27]. It does, however, adversely affect the performance of an MPI application by increasing the lookup time. A block size of 1,000 was chosen because the bzip2 library frequently uses a buffer of size 5K during the course of compression and decompression; a block larger than this amount adversely affects performance and may force the library to break the block into two or more pieces.

The SELECT operation takes the same amount of time regardless of the communicator size, the world size, and the location of the item. The operation is, however, relatively slow, because an entire block needs to be decompressed. As an optimization, we cache the last decompressed block, which significantly reduces the time needed for selecting the same wid_A(i) again. This feature is also useful when one iterates over all the members of a map, a situation that occurs frequently in collective communications and group operations.

Further enhancements to this compression technique are possible by using recent improvements in the computation of the BWT developed by Okanohara and Sadakane [15; 14; 13]. Notable among such improvements is a proposed approach with linear time complexity [31] that seems to be particularly promising.
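The block-compressed SELECT with last-block caching described above can be sketched as follows, using Python's bz2 module (the same bzip2 BWT-plus-Huffman codec) in place of the C-level BZ2_bzCompress calls; the class and names are illustrative, not the thesis implementation:

```python
import bz2
import struct

BLOCK = 1000  # wids per compressed block, as in the thesis

class CompressedList:
    """Illustrative BWT-based list representation: wids are packed into
    fixed-size blocks, each compressed independently with bzip2."""

    def __init__(self, wids):
        self.blocks = [
            bz2.compress(struct.pack(f"<{len(chunk)}I", *chunk))
            for chunk in (wids[i:i + BLOCK]
                          for i in range(0, len(wids), BLOCK))
        ]
        self._cached = (None, None)  # (block index, decoded tuple)

    def select(self, gid):
        """Decompress only the block holding gid; cache it so that
        repeated or iterative lookups avoid redundant decompression."""
        b, off = divmod(gid, BLOCK)
        idx, decoded = self._cached
        if idx != b:
            raw = bz2.decompress(self.blocks[b])
            decoded = struct.unpack(f"<{len(raw) // 4}I", raw)
            self._cached = (b, decoded)
        return decoded[off]

cl = CompressedList(list(range(0, 5000, 2)))  # 2,500 even wids
```

Iterating over the map touches each block once, so the cache turns 1,000 consecutive SELECTs into a single decompression.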
Million Nodes Random Graph

Random graphs are useful tools in simulating networks [30]. In this section we consider the case of storing a random graph with MPI, and we measure the space required to store such a graph using the BWT representation. We represent an Erdős-Rényi graph, G(2^20, 0.005), in which the edges associated with a node are ordered based on their Cartesian length. Such a graph consists of 2^20 vertices, and every possible edge occurs with probability 0.005. To measure the length of the edges, every node is also associated with a random Cartesian coordinate in a two-dimensional space. MPI applications simulating social networks or recommendation systems tend to follow such a paradigm.

Figure 4.7 shows the cumulative cost of storing 5,000 such communicators using the BWT compression scheme and compares it with the current implementation and the information-theoretic lower bound. As Figure 4.7 shows, the BWT-based compressed representation performs better than the standard array-based representation, and it is reasonably close to the information-theoretic lower bound.

[Figure 4.7: Comparison of the cost of storing a random graph communicator using the BWT-compressed array-based implementation, the current implementation, and the information-theoretic lower bound. The horizontal axis represents the number of communicators.]

* * *

The framework is composed of a variety of compression techniques used to represent a map as a list, or as a set plus a permutation structure, enhancing its capability in dealing with different MPI applications and communicators. Further optimization techniques, as well as auxiliary structures, are integrated into the different modules to expedite the SELECT operation.
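How the framework might choose among candidate modules under the Space, Time, and Hybrid strategies evaluated in Chapter 5 can be sketched as follows; the function, cost model, and numbers are assumptions for illustration only, not the thesis implementation:

```python
def choose(candidates, strategy):
    """Pick a representation from (name, space_bytes, lookup_ns) tuples.

    'space' minimizes storage, 'time' minimizes lookup latency, and
    'hybrid' trades off the two on normalized scales (a hypothetical
    weighting, chosen here purely for illustration)."""
    if strategy == "space":
        key = lambda c: c[1]
    elif strategy == "time":
        key = lambda c: c[2]
    else:  # "hybrid"
        max_s = max(c[1] for c in candidates)
        max_t = max(c[2] for c in candidates)
        key = lambda c: c[1] / max_s + c[2] / max_t
    return min(candidates, key=key)[0]

# Hypothetical costs for one communicator (not measured values):
reps = [("canonical", 4_000_000, 3),
        ("bitmap", 131_072, 80),
        ("bwt", 20_000, 4_000)]
```

Under such a model the Time strategy favors the constant-time array forms, the Space strategy favors heavy compression, and the Hybrid strategy lands on intermediate structures, mirroring the qualitative behavior reported in Chapter 5.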
Chapter 5 evaluates these techniques in terms of time and space outside MPI.

Chapter 5

Experiments on a Single Machine

The lookup time of the framework and the relative space consumption of the operation are significant factors in the middleware's robust and economical functioning. This chapter reports on the experiments done to compare the time and space requirements of various ordered and unordered communicator maps derived from an MPI world of one million processes. Some of these experiments were designed to evaluate the different representation techniques in the framework, while the rest were designed based on general characteristics of group management operations in MPI (such as a map that represents a range of ranges). Finally, this chapter also reports, for each experiment, the effect of the different strategies available to the framework (i.e., the Space, Time, and Hybrid strategies) and compares the respective lookup times. The evaluations of the experiments done in this chapter demonstrate how the framework offers a variety of trade-offs with respect to time and space for a diverse set of experiments. They also show that, under the different strategies, the time overhead of the lookup operation is nearly negligible.

5.1 Experiment Environment

All the experiments were conducted using an Intel® Core™ i7 2.67 GHz workstation with 6 Gigabytes of memory. Furthermore, in all the experiments, a standalone implementation of the framework module in FG-MPI was used. Five classes of experiments were defined, each consisting of a thousand tests. These classes cover the frequent uses of communicators in MPI and provide analysis of all the representations integrated into the proposed framework. Each experiment was run ten times, and in each run the lookup time as well as the consumed space were measured. The graphs were then produced based on the lowest lookup time achieved, in order to minimize the influence of Linux kernel context switching.
Box plots, Scatter plots, and graphs on logarithmic space have been used to illustrate the distribution of the test results. 41 0.001 0.01 0.1 1 10 100 Canonical Optimized Canonical Delta Encoding Bitmap Optimized Bitmap BDD BWT Lo o ku p  t im e  ( M ic ro se co n d s) Figure 5.1: Experiment class A, Sequential lookup times 5.2 Experiment Results 5.2.1 Class A Experiments This class represents random members of the MPI world. The algorithm starts with choosing a size, and based on this size, it selects random members in the world, and stores them in the map. Figure 5.1 and Figure 5.2 depict the lookup times in class A experiments. They represent the random and iterative lookup times using each module in the framework. The vertical axis represents time in microseconds in the logarithmic space, and the horizontal axis illustrates the modules. Each box represents the lowest, the highest, the first, and the third quartile lookup time over all the experiments in class A. In most cases, the highest lookup time is the same as the third quartile and, therefore, the highest lookup time is not visible in the Figures. BDD has the largest variations due to the random size of the group, as 5.1 and Figure 5.2 demonstrate. Highly dense groups and highly sparse groups tend to produce a smaller BDD value. Canonical and Optimized Canonical representations show the smallest variations and the fastest lookup time. There are almost no variation in these representations since the lookup operation takes place in constant time, regardless of the underlying communicator. Bitmaps 42 0.001 0.01 0.1 1 10 100 1000 Canonical Optimized Canonical Delta Encoding Bitmap Optimized Bitmap BDD BWT Lo o ku p  t im e  ( M ic ro se co n d s) Figure 5.2: Experiment class A, Random lookup times with popcnt instruction tend to outperform Delta encoding. Generally, Delta encoding shows the most stable behavior. 
Furthermore, interesting cases emerge in BWT behavior where the minimum lookup time is considerably small. These small lookup times occur when the group size is smaller than the size of the cache. In such situations, all the members are stored in the cache, resulting in fast lookup times. Apart from these particular cases, the cache becomes less effective as the group size grows. This is especially true in random lookups.

Figure 5.3 shows the space and time relation in class A experiments. It illustrates the space consumed by each module and plots it against the corresponding lookup times. The vertical axis represents the consumed space in bytes on a logarithmic scale, and the horizontal axis represents the lookup times in microseconds, again on a logarithmic scale. In this figure, one can observe the trade-offs between time and space made by choosing different modules. Canonical, Optimized Canonical, and Delta encoding offer similar lookup times, regardless of the consumed space. In these representations, there is no direct relation between time and space. In contrast, Bitmaps show quite a different behavior. Lookup times vary between experiments with the Bitmap representation, while the consumed space does not. The space consumed by Bitmaps is a function of the world size; lookup times are a function of the density of the map. This density differs from one map to another in this class. Finally, BWT and BDDs show the greatest fluctuations since the members of the communicators are chosen randomly, and these two representations are susceptible to this randomness.

Figure 5.3: Experiment class A, Space function based on lookup times.

Figure 5.4 shows the lookup times for the different strategies chosen by the framework. The vertical axis represents the mean time required to perform the lookup operation using each strategy, in microseconds on a logarithmic scale. The horizontal axis represents the communicators in the class. With the Time strategy, most communicators are represented using the Optimized Canonical representation. With the Space strategy, however, the framework jumps between different representations, based on the characteristics of the communicators. For instance, in cases where the size of the communicator is large enough, Bitmap consumes less space than the other representations, whereas Delta encoding tends to consume less space when representing smaller communicators. The Hybrid strategy is not as stable as the Time strategy, yet it is more stable than the Space strategy. Most of the communicators in this class are represented with Delta encoding or Optimized Bitmaps under the Hybrid strategy.

Figure 5.4: Experiment class A, Strategies lookup times.

Figure 5.5: Experiment class B, Sequential lookup times

5.2.2 Class B Experiments

This class deals with ranges and strides. Each communicator is a stride.
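A stride communicator needs only its beginning, end, and interval, so both select and membership tests run in constant time. A minimal sketch, illustrative rather than the thesis's Range/Stride module:

```python
class StrideMap:
    """Illustrative range/stride group map: begin, begin+step, ..., up to end."""
    def __init__(self, begin, end, step):
        self.begin, self.step = begin, step
        self.size = (end - begin) // step + 1   # number of members

    def select(self, i):
        """World rank of communicator rank i, in O(1)."""
        return self.begin + i * self.step

    def rank_of(self, world_rank):
        """Communicator rank of a world rank, or -1 if it is not a member."""
        off = world_rank - self.begin
        if 0 <= off and off % self.step == 0 and off // self.step < self.size:
            return off // self.step
        return -1

s = StrideMap(begin=2, end=17, step=3)   # members 2, 5, 8, 11, 14, 17
```

Note that when the step is a power of two and the stride starts at zero, membership reduces to a single mask test, `(r & (step - 1)) == 0`, which is also why such strides suit the binary structure of BDDs discussed below.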
The algorithm chooses the beginning, the end, and the interval for each communicator randomly, so the resulting size of each communicator is also randomized in this series of experiments. Expectedly, the Range/Stride representation is the frontrunner in all the experiments in this class. For the purpose of evaluation, therefore, this module was deactivated in this class so that the other, more sophisticated modules can be observed in competition against each other. Figure 5.5 and Figure 5.6 show the lookup times in class B experiments, and Figure 5.7 plots the consumed space against the lookup times.

In this class, the BDD module shows the greatest variations, which can be attributed to the random strides. BDD can store strides with power-of-two intervals such as 2, 4, or 8 effectively and with a constant space requirement. BDD, however, performs dismally for other interval sizes such as 5, 83, or 135. This poor performance can be attributed to the binary nature of BDDs: each digit in the binary representation of a communicator rank is represented with one decision node in the BDD. Certain strides are compatible with this binary pattern and produce smaller graphs, while for other strides no pattern can be found in the graph. In the latter case, BDD stores each member of the communicator as a separate path in the graph. The random nature of the intervals is the root cause of this fluctuating behavior.

Figure 5.6: Experiment class B, Random lookup times

The results further show that BWT has some variations, as demonstrated in Figure 5.7. One cause of fluctuation in BWT is the size of the communicator. The use of the cache proves to be more effective for smaller communicators.
As long as the size of the communicator is smaller than that of the cache, BWT lookups become as fast as Canonical lookups. Thus it is the random size of the communicators that creates the fluctuation.

Bitmaps show fluctuations in both time and space. The space required to store a communicator using Bitmaps is a function of the smallest and the largest ranks in the communicator. However, the beginning and the end of the stride (the smallest and the largest ranks in the communicator) are chosen randomly and produce varied space demands. Conversely, Bitmap lookup times are a function of the density of the underlying map; the randomized density of the maps, caused by the random intervals in the communicators, likewise produces similar variations in lookup times.

Lookup times for the different strategies can be seen in Figure 5.8. Given the chaotic behavior of the modules in this class, both the Space and the Time strategies show turbulent behavior.

Figure 5.7: Experiment class B, Space function based on lookup times.

Figure 5.8: Experiment class B, Strategies lookup times.

The Hybrid strategy, however, follows Delta encoding in most cases.
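Delta encoding's stability is easier to see with a sketch. The following is an illustrative Python model, not the framework's implementation: each member is stored as its distance from the previous member, so the encoding depends only on the gaps between ranks, not on where the stride sits in the world.

```python
class GapMap:
    """Illustrative gap (delta) encoded group map: first rank plus successive gaps."""
    def __init__(self, members):               # members: strictly increasing world ranks
        self.first = members[0]
        self.gaps = [b - a for a, b in zip(members, members[1:])]

    def select(self, i):
        """World rank of communicator rank i: prefix sum of the first i gaps."""
        rank = self.first
        for g in self.gaps[:i]:
            rank += g
        return rank

g = GapMap([4, 5, 9, 1000])
```

A real module would store the gaps in a variable-length code and keep block-level prefix sums so that select does not have to scan from the beginning; this sketch keeps plain integers for clarity.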
Delta encoding is not significantly affected by the randomized beginnings, endings, and intervals defined in this class. As a result, the Hybrid strategy is more stable than the other two strategies.

5.2.3 Class C Experiments

This class deals with communicators consisting of ranges of ranks. In this series of experiments, the number of sequential ranges is chosen randomly; each range is then chosen randomly. The union of all chosen ranges, in monotonically increasing order, constitutes the final communicator. This particular ordering is chosen to ensure that WTR is not invoked.[1]

Figure 5.9 and Figure 5.10 show the lookup times in class C experiments. BDD fails to represent the communicator maps in this class due to the large number of paths required to store each communicator. In Bitmaps, small variations in gaps lead to small variations in lookup times. In this class, apart from BDD and Bitmaps, the framework may choose any of the other representations because they do not show a large variation.

[1] We avoid WTR in this class of experiments because a specific class, i.e. class E, has been devised to evaluate this module.

Figure 5.9: Experiment class C, Sequential lookup times

Figure 5.11 shows the space and time relation in class C experiments. The average size of each communicator in this class is 500,000. Due to the large size of the communicators, most representations consume a large amount of space. Bitmap is an exception; the space required to represent a communicator with Bitmap is a function of the size of the world. Since the communicator maps are fairly dense in this class of experiments, Bitmap outperforms the other modules with regard to space consumption. Finally, Figure 5.12 shows the lookup times based on the different strategies.
The stability of the modules in this class creates stable lookup times in the strategy layer. The Time strategy chooses the Optimized Canonical representation in most cases. The Space strategy chooses the Bitmap representation, and the Hybrid strategy often uses Delta encoding.

Figure 5.10: Experiment class C, Random lookup times

5.2.4 Class D Experiments

This class deals with random faces of a hypercube, a typical topology in parallel computing architectures. For this series of experiments, random faces of a hypercube, in all dimensions, are chosen, and the union of all the faces in monotonically increasing order constitutes the communicator. Mathematically, one can also produce such maps in a less intuitive fashion by using unions of ranges of ranks and strides.

Figure 5.13 shows the lookup times. As the results show, BDD fails to represent the communicators in this class because the BDD graphs get too large. Bitmap encoding shows a small variation in lookup times. This variation arises from the different densities of the chosen communicator maps. This effect, however, is relatively normalized since ranges and strides are mingled together. Such a union does not exist in class C experiments and, therefore, the Bitmap results vary more in that class.

Figure 5.14 illustrates the space and time relation in class D experiments. The sizes of the communicators vary greatly in this class because a random number of faces is chosen in each dimension. The only module that demonstrates a different behavior here is Bitmap. The size of this representation is a function of the smallest and the largest ranks in the communicator. The smallest rank in this class tends to be close to the smallest rank in the world. Similarly, the largest rank tends to be close to the largest rank in the world.
In other words, the range of each communicator is that of the whole world. The size of the Bitmap representation therefore remains the same throughout the different experiments in this class.

Figure 5.11: Experiment class C, Space function based on lookup times.

Figure 5.12: Experiment class C, Strategies lookup times.

Figure 5.15 shows the lookup times for the different strategies. The Time strategy chooses the Optimized Canonical representation most of the time. The Space strategy often chooses Bitmap because it outperforms the other modules in terms of space demand. The Hybrid strategy uses Delta encoding since it offers a good trade-off between time and space.

5.2.5 Class E Experiments

This class deals with unordered ranges-of-ranks communicators. The departure from the default ordering in this class allows us to evaluate the WTR module. Similar to class C experiments, the number of sequential ranges is chosen randomly, as are the beginning and the end of each range. Furthermore, the ranges are ordered in monotonically decreasing order and their union constitutes the communicator map. As a result, the number of runs in each communicator map is equal to the number of ranges.
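Class E is where the set-plus-permutation decomposition earns its keep: the ranks themselves go into an ordinary set module, and only the ordering goes into the permutation structure (WTR in the framework). A toy model of the decomposition, with a plain list standing in for WTR:

```python
def decompose(group_map):
    """Split a group map into a sorted set and a permutation over its positions."""
    ordered = sorted(group_map)
    position = {rank: i for i, rank in enumerate(ordered)}
    perm = [position[rank] for rank in group_map]
    return ordered, perm

def select(ordered, perm, i):
    """World rank of communicator rank i, recombining set and permutation."""
    return ordered[perm[i]]

# two decreasing-ordered ranges glued together, as in class E
members = [20, 21, 22, 10, 11, 12]
ordered, perm = decompose(members)
```

Here perm is [3, 4, 5, 0, 1, 2]: two ascending runs, one per range. Runs are exactly what WTR compresses, which is why the observation that the number of runs equals the number of ranges matters in this class.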
Generally, WTR is a slow module compared to the other modules in the framework; to give it a fair chance, we choose a small number of ranges and let WTR compete with the other modules.

Figure 5.13: Experiment class D, Sequential (top) and random (bottom) lookup times

Figure 5.14: Experiment class D, Space function based on lookup times.

Figure 5.15: Experiment class D, Strategies lookup times.

Figure 5.16 and Figure 5.17 depict the lookup times in class E experiments. Analogous to class C, BDD fails to represent the communicators in this class. Among the other modules, Delta encoding, Bitmap, and Optimized Bitmap have close lookup times because they all use WTR. The reason these three representations perform similarly is that the overhead of WTR hides the overhead of the set representations themselves.

Figure 5.18 shows the space and time relation in class E experiments. Most of the modules have their results clustered around one spot. The overhead of WTR is more visible in this graph.
Since the extra time consumed by the set representations is relatively negligible, Delta encoding, Bitmap, and Optimized Bitmap offer close lookup times.

Finally, Figure 5.19 shows the lookup times for the different strategies. The Time strategy uses the Optimized Canonical representation. The Hybrid and the Space strategies perform close to each other in terms of lookup times, due to the presence of the WTR module. The Space strategy is insensitive to the differences between Bitmap and Optimized Bitmap and so it uses the Bitmap representation. The Hybrid strategy, however, responds to such differences in lookup times because of popcnt and, therefore, gives Bitmap with popcnt a higher weight.

Figure 5.16: Experiment class E, Sequential lookup times

5.3 Discussion

Different representations offer different time-space trade-offs. Canonical representation, the fastest representation, consumes a generous allotment of space. Optimized Canonical representation is a few nanoseconds slower than Canonical representation but tends to save a considerable amount of space. Delta encoding is slower but often saves more space than both. Bitmap is a suitable tool for storing dense maps; the needed space and the lookup time are tied to the density of the map. BDDs can take advantage of binary patterns in the map. In cases where such patterns exist, a BDD stores the map implicitly in a small amount of space. BDDs, however, are inefficient in the absence of binary patterns. BWTs are useful for storing maps with high entropy, such as randomly permuted maps. The downside, however, is the higher lookup time.

Different strategies tend to stabilize the framework by choosing different representations.
The Time strategy tends to pick the Optimized Canonical representation regardless of factors such as map entropy and is, as such, insensitive to the map characteristics. Conversely, the Space strategy is indifferent to lookup times; it optimizes the framework according to space demands. This strategy tends to choose the Bitmap, Delta encoding and, in some cases, BDD representations. The Hybrid strategy tries to optimize both space and time. As such it tends to choose the Delta encoding, Optimized Canonical and, in some cases, Bitmap representations.

Figure 5.17: Experiment class E, Random lookup times

∗∗∗

The time overhead of performing the SELECT operation in the proposed framework is rather close to that of its counterpart, the array-based implementation. Furthermore, the strategies available to the framework make it more stable in dealing with maps with different characteristics. Chapter 6 further evaluates the performance of the framework inside MPI middleware in the context of running MPI applications.

Figure 5.18: Experiment class E, Space function based on lookup times.
Figure 5.19: Experiment class E, Strategies lookup times.

Chapter 6

Experiments in FG-MPI

Large, complicated, and memory-sensitive applications such as MPI middleware tend to be susceptible to the slightest modifications of their internal components. One of the important outcomes of running experiments inside MPI middleware is the measurement of the additional costs associated with the framework. This chapter outlines the results of investigations into the overhead of communicators of different sizes and of messaging among their members inside an MPI environment. An MPI benchmark was designed to evaluate the effects of different communicator sizes and to monitor the messaging time among the communicator members using a variety of parameters, including those of the map. A number of experiments were then conducted to evaluate the effects of these parameters on the time required for creating a communicator, for point-to-point communication, and for collective communication, the three operations that form the base of all communication operations in MPI. The first operation measured the time required to create a communicator of a given size; the second measured the time required to send 10,000 point-to-point messages between randomly chosen senders and receivers (plus the time needed for synchronization at the end of messaging); and the third measured the time required to perform ten collective gather operations (i.e. MPI_Gather()) by rank 0 in the communicator. The evaluations of the experiments discussed herein demonstrate that the framework can be successfully integrated into MPI middleware without undermining its performance.
Moreover, there are cases where the space saved by this implementation actually renders the middleware faster than the current implementation.

6.1 Test Environment

All the experiments were conducted using FG-MPI on a cluster of five workstations, each with two quad-core Intel Xeon X5550 2.67 GHz processors (eight cores per machine). These workstations have 12 GB of memory and a dual-port 10 Gb Ethernet high-speed network; the machines are also connected by 1 Gb Ethernet. All the machines run 64-bit CentOS 5.5. In these experiments the MPI_COMM_WORLD size is 200,000 proclets.[1] The proclets are evenly distributed over 40 OS processes, with 5,000 proclets in each process and 5 processes on each of the eight machines. The uniform distribution of the proclets over all machines simulates the worst-case scenario for the framework. The uniform distribution of the processes ensures that no single machine becomes a bottleneck for the whole system. Non-uniform distributions of proclets, on the other hand, create extra overhead on the machine with the highest number of MPI processes. Although this creates a middleware bottleneck, slowing down the rest of the system, it does not directly affect the framework. Hence the constant overhead of the framework is relatively lower on non-uniformly distributed systems.
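Under the uniform layout above, locating a proclet is simple arithmetic. A sketch of the block mapping from a world rank to a (machine, OS process, proclet slot) triple, using the sizes described in this section; the helper itself is illustrative and not part of FG-MPI:

```python
# Sizes taken from the test environment described above.
PROCLETS_PER_PROCESS = 5000
PROCESSES_PER_MACHINE = 5

def locate(world_rank):
    """Map an MPI_COMM_WORLD rank to (machine, os_process, proclet_slot)."""
    os_process = world_rank // PROCLETS_PER_PROCESS    # 0..39
    slot = world_rank % PROCLETS_PER_PROCESS           # proclet inside that process
    machine = os_process // PROCESSES_PER_MACHINE      # 0..7
    return machine, os_process, slot
```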
[1] The available version of FG-MPI has certain limitations when running experiments on a world of size one million. In this chapter, therefore, we have had to reduce the size of the world in order to run our experiments.

6.2 Experiments

This chapter deals with frequently used communication structures in MPI applications. We have defined 5 classes and conducted 14 sets of experiments within them; a detailed description of the communicators can be found in Table 6.1.

Table 6.1: Description of the communicators used in the applications.

Class  ID  Communicator description                                                          Size
I       1  All even numbers between and excluding -1 and world size                       100,000
        2  All y such that y = 81x + 11453, x > 0 and y <= size(world)                      2,327
        3  All items in a six-dimensional cube falling in [0,*,*,0,*,*]                    10,000
        4  Alternating ranges of size 100, between and including 0 and size(world)-1      100,000
II      5  All the multiples of 3 between and excluding -1 and size(world)                 66,667
        6  All the multiples of 5 between and excluding -1 and size(world)                 40,001
        7  All the multiples of 7 between and excluding -1 and size(world)                 28,572
III     8  Ordered 5,000 random ranks                                                       5,000
        9  Ordered 50,000 random ranks                                                     50,000
IV     10  10,000 randomly permuted ranks                                                  10,000
       11  20,000 randomly permuted ranks                                                  20,000
       12  100,000 randomly permuted ranks                                                100,000
V      13  10 sequential ranges, each of size 1,000, with inside items permuted randomly   10,000
       14  10 sequential ranges, each of size 10,000, with inside items permuted randomly 100,000

The general descriptions of the classes are:

• Class I: Represents frequently used architectural patterns such as meshes, hypercubes, etc.

• Class II: Represents frequently used numerical patterns, such as those that rely on dividing the world into equally spaced slave nodes.

• Class III: Represents random sets, a commonly occurring phenomenon in most numerical and physical simulations.
• Class IV: Represents random sets with permutation; this is a superset of the previous class and occurs frequently in particle physics simulations.

• Class V: Represents randomly permuted ranges, where a cluster of nodes tackles a problem and order is not an issue inside the cluster.

Experiments were run inside FG-MPI compiled with the proposed framework and all the representations available within it. Sections 6.3, 6.4, and 6.5 report the time needed to create communicators, for point-to-point communication, and for the collective gather operation, respectively.

6.3 Required Time to Create Communicator

This section reports the amount of time required to create communicators; Figure 6.1 and Table 6.2 show these results. The first bar (i.e. the Canonical representation bar) in each set represents the base case, the current implementation. WTR cannot represent experiment sets 12 and 14 due to the large number of runs. Therefore, BDD, BWT, and Delta encoding are not applicable in these experiments.[2] As Figure 6.1 shows, the framework's overhead for this operation is negligible in most cases, as it is only a fraction of a second. In some cases the framework can improve performance. In most cases, for instance, the Optimized Canonical representation tends to improve the middleware's performance. There are also cases in which Delta encoding improves the performance. BWT and Bitmap, however, generally affect the performance adversely.

[2] In such cases the framework returns a failure value to FG-MPI, which in turn terminates the MPI application. It must be noted that this situation happens only because the framework is forced to pick all the representations for the purpose of the experiment. In a normal situation, the framework chooses a different representation when one representation fails.
Figure 6.1: Required Time to Create Communicator

Table 6.2: Required Time to Create Communicator (seconds)

Exp ID  Canonical rep  Opt-Cano   BWT        Delta      Bitmap     BDD
1       2.967774       3.312498   3.203573   3.271372   3.184117   3.060398
2       2.010797       2.160521   1.987421   2.065695   2.201394   2.032641
3       2.316994       2.077151   2.263582   2.068744   2.274107   2.107993
4       3.561393       3.486884   3.608238   3.470796   3.399657   3.382914
5       2.668777       2.908778   2.937485   2.812214   2.628685   2.795777
6       2.681832       2.566798   2.901365   2.718617   2.764923   2.799075
7       2.280735       2.394095   2.237612   2.379437   2.255037   2.663567
8       2.017807       1.891634   1.970061   1.920574   2.08602    1.93289
9       2.5714         2.655579   2.764294   2.530936   2.878847   2.86866
10      2.600998       2.53709    3.013276   2.413477   2.565226   2.743572
11      2.730365       2.822009   3.112251   3.273176   3.270894   3.047612
12      6.866533       7.207245   6.986381   ∞          ∞          ∞
13      2.529555       2.520782   2.674146   2.802194   2.392572   2.420681
14      6.748847       6.53138    6.905702   ∞          ∞          ∞

6.4 Required Time for Point-to-Point Communication

This section reports the amount of time required for point-to-point communication. Point-to-point communication is the most important and most commonly used method of communication in MPI. Figure 6.2 and Table 6.3 show the results for this operation. Similar to Section 6.3, the first bar in each set is the base implementation and one can measure the framework's overhead by comparing this bar to the rest of the bars in that set. The framework's overhead, in most cases, is negligible; BWT represents an exception. This module behaves poorly due to the randomization of the senders and receivers. Indeed, BWT is an inherently expensive algorithm and it can compete with other modules in the framework only with the help of the cache.
Even with the cache, random lookups increase the cache misses, making BWT slow.

Classes IV and V are also significantly slower than the rest of the classes. This slowness is not a result of the framework; rather, it is caused by the communicator members being permuted. Re-mapping the world into a new world where communicating processes are closer together can reduce this overhead. Such an approach is a good example of how using communicators on top of numerical libraries can optimize the performance of the cluster. As a potential future extension to this work, one may pursue optimizing MPI libraries by re-mapping MPI_COMM_WORLD into a world with a higher probability of co-locating processes that communicate with each other.

Figure 6.2: Required Time for Point-to-Point Communication

Table 6.3: Required Time for Point-to-Point Communication (seconds)

Exp ID  Canonical rep  Opt-Cano   BWT        Delta      Bitmap     BDD
1       1.750144       2.16052    1.811021   1.900152   2.129779   1.861059
2       0.431204       0.451109   0.512796   0.335086   0.46859    0.273584
3       1.282751       1.282592   0.77538    1.270713   0.691343   0.677966
4       2.176636       1.931769   1.795922   2.105466   2.173837   1.718421
5       1.653736       2.019279   1.494842   1.632706   1.532018   1.724166
6       0.749722       0.781928   0.887244   0.800275   0.818214   0.840635
7       0.872786       1.049493   0.936803   1.144245   0.879126   0.648154
8       0.591226       0.742509   1.329447   0.455741   0.569415   0.570307
9       1.674722       1.089125   1.343414   1.514031   0.902362   0.941129
10      12.074792      11.550763  18.8246    12.080878  12.053472  12.366941
11      16.623487      15.589004  22.080314  16.033108  15.941364  16.221026
12      17.102782      20.505991  22.170412  ∞          ∞          ∞
13      1.660798       1.725999   3.663296   1.876028   1.740192   1.836309
14      11.987147      12.162322  19.32781   ∞          ∞          ∞

6.5 Required Time for Collective Gather Operation

This section reports on the amount of time required for the collective gather operation, which gathers values from every member of a communicator. Figure 6.3 and Table 6.4 illustrate the results for this operation. In contrast to the results in the previous section, most representations perform close to each other in this series of experiments; the framework's overhead is negligible compared to the rest of the system. The iterative nature of collective communication increases the cache effect in BWT, rendering BWT as good a choice as the other representations. There is, however, a bottleneck in experiment set 8. One of the reasons the cache fails to help in this case is the rather small size of the communicator (i.e. 5,000). This size is larger than the BWT buffer size, so the algorithm is invoked, but the communicator is not large enough to amortize the overhead of the algorithm, which is therefore accentuated. Apart from this exceptional case, BWT does not demonstrate a large overhead. As Figure 6.3 shows, most of the representations tend to outperform the current implementation in both time demands and space consumption. This is particularly evident for the Optimized Canonical representation. For this representation, as well as others, the results mean that integrating the framework into MPI potentially improves the performance of the middleware.
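As a reminder of what each timed iteration does: a gather collects one value from every member at the root, so its cost scales with the communicator size, which is why the 100,000-member sets dominate Table 6.4. A toy single-process model of the semantics (an illustrative helper, not MPI itself):

```python
def gather(members, value_of, root=0):
    """Toy model of MPI_Gather semantics: the root receives one value per
    member, ordered by communicator rank; other ranks keep nothing."""
    return [value_of(world_rank) for world_rank in members]

members = [0, 3, 7, 9]             # world ranks, indexed by communicator rank
collected = gather(members, lambda r: r * r)
```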
Figure 6.3: Required Time for Collective Gather Operation

Table 6.4: Required Time for Collective Gather Operation (seconds)

Exp ID  Canonical rep  Opt-Cano   BWT        Delta      Bitmap     BDD
1       17.702455      16.711845  15.857622  18.118459  17.805377  17.95578
2       0.132202       0.123322   0.142819   0.208211   0.132237   0.146981
3       0.712352       0.696635   0.658908   0.432181   0.674342   0.697357
4       17.411705      16.18811   17.278832  17.311935  18.241667  14.973475
5       10.966374      2.176044   8.851018   11.695278  9.699308   9.499338
6       5.523831       2.381484   5.367255   5.321168   6.246094   6.989264
7       3.587011       3.24371    1.498214   2.037366   3.734757   1.898694
8       0.523414       0.221634   2.873808   0.422693   0.339804   0.297434
9       8.403159       9.935812   8.30859    10.482459  6.430704   3.789709
10      2.018914       2.208728   3.675661   2.351119   2.08984    2.15712
11      0.760034       0.634392   0.864605   0.767399   0.871466   0.952565
12      21.075628      20.77579   21.317054  ∞          ∞          ∞
13      1.225378       1.188386   1.214305   1.294498   1.344456   1.276289
14      14.908722      17.962524  16.584229  ∞          ∞          ∞

6.6 Saving Space

The primary objective in designing our framework has been to save space in MPI applications. Saving space, however, may lead to extra time consumption. This chapter has demonstrated that the extra time consumption associated with the framework is negligible. To provide a comprehensive account, we list below the amount of space saved by using the framework. The numbers are measured for the Hybrid strategy, as it is the default strategy in the framework. Each line shows the total space saved across all the processes in the creation of 8 communicators.
• Experiment 1: 121.295 MBytes
• Experiment 2: 2.118 MBytes
• Experiment 3: 7.159 MBytes
• Experiment 4: 113.207 MBytes
• Experiment 5: 72.518 MBytes
• Experiment 6: 39.966 MBytes
• Experiment 7: 26.015 MBytes
• Experiment 8: 4.756 MBytes
• Experiment 9: 55.300 MBytes
• Experiment 10: 5.328 MBytes
• Experiment 11: 10.668 MBytes
• Experiment 12: 53.393 MBytes
• Experiment 13: 5.328 MBytes
• Experiment 14: 53.393 MBytes

∗∗∗

This framework saves a large amount of space (in some cases over 100 MBytes) on a world of size 200,000. This saving in space comes at a negligible cost in terms of the time required to perform the operations. Including the framework thus not only makes the middleware more efficient in terms of space, it also tends to make it a faster piece of software. This thesis has provided an outline of the success of the framework in its current state. The final chapter provides some potential ways it might be expanded and improved in the future.

Chapter 7

Conclusion

Message Passing Interface (MPI), the dominant programming paradigm in high-performance computing, is widely used on computer clusters. Whereas recent advances in hardware technology have increased the number of nodes per cluster, insufficient optimization of MPI software has encouraged researchers to consider factors that might be successfully exploited to allow efficient scalability in high-performance computing. MPI communicators, the primary communication tools in MPI, inhibit MPI scalability. The explicit representation of communicators in the current implementations of MPI does not scale: the space requirement of explicit representation grows rapidly in large clusters. This thesis offers a framework to represent communicators and spare MPI the costs of explicit representation.
The proposed framework features an open portfolio of communicator representations: each representation stores communicators succinctly while supporting fast iterative and random lookup operations, and the portfolio guarantees that new representation techniques can be integrated and that a wide range of applications can be supported.

The framework decomposes each communicator into a set and a permutation, represents and compresses each component (i.e., the set and the permutation) separately, and leaves room for further compression. A variety of set-like group operations can be efficiently supported with set representations. Alternative representations that do not decompose communicators are also available. The types of group and communicator management operations available in MPI give a higher weight to certain types of communicators; among these are maps that can be represented with Range/Stride, Binary Decision Diagram (BDD), Gap Encoding, and Bitmap. We have also included a succinct permutation module in our framework: to store the permutations, we use the Wavelet Tree on Runs (WTR) structure. Finally, general compression techniques, such as Burrows-Wheeler Transform (BWT) based compressed arrays and Optimized Arrays, are employed in the framework. Other representations can conveniently be added to the framework.

Different MPI users have different demands with respect to time and space consumption; they may therefore choose among three strategies to suit their needs: the Time, the Space, and the Hybrid strategies. While the Time strategy tunes the framework to optimize lookup times, the Space strategy strives to achieve the lowest space consumption possible, and the Hybrid strategy finds trade-offs between time and space. The majority of the representations in our framework use an auxiliary structure to offer faster lookup times.
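Before turning to the individual auxiliary structures, the set-plus-permutation decomposition above can be sketched as follows; the function names and list-based layout are illustrative, not the framework's actual API:

```python
# Illustrative sketch of the map = set + permutation decomposition.
# A group map lists world ranks in communicator-rank order.

def decompose(group_map):
    """Split a group map (group rank -> world rank) into a sorted set
    and a permutation over that set."""
    members = sorted(group_map)                      # the set component
    index_of = {w: i for i, w in enumerate(members)}
    perm = [index_of[w] for w in group_map]          # the permutation component
    return members, perm

def lookup(members, perm, rank):
    """Translate a group rank back to its world rank."""
    return members[perm[rank]]

group_map = [40, 10, 30, 20]          # world ranks in communicator order
members, perm = decompose(group_map)
assert members == [10, 20, 30, 40]
assert perm == [3, 0, 2, 1]
assert [lookup(members, perm, r) for r in range(4)] == group_map
```

When the map is already monotonically increasing, the permutation is the identity and only the set component needs to be stored, which is exactly the case where a compact set representation suffices.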
BDD uses an auxiliary structure to find the first set bit of the targeted rank; Bitmap uses a prefix sum array and performs a binary search on it to narrow down the search space; and BWT caches the last decompressed block. Using these auxiliary structures, the representations perform only one or two orders of magnitude below an array implementation of communicators.

After development, the framework was tested both inside and outside MPI. The experiments outside MPI demonstrate that the framework can compress a wide range of communicators and support fast lookup operations. They revealed Bitmap to be a fast module when the communicator is dense, Optimized Arrays to be a good option for sparse communicators, and BDDs to be slow structures for lookup operations (unless the size of the underlying BDD is small). Gap Encoding demonstrated a good trade-off between time and space, without large variations across communicators of different sizes. The density of the communicator was the only influential factor in Gap Encoding lookup times: communicators with larger gaps take more space per member and, as such, have a higher rate of cache misses. Testing the framework inside FG-MPI proves not only that the framework is capable of being integrated into MPI middleware, but also that the time overhead is often negligible and that in some applications the framework is even faster.

The approach outlined herein could also be used to extract more complex behavioral patterns in the representation techniques. In its current implementation, the framework tries all available modules to choose the best representation. This operation is costly, especially for representation algorithms such as BDD and BWT. To compensate, an estimation function can be added to each module that returns an estimate of the space required to represent a communicator and of the time it takes to perform lookup operations on it. This information, in turn, helps the framework choose the proper representation more effectively.
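As an illustration of the Bitmap auxiliary structure mentioned above (a sketch under simplified assumptions, not the thesis implementation), a block-level prefix sum lets a lookup binary-search its way to the right block instead of scanning every bit:

```python
# Sketch of a Bitmap lookup accelerated by a prefix sum array.
# BLOCK is tiny here for readability; a real implementation would use
# machine-word-sized blocks.
import bisect

BLOCK = 4  # bits per block (illustrative)

def build(bits):
    """prefix[i] = number of set bits in blocks 0..i-1."""
    prefix = [0]
    for start in range(0, len(bits), BLOCK):
        prefix.append(prefix[-1] + sum(bits[start:start + BLOCK]))
    return prefix

def select(bits, prefix, rank):
    """World index of the member with group rank `rank` (0-based),
    i.e. the position of the (rank+1)-th set bit."""
    b = bisect.bisect_right(prefix, rank) - 1   # block holding the bit
    seen = prefix[b]                            # set bits before block b
    for i in range(b * BLOCK, len(bits)):
        if bits[i]:
            if seen == rank:
                return i
            seen += 1
    raise IndexError(rank)

bits = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
prefix = build(bits)
assert [select(bits, prefix, r) for r in range(6)] == [0, 3, 5, 6, 10, 11]
```

The binary search narrows the scan to a single block, which is the same effect the prefix-sum auxiliary structure has in the Bitmap module.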
The future inclusion of a hierarchical representation of communicators might likewise enhance the framework. Currently, each communicator is stored with respect to MPI_COMM_WORLD, the communicator that contains all processes in the system. In theory, however, storing a communicator as a subset of its parent, rather than as a subset of MPI_COMM_WORLD, requires less space: the parent tends to be smaller than MPI_COMM_WORLD, and the smaller the superset, the smaller the amount of space required to represent the subset.

Another future improvement to this work is the addition of an extension to MPI that re-maps MPI_COMM_WORLD to a communicator in which processes that communicate at a higher rate are physically closer to one another. Communication among processes distributed across nodes is more expensive than communication among processes inside the same node, so re-mapping MPI_COMM_WORLD may increase the proportion of communication among nearby processes. Although these and the other additions discussed above indicate the benefits of future revisions, the proposed framework already offers a compressed representation of communicators and is poised to optimize MPI libraries for interested users.
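The hierarchical idea above can be sketched as parent-relative re-indexing; the helper names below are hypothetical:

```python
# Sketch of hierarchical (parent-relative) communicator storage: a child's
# members are stored as positions within its parent, so the indices live in
# a much smaller universe than the world ranks do.  Illustrative only.

def relative_to(parent, child):
    """Re-index `child` (world ranks) as positions within `parent`."""
    pos = {w: i for i, w in enumerate(parent)}
    return [pos[w] for w in child]

def to_world(parent, rel):
    """Recover world ranks from parent-relative indices."""
    return [parent[i] for i in rel]

parent = list(range(100_000, 100_016))     # a small parent communicator
child = [100_001, 100_004, 100_009]
rel = relative_to(parent, child)
assert rel == [1, 4, 9]                    # drawn from a universe of 16,
                                           # not from a world of 200,000
assert to_world(parent, rel) == child
```

Any of the set representations (Bitmap, Gap Encoding, and so on) applied to the relative indices would then operate over the parent's size rather than the world's, which is the source of the space saving.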

