MPI Collective Operations over Myrinet

by

Qianfeng Zhang

M.E., Huazhong University of Science and Technology, China, 2000

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
June 2002
© Qianfeng Zhang, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

Collective communication is an important subset of the Message Passing Interface. Improving the performance of collective communication can contribute greatly to the performance of MPI applications. MPI-NP II is an MPI-specific messaging system for PC clusters. By integrating collective communication support into the communication layer of MPI-NP II, we extended the system to MPI-NP II+.

MPI-NP II+ first adds functions that are essential for supporting a complete MPI application and that were not provided by MPI-NP II: Any_Source message matching, multiple local processes and multiple communicators. These functions are designed and implemented efficiently so that they do not unduly reduce the performance of message passing. For a point-to-point message, MPI-NP II+ still has a minimum message latency of 42 microseconds and a maximum end-to-end bandwidth of 89 MB/s, comparable to the performance reached by the incomplete MPI-NP II implementation.

Three collective communication operations are added. By using NIC-level message forwarding, the performance benefits for these operations are clear. On a system of 8 nodes, the NIC-based MPI_Barrier is 4 times better than a host-based implementation over the same Myrinet, and for MPI_Comm_Create the improvement factor is 2. The NIC-based MPI_Bcast is better than a host-based implementation for all message sizes, and for small messages the broadcast latency improves by a factor of 2 to 3. Moreover, for all three operations the NIC-based implementation scales better than the host-based implementation.

MPI-NP II+ extends the concept of the microchannel. By using a special microchannel, we were able to support three collective communication operations while preserving the semantics of the MPI-specific communication layer.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Methodology
  1.3 Synopsis
2 Background
  2.1 Message Passing Interface
  2.2 LAM
  2.3 Myrinet
  2.4 MPI-NP II
  2.5 Related Work
    2.5.1 NIC-Based Barrier over Myrinet/GM
    2.5.2 FM/MC Multicast Protocol
    2.5.3 LFC Multicast Protocol
    2.5.4 Other Work on Collective Communication
3 Design
  3.1 Extending MPI-NP II
  3.2 Any_Source Matching
  3.3 NIC-based Collective Communication
    3.3.1 Host-based and NIC-based
    3.3.2 Data Structure
    3.3.3 MPI_Barrier
    3.3.4 Multiple Communicators
    3.3.5 MPI_Bcast
    3.3.6 Concurrent Operations and Deadlock
  3.4 Multiple Local Processes
    3.4.1 Process Synchronization
    3.4.2 Loopback Communication
4 Evaluation
  4.1 Bandwidth and End-to-End Latency of Unicast
  4.2 Performance with Any_Source Matching
  4.3 Performance of MPI_Barrier and MPI_Comm_Create
    4.3.1 MPI_Barrier
    4.3.2 MPI_Comm_Create
  4.4 Performance with MPI_Bcast()
    4.4.1 Broadcast Latency
    4.4.2 Host Overhead
  4.5 Performance of Loopback Communication
5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
Bibliography

List of Tables

4.1 End-to-End Performance of Unicast
4.2 Average Host Overhead of MPI_Barrier
4.3 Average Host Overhead of MPI_Comm_Create
4.4 Broadcast Latency of MPI_Bcast with 8 Nodes
4.5 Host Overhead at Root Node of MPI_Bcast with a 7-Node System

List of Figures

2.1 MPI-NP II Architecture
2.2 Message Progress in a Channel
3.1 The Organization of Messages and Requests
3.2 Host-based and NIC-based Message Forwarding
3.3 Gather and Broadcast in a Binary Structure
3.4 P/V Lock Method using Shared Variables
4.1 Communication and Synchronization Between Process 0 and n-1
4.2 Improvement Factor of MPI_Barrier with Different Numbers of Nodes
4.3 Improvement Factor of MPI_Comm_Create with Different Numbers of Nodes
4.4 Improvement Factor of MPI_Bcast with Different Numbers of Nodes

Acknowledgements

I would like to thank my supervisor, Dr. Alan Wagner, for his persistent supervision and guidance, without which it would have been impossible to finish this work. I would like to thank Chamath Indika Keppitiyagama; his earlier work formed the basis of my research, and discussions with him helped me to understand the problems better. I would like to thank Dr. Norm Hutchinson for his time and valuable comments. I would like to thank Kirsty Barclay for taking the time to proofread the writing. I am also grateful to the members of the DSG who helped me solve technical problems or helped me with the writing tools.

QIANFENG ZHANG

The University of British Columbia
June 2002

Chapter 1

Introduction

1.1 Motivation

In recent years, a focus of parallel computing has been on cluster computing. A cluster, or network of workstations, is a collection of personal computers connected by a local network. Due to their high performance/cost ratio, clusters have emerged as a powerful alternative to massively parallel machines. Message passing is the natural choice of programming paradigm for most applications developed on clusters. As a result, improving the performance of message passing on these machines is a critical step in accelerating the use of clusters for parallel computing.

Network technology has advanced to the point where a variety of network technologies can now provide a bandwidth of 1 Gigabit per second or higher [8, 24].
In addition, the appearance of programmable network processors, such as Myrinet [24], provides an opportunity to further support message passing with an on-board processor and memory. By mapping the network interface card directly into user space, the operating system is removed from the critical path of message transfer, and the traditional TCP/IP protocol stack is bypassed by simpler communication protocols that are better adapted to system area networks. The network processor can perform part of the work of the underlying communication protocol. All of these advances can lead to dramatic improvements in the performance of message passing applications. The challenge is to find the best way to deliver the capabilities of high-speed physical links and programmable network processors to MPI applications.

Many MPI libraries were developed using the low-level Myrinet interface [23, 33, 13]. MPI-NP II [20, 19] is one of those systems that implemented the MPI library over Myrinet [24]. It differs from other MPI systems in that MPI-NP II uses an MPI-aware Myrinet interface: the MPI envelope is recognized by the network interface level code, and messages are matched on the network interface. An MPI-aware low-level interface has several advantages over a general-purpose communication layer in reaching the goals stated by the MPI standard of zero copy, offloading communication to a communication coprocessor, and allowing for the overlapping of computation and communication [19]. MPI-NP II has made a successful step toward realizing these goals, as it provides non-blocking communication as asynchronous communication, and it can deliver large messages without any intermediate copy (i.e., zero copy). Moreover, by using the concept of the channel and microchannel [19], MPI-NP II greatly simplified the design of its NIC-level code, which tends to be more complicated when it is MPI-aware.

However, there remain other opportunities to improve the performance of MPI applications. MPI libraries include a large set of collective operations. These operations usually involve dialogues between the MPI nodes. Considering the context of a network processor, it is obvious that the communication time between two Network Interface Cards (NICs) is less than that between two host processes. Therefore, a basic question to answer is: is it beneficial to offload portions of the collective communication to the network processor? This question is more attractive when one observes that the physical capabilities of the network interface are growing very quickly. In particular, many of the MPI operations are not compute-intensive, so it is quite possible that offloading some of these operations onto the NIC can improve their performance while not unduly overloading the network processor.

A few projects have implemented collective operations at the NIC level [34, 3, 1], and clear performance gains are reported by these systems. However, a feature of these systems is that the communication layer on the NIC is still not MPI-aware. Compared to point-to-point message passing, collective operations are talkative, and different collective operations have different protocols. Therefore, to execute a collective operation at the NIC level, the NIC has to be completely aware of the operation. With a general-purpose communication layer, having this awareness usually leads to more overhead on the NIC.
We believe that by using the MPI-aware communication layer of MPI-NP II, and by further making these collective operations MPI-specific, we can implement them at the NIC level with less overhead. Another common feature of those systems is that each supports only one collective operation; none of them considers supporting multiple collective operations in an integrated way.

In this thesis, we evaluate the design and implementation of MPI-aware collective communication. We wish to determine which collective operations can benefit most and to experimentally measure the potential performance gains.

1.2 Methodology

This work extends MPI-NP II and, with respect to MPI-NP II, has the following goals:

• Completely consistent with MPI-NP II, the system is specific to MPI, although the techniques can be used by other message passing systems.

• Our new system completely preserves the advantages of MPI-NP II: NIC-level message matching to promote the overlap between computation and communication, and the use of channels to simplify flow control.

• The focus is on those collective operations that can benefit the most from being implemented on the NIC.

The contributions of this thesis can be divided into two parts. The functions Any_Source matching, multiple local process support and multiple communicator support are direct extensions to MPI-NP II, and they are necessary to comply with the MPI standard. The collective communication support is our further probe into utilizing the ability of programmable network processors to support MPI. We call the new system based on MPI-NP II the MPI-NP II+.

The software architecture of MPI-NP II+ is the same as that of MPI-NP II: it includes LAM/MPI (the upper layer and the Request Progression Interface), the Host Interface Library (HIL) and the Myrinet Control Program (MCP). All of the functions are implemented by extending the Request Progression Interface (RPI), HIL and MCP code. Before starting the functional extension, it was first necessary to port MPI-NP II from Linux kernel 2.0.x to Linux kernel 2.4.x, which required some changes to the Myrinet PCI driver.

1.3 Synopsis

Chapter 2 introduces background knowledge and related work. Chapter 3 describes the issues and solutions for the design and implementation of Any_Source matching, three collective operations, multiple communicator support and multiple local process support. Chapter 4 evaluates the performance of those functions according to their respective characteristics. Chapter 5 gives the conclusions of the thesis and discusses possible future work.

Chapter 2

Background

2.1 Message Passing Interface

Message passing has been the most widely used parallel programming paradigm in distributed memory systems, since it is a good match between the distributed memory model and the distributed hardware [10]. There are two ways to integrate message passing into a language. The first is to write parallel programs using specialized coordination languages such as Fortran M [12]. These languages are constructed so that concurrency, communication and synchronization are provided directly by their grammars. The second way of using message passing is to call a message passing library from within a program written in a general-purpose sequential language such as C or Fortran. This method has proved to be very popular because of its flexibility in modeling the detailed parallel application environment and because of the existing code base and familiarity of using C and Fortran.
Over the past few years a number of message passing libraries have been developed and used. Many of these message passing libraries are platform specific, so user programs developed on one implementation of an interface cannot be directly used on another platform. Thus, to develop portable applications, a non-vendor-specific message passing library is necessary. The Message Passing Interface (MPI) is such a message passing library. It is the result of a joint standardization effort by industrial, governmental, and academic organizations.

MPI [11, 25] has a rich collection of communication routines. These include point-to-point communication routines, global data movement routines and global computation routines. All of these data operation routines take a datatype as one argument. The datatype argument can be a primitive datatype such as an integer or floating point number, or it can be a complicated datatype such as an array or a structure. MPI provides a number of routines to create compound datatypes from the primitive datatypes. The rich set of communication routines and the support for compound datatypes make MPI suitable for programming parallel applications that involve the frequent exchange of data and complicated operations on that data. The richness of MPI communication supports parallel programming far better than lower-level communication libraries such as TCP/UDP, which only support the transfer of streams of bytes. Another aspect differentiating MPI from TCP/UDP is portability: applications written using TCP/UDP sockets can only be used on TCP/IP networks.

All MPI communication routines are executed in the context of an object called a communicator. When a communication routine is called, a communicator is passed as an argument. A message can be communicated only when the communicators of the send and receive routines match. Messages with different communicators are insulated from each other. Therefore, communicators provide a scoping mechanism that is necessary for supporting the independent development of third-party software libraries. By using separate communicators, message passing by one library cannot be unintentionally received by a communication call of a different library.

MPI communication operations can be blocking or non-blocking. For a blocking operation, the return of the function means that the user buffer can be reused; for non-blocking operations that is not true, since non-blocking calls can return even if the operations are not finished. Non-blocking operations provide the possibility of overlapping computation and communication.

For point-to-point communication, the sender has to explicitly specify the receiver. However, the receiver can use the Any_Source wildcard to receive messages from any source in the communicator. The Any_Source wildcard is especially useful for the single-to-multiple communication model. Every point-to-point routine also specifies a tag parameter. The tag differentiates different types of messages in one communicator. The Any_Tag wildcard is also allowed for the receiver.

In an MPI collective operation, all the processes in the specified communicator must call the same function. For example, with MPI_Bcast, not only does the sender of the broadcast call this function, but all other processes in the group need to call it in order to receive or forward the message. MPI internally reserves a constant tag for every collective operation, and these reserved tags should not be used by point-to-point operation routines.
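As a concrete illustration of these point-to-point and collective conventions, the following fragment is a minimal sketch of our own (not taken from this thesis); the ranks, tag value and buffer sizes are arbitrary choices. It shows a receive using the Any_Source wildcard together with a broadcast in which every process of the communicator makes the same call.

#include <mpi.h>

/* Sketch only: ranks, the tag (99) and buffer sizes are arbitrary. */
int main(int argc, char *argv[])
{
    int rank, data = 0, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Receive from whichever process answers first: the source is
           wildcarded, but the tag and communicator still have to match. */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD, &status);
    } else {
        data = rank;
        MPI_Send(&data, 1, MPI_INT, 0, 99, MPI_COMM_WORLD);
    }

    /* Collective call: every process in the communicator calls MPI_Bcast,
       including those that only receive or forward the message. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}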
2.2 LAM

There are currently two public domain implementations of MPI. One is LAM [14], developed by the Ohio Supercomputing Center; the other is MPICH [17], developed by Argonne National Laboratory. Our work is based on LAM and we do not discuss MPICH.

The LAM library is implemented in two layers [22]. The portable layer provides a direct interface for user programs and is independent of the details of the lower-layer communication system. The lower layer is interfaced to the upper layer through the Request Progression Interface (RPI). The RPI itself is not a single layer; it is a model used by the upper layer to drive the lower-layer communication system. Different lower-layer communication systems provide their own RPI functions, which are called by the request management functions in the upper layer.

In the current LAM implementation (LAM 6.1), there are two RPIs: Lamd-RPI and C2C-RPI. The Lamd-RPI and C2C-RPI represent two different lower-layer communication modes. With Lamd-RPI, every communication goes through the lamd daemon, so the efficiency is lower, but it provides a way to monitor and debug the communication. C2C is client-to-client; with C2C-RPI, the processes send and receive messages directly to and from each other. MPI users choose which RPI to use through an option to the mpirun command. Furthermore, three C2C-RPIs are provided in the LAM source distribution: tcp, sysv and usysv. The tcp C2C-RPI uses TCP/IP Internet domain sockets as the lower-layer communication interface. The other two use shared-memory communication, where the difference is the locking method used for mutexes. The selection of the C2C implementation is made when LAM is configured and compiled.

LAM implements point-to-point message passing using the Request Progression Interface. Every point-to-point operation call has an MPI request to represent it, and all the requests are linked into a list. The portability layer provides many functions to manage the MPI requests. The process of completing a point-to-point operation includes four major stages of an MPI request:

• Build the request. Completed by _mpi_req_build().

• Start the request. Completed by _mpi_req_start().

• Advance the request. Completed by _mpi_req_advance().

• Complete the request. Completed by _mpi_req_destroy().

All of the above request operation functions may call functions provided by the specific RPI. The _mpi_req_build() and _mpi_req_start() functions only finish the work necessary before the data transfer. The actual data transfer is done by _mpi_req_advance(). For blocking operations, _mpi_req_advance() calls the specific RPI advance function repeatedly until the request for this operation is completed. For non-blocking operations, _mpi_req_advance() also calls the specific RPI advance function repeatedly, but the RPI advance function tries to advance all the requests in the list, and _mpi_req_advance() stops the repetition only when no request in the list can be advanced further. As a result, with non-blocking operations, a request can be advanced when other operations are called. This provides an opportunity to asynchronously advance non-blocking communication, which is important for promoting the overlap of computation and communication.
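The following pseudocode-style C sketch shows how a blocking send might drive these four stages. The request and list types, the function signatures, and the helpers req_done() and rpi_advance() are simplified assumptions for illustration only; they are not LAM's actual declarations.

/* Simplified sketch of LAM-style request progression for a blocking send.
   Types and signatures are illustrative assumptions. */
struct req;                        /* an MPI request, linked into a global list */
extern struct req *req_list;       /* all outstanding requests                  */

extern struct req *_mpi_req_build(void *buf, int count, int dest, int tag);
extern int  _mpi_req_start(struct req *r);
extern int  rpi_advance(struct req *list);  /* RPI-specific: tries to advance
                                               every request in the list        */
extern int  req_done(struct req *r);
extern void _mpi_req_destroy(struct req *r);

int blocking_send(void *buf, int count, int dest, int tag)
{
    struct req *r = _mpi_req_build(buf, count, dest, tag);  /* stage 1: build   */
    _mpi_req_start(r);                                      /* stage 2: start   */

    /* Stage 3: advance. A blocking call keeps driving the RPI until *this*
       request completes; along the way, other pending non-blocking requests
       in the list may make progress as well. */
    while (!req_done(r))
        rpi_advance(req_list);

    _mpi_req_destroy(r);                                    /* stage 4: destroy */
    return 0;
}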
In LAM/MPI, collective communication primitives are constructed by calling point-to-point communication functions. Therefore, every point-to-point call inside a collective operation is advanced separately.

2.3 Myrinet

Myrinet [24] is a high-speed, switched, wormhole-routed physical network technology. Similar to Ethernet, the physical devices of Myrinet include the cables, the Myrinet interface cards and the Myrinet switches. The Myrinet network interface is attached to the host through the system's I/O bus. All network interface cards are connected to crossbar switches through high-speed full-duplex links. The data rate of a link is 1.28 Gb/s.

Myrinet packets are routed by the switches using wormhole routing. Each intermediate switch forwards the packet to the desired output port without waiting for the entire packet to be assembled [16]. Thus it is possible for a packet to stretch across several switches and links at any time. As a result, the length of a packet is not limited by the physical network. Of course, not every packet length is permitted; other factors, such as the Myrinet control software, may limit the length of packets.

Myrinet switches use backpressure for flow control. When the desired output port is occupied by another packet, the incoming packet is temporarily blocked, and the blocking information is propagated upstream using backpressure. One problem with this flow control method is the possibility of deadlock. Deadlock occurs when packets are blocked by other packets and cycles of contention for output ports arise. In systems like MPI-NP II+, all the Myrinet interfaces use a single routing table, and the table is statically designed to avoid cycles. Therefore, in such systems, no deadlock problem exists.

With Myrinet, the error rate of packets is very low. Given that the error rate is on the order of a memory error, we do not perform any error checking. The reliability of Myrinet greatly simplifies the design of message-handling software.

The major components of a typical Myrinet interface card are a network processor, three DMA engines, an on-board SRAM and several registers. The on-board memory and the registers can be mapped into the virtual address space of the host processes; thus host processes can access data in the SRAM using programmed I/O. One of the DMA engines (hDMA) is used for delivering data between host memory and the NIC. The other two DMA engines (sDMA and rDMA) are used by the NIC to send data to the network and to receive data from the network, respectively. These engines are all operated by the Myrinet Control Program running on the network processor. The hDMA is the only way for the MCP to access host memory.

With the SRAM mapping, Myrinet provides a way to support user-level network communication [28], by which the operating system is removed from the critical communication path, providing the opportunity to reduce communication latency. By customizing the Myrinet Control Program, the ability of the network processor can be utilized. By moving communication work from the host to the network processor, it is possible to overlap communication and computation. Moreover, since the communication latency between two NICs is smaller than that between two host processors, the performance of the communication can be improved.
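To make the idea of user-level access concrete, the fragment below is a schematic sketch of how a host process might map the NIC's SRAM into its address space and then use programmed I/O to post a descriptor without any system call on the data path. The device path, mapping size and register offset are invented for illustration; they are not the real Myrinet driver interface.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define NIC_SRAM_SIZE  (1 << 20)      /* assumed 1 MB SRAM mapping          */
#define SEND_RING_OFF  0x4000         /* hypothetical send-ring offset      */

int main(void)
{
    /* "/dev/myrinic0" is a placeholder device node, not the real name. */
    int fd = open("/dev/myrinic0", O_RDWR);
    if (fd < 0)
        return 1;

    /* Map the on-board SRAM once; after this, no kernel involvement is
       needed on the critical communication path. */
    volatile uint32_t *sram = mmap(NULL, NIC_SRAM_SIZE,
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (sram == MAP_FAILED)
        return 1;

    /* Programmed I/O: store a (hypothetical) descriptor word directly into
       NIC memory, where the MCP's control loop will notice it. */
    sram[SEND_RING_OFF / sizeof(uint32_t)] = 0x1;

    munmap((void *)sram, NIC_SRAM_SIZE);
    close(fd);
    return 0;
}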
2.4 MPI-NP II

MPI-NP II [19, 20] is a LAM-MPI port implemented over Myrinet. The original goal of MPI-NP II was to simplify the design and improve the performance of MPI-NP [35], which is also a network-processor-based Message Passing Interface over Myrinet. MPI-NP II, like MPI-NP, covers the entire system, which includes the MPI library, the Host Interface Library (HIL) and the Myrinet Control Program. The HIL is the interface through which the host program interacts with the MCP. The HIL is used by the MPI library through the Request Progression Interface; a Myrinet-specific RPI is added to integrate the HIL into MPI. Figure 2.1 shows the architecture of MPI-NP II and MPI-NP.

Figure 2.1: MPI-NP II Architecture

In MPI-NP II, the Myrinet Control Program can recognize an MPI message according to the envelope of the message, and envelope management and message matching are done on the NIC. MPI-NP II calls this characteristic of its communication layer MPI-aware. This feature makes MPI-NP II different from other Myrinet-based MPI implementations (e.g., MPI-BIP [27], MPI-FM [23]), whose low-level communication layers are designed to support many upper-level protocols and do not have access to the envelopes on the NIC.

The MPI channel is the most important abstraction in MPI-NP II. Every channel in MPI-NP II is a bidirectional connection between two specific MPI processes. Channels are built at the NIC level and managed by the MCPs. When an MPI user program is started, a channel is set up between the local process and every other process. So if the number of processes is n, there will be n-1 local channel structures for each local process on the local NIC. The building of a channel connection is a three-way handshake similar to a TCP connection.

The channel pool is managed statically in NIC memory. Each allocated channel structure stores one side of a channel; there are a fixed number of send, receive and receive request slots. Every send or receive slot holds an envelope of a point-to-point message. Every request slot stores a point-to-point receive request. There is a one-to-one relationship between the send slots on the sender side and the receive slots on the receiver side of a channel. A message envelope from a send slot on the sender NIC is always sent to a specific receive slot on the receiver NIC. A send-receive slot pair uniquely identifies one of the paths inside a channel; MPI-NP II calls this path a microchannel. One channel includes many microchannels. Figure 2.2 shows the various states of different microchannels inside one channel; the figure is borrowed directly from the thesis A Network Processor Based Message Manager [19].

Figure 2.2: Message Progress in a Channel (send slots, receive slots and the control link)
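To make the channel layout easier to picture, the following C sketch shows one plausible shape for the per-channel structure kept in NIC SRAM. The field names, slot count and envelope layout are assumptions for illustration, not MPI-NP II's actual declarations.

/* Hypothetical layout of one side of an MPI-NP II channel in NIC SRAM.
   Names and sizes are illustrative assumptions only. */
#define SLOTS_PER_CHANNEL 8     /* "channel width": send/receive/request slots */

struct envelope {               /* MPI envelope as matched on the NIC          */
    int src_rank;
    int context_id;             /* communicator context                        */
    int tag;
    int length;                 /* payload length; small payloads held in-line */
};

struct channel {
    int peer_node;                                  /* remote NIC              */
    int peer_process;                               /* remote MPI process      */

    struct envelope send_slots[SLOTS_PER_CHANNEL];  /* outgoing envelopes      */
    struct envelope recv_slots[SLOTS_PER_CHANNEL];  /* incoming envelopes      */
    struct envelope req_slots[SLOTS_PER_CHANNEL];   /* posted receive requests */

    int unmatched_msgs;     /* head index of the unmatched-message list        */
    int unmatched_reqs;     /* head index of the unmatched-request list        */
    int free_send_slots;    /* a free send slot implies a free receive slot on
                               the peer, which is what drives flow control     */
};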
The flow control in MPI-NP II is very simple. Since every channel is used exclusively by two specific processes, the channel is the basic unit of flow control, and messages passing through different channels have no effect on each other. Due to the one-to-one slot relationship, a free send slot on the sender side always indicates a free receive slot on the receiver side. The flow control is receiver-side based. When sending a message, the local process checks whether there is a free send slot in the corresponding channel space (the channel connecting it to the destination process); if not, it is temporarily blocked. If there is a free send slot, the local process copies the message envelope to this slot and inserts a send descriptor into the send ring on the NIC. In its control loop, the MCP checks the send ring and sends the message envelope out to the remote node. The receiver MCP receives the message envelope and saves it in the corresponding receive slot of the channel. Using the message envelope at the header of the message, the receiver MCP tries to match the new message with any pending requests in the same channel. Whenever a message is matched, its payload is uploaded to the host memory of the receiver process and a SLOT_FREE message is sent to the sender NIC, after which the send slot is freed and reusable. The allocation and release of the request slots are similar to those of the send slots.

Message matching in MPI-NP II is done at the NIC level. All unmatched messages and all unmatched requests are linked into two queues inside the channel. Matching always goes between the unmatched message queue and a new receive request, or between the unmatched request queue and a newly arriving message. A new receive request is not directly put into a request slot; it is held in a receive entry structure for immediate matching. If there is no match, the envelope is put into a request slot and linked to the unmatched request queue to wait for further matches against incoming messages.

In MPI-NP II, the NIC does very little work to manage data buffers for messages. Small messages and envelopes are held in the channel slots, so the buffer space for small messages is statically divided among the communicating processes when channels are initialized. For large messages, the payload is delivered directly from the sender's memory to the receiver's memory using DMAs. Before the delivery, the host buffers on both sides are pre-pinned by the host program. The NIC is used only as a DMA staging area and as a result achieves zero copy for large messages.

Sending a large message is different from sending a small message. The payload of a large message is sent only after the envelope of the message has been sent to the receiver and has been matched. For a small message, the payload is stored inside the envelope and can be sent directly to the receiver side.

MPI-NP II also provides support for MPI broadcast at the NIC level. Unlike the naive point-to-point implementation of MPI_Bcast(), in MPI-NP II the broadcast message is copied from the host to the NIC only once. Once on the NIC, the message is simply added to the appropriate queues for the different destinations.

In MPI-NP II, MPI message passing semantics are realized and provided transparently to the user program. Messages and receive requests are inserted into their waiting queues and matched in FIFO order. Messages with different tags may be received in any order, as allowed by MPI semantics. The separate transfer of the data and the envelope of a large message is also consistent with MPI semantics, since matching is done according to the envelopes, and in-order delivery is preserved.

2.5 Related Work

Each of the following projects has implemented one or more collective communication functions at the NIC level. A common feature of these projects is that their collective communication functions are implemented over a generic low-level communication layer. However, the design and implementation of these systems make them different.
2.5.1 NIC-Based Barrier over Myrinet/GM

To my knowledge, Darius's NIC-based Barrier over Myrinet [3, 5] is the first work (and the only one other than ours) to implement barrier synchronization on the NIC. This NIC-based Barrier is implemented as an addition to Myricom's GM message passing system. Two procedures were added to GM: gm_provide_barrier_buffer() and gm_barrier_with_callback(). The gm_provide_barrier_buffer() procedure prepares a receive token for the NIC and then waits for the completion of the Barrier operation by checking the value of the token. The gm_barrier_with_callback() procedure prepares a send token and queues it in the send queue for the NIC to process. The send token contains information describing the nodes and ports with which to exchange messages. The NIC, when checking the send queue, performs a Barrier operation according to the information provided by the send token. When the operation is finished, the NIC returns the receive token to the process. The process, which waits for the receive token, then returns from polling.

The system used MPICH/GM to test the Barrier at the MPI level. The gmpi_barrier() function is the GM channel interface function provided to the upper level for the Barrier operation. According to the Barrier algorithm used, this function determines the nodes with which to exchange information. It then calls MPID_DeviceCheck() to poll until a send token and a receive token become available. Then gm_barrier_with_callback() and gm_provide_barrier_buffer() are called sequentially.

In the test experiments, two Barrier algorithms are used. One is Pairwise Exchange (PE), the algorithm used by MPICH's host-based Barrier. The other is Gather-and-Broadcast (GB), an algorithm similar to LAM's host-based Barrier implementation. The MCP is aware of which algorithm is used, but the algorithm overhead is shared between the host and the NIC. In particular, for GB the host determines the tree structure used; the NIC only needs to know the nodes with which to exchange messages.

Judging by the experimental results, the NIC-based Barrier has clear performance gains over the host-based Barrier using both GB and PE. For 16 nodes using LANai 4.3 NIC cards, the GM-level Barrier latency shows an improvement factor of 1.78 over the host-based Barrier with the same algorithm; for 8 nodes using LANai 7.2 cards, the improvement factor is 1.83. For 8 nodes, with the MPI-level Barrier, the improvement factor can be as much as 2.2.

Though the advantages of the NIC-based Barrier are obvious, the performance is less than one might expect, especially when looking at the timing diagram in Article [3]. The Barrier is an operation with no computation, so better performance gains for a NIC-based implementation should be possible. GM is a general-purpose communication layer: the concept of a communicator (which is required by MPI_Barrier) is not supported by the GM library and its MCP. Therefore, the system has additional overhead in executing concurrent Barriers.

2.5.2 FM/MC Multicast Protocol

FM/MC [34] is the first work that tried to perform multicast at the NIC level. It is implemented by extending FM [30, 29]. The emphasis in FM/MC is on flow control and the management of buffer space. The flow control policy used by FM/MC is preventive. Because of the limited memory space on the NIC, FM/MC did not use a pair-wise flow-control protocol, which requires dividing the receive buffer among all potential senders.
FM/MC uses flow control based on centralized credit management. A multicast source has to obtain credits before it can send multicast fragments. A single credit stands for the right to multicast one fragment to all destinations; it represents the buffer space for one fragment at all of the destinations. The credits are managed by a central manager and are distributed dynamically according to requests. The credits are recycled using a token protocol. Created by the credit manager, the token passes through all nodes, and at each node it discovers the number of multicasts processed since the token was last recycled. The credit manager sends out the token for recycling whenever the number of credits in the credit pool drops below a pre-defined level.

Flow control is done by the MCP. On each NIC there is a special counter variable used to represent the usage of the host's DMA memory. Every time the FM/MC library processes a fragment in the host DMA memory, it increments the counter. Thus, through the token rotation, the MCP of the credit manager can collect information about the host DMA memory on all nodes. In this way, flow control over the host DMA buffers can be done by the MCPs, without interaction with the host processes.

When starting a multicast, the application explicitly tells the FM/MC library the structure of the multicast group. The FM/MC library on each node then records the group membership information on its NIC. Each NIC implements the group as a spanning tree rooted at the multicast source, and thus forwards data according to the tree structure. Data to be sent is handled as a sequence of fragments by the FM/MC library, which copies each fragment to the send queue on the NIC. For every fragment, the NIC of the root node marks it as a multicast fragment and forwards replicas of the fragment to all of its child nodes in the multicast tree. When an intermediate NIC in the tree receives a multicast fragment, it first DMAs the fragment to the local host's preallocated memory area and then immediately forwards the fragment to the NICs of all its child nodes in the tree.

Several spanning tree forwarding structures were tested by FM/MC using eight nodes. Compared to the binomial tree, iterative unicast and serial forwarding, the binary tree proved able to provide both low latency and high bandwidth.

The drawback of FM/MC is obvious. Its flow control and buffer management are greatly complicated by the limited memory space on the NIC, which has to be shared by messages from all sources. Moreover, the FM/MC system does not scale well due to the use of a central credit manager.

2.5.3 LFC Multicast Protocol

As in FM/MC, LFC [1] implements multicast by forwarding multicast messages at the NIC level. LFC's multicast protocol is closely integrated with its unicast protocol. The multicast and unicast messages share the same buffers and flow control mechanism.

LFC implements pair-wise flow control at the network interface level. Each NI (Network Interface) partitions its receive space into equal buffer areas for all other NIs in the system. Flow control is always done between two specific NIs. When an NI wants to send a packet to another NI, it only needs a send credit representing a receive buffer on the receiver NI. When the host is sending a message, it copies the message data to a free send slot in the send buffer pool and enqueues a send descriptor to the send queue.
The send descriptor contains the destination address of the message and a reference to the message data. The NI, through its control loop, polls the send queue and inspects the send descriptor. If sufficient credits for the specified destination are available, the NI moves the send descriptor to the transmit queue. All messages in the transmit queue are sent directly when the NIC polls the queue. If no send credits are available, the send descriptor is moved to the blocked-sends queue for that destination. When the NI acquires credits from some destination, the descriptors in the blocked-sends queue for that destination are removed and added to the transmit queue, so the messages for that destination can be sent. The credits are returned to the source NI either by an explicit credit update message or by means of piggybacking.

For multicast, when the NI has to forward a packet, it creates a send descriptor for each forwarding destination. Therefore, multicast flow control is done simply by converting it into several separate unicast flow controls.

One problem with LFC's multicast is that it may cause deadlock when an arbitrary spanning tree is used (a binomial tree is deadlock-free). Though a deadlock recovery method was proposed by LFC, it causes performance overhead, and the overhead may be too large to support MPI when multiple broadcast groups are used.

2.5.4 Other Work on Collective Communication

FM/MC's and LFC's multicasts are both NIC-based, where the multicast forwarding is done by the network processor. A research group at Ohio State University has also done long-term research on collective communication using network processors. Article [26] proposed many ideas for designing efficient collective communication algorithms on wormhole-routed systems. Article [21] presented the concept of a smart network interface, which emphasizes that the multicast tree forwarding should be handled completely by the network processor. Two different multicast tree algorithms are studied and compared in that work. Other NIC-based multicast implementations are also presented in [15, 7].

In [4], Buntinas presented a multi-send primitive method to support multicast. It is different from a completely NIC-based multicast: although the multi-send operation is executed by the NIC, the message needs to be received by the host before forwarding. Their idea is that the host has more flexibility in choosing the optimal multicast tree based on the detailed requirements of a particular message.

All of the previously proposed methods use the support of the network processor to send messages, which reduces the overhead of the host processor in the collective operation. Moreover, all these methods use a spanning tree structure to forward collective messages, which reduces the latency of the collective communication and makes the implementation scale well with larger system sizes. However, all these systems were designed as generic communication layers to support different upper-layer applications. We believe an MPI-specific communication layer will be better adapted to MPI applications, especially when more complicated MPI collective operations are to be supported. Moreover, none of those systems supports multiple collective operations on the NIC; each of them implemented only one kind of collective operation at the NIC level.

Collective communication, especially multicast, can also be implemented in hardware.
Article [32] presented a method to modify a switch architecture that only supports unicast message passing into a new architecture that supports multidestination message passing. Article [31] studied the performance comparison between NIC-based software multicast and switch-supported hardware multicast in an irregular network. It follows that in some situations performance can be gained through the support of switches. However, implementing multi-destination messaging in the switch has several disadvantages. The cost is prohibitive. Moreover, the switch hardware does not have the design flexibility that can be achieved with programmable network interfaces.

Chapter 3

Design

This chapter describes the design and implementation of MPI-NP II+ based on MPI-NP II. Section 3.1 explains the places where MPI-NP II needed to be extended. Section 3.2 describes the implementation of Any_Source matching added to MPI-NP II's NIC level. Section 3.3 describes the design of collective communication in MPI-NP II+. The implementation of three collective operations is described, as well as the support for multiple communicators at the NIC level. Finally, Section 3.4 describes the problem of supporting multiple local processes.

3.1 Extending MPI-NP II

By building an MPI-aware message manager on the NIC, MPI-NP II successfully reduced the overhead of the host processor in processing MPI messages. It implemented non-blocking communication as asynchronous communication, which further promoted the overlap between computation and communication. MPI-NP II's flow control is channel-based, where different channels are separated from each other. MPI-NP II's channel structure provided a simple and solid foundation for implementing additional features.

Message matching is one of the main functions that was migrated down onto the NIC. Rather than duplicate the matching functionality on the host, it should be possible to extend the matching on the NIC to include the Any_Source operation. MPI-NP II's current channel structure does not easily support Any_Source matching. The major problem is that the message to be matched, assuming that the receive has not been posted, could reside in any of the channels. With the current structure, the matching procedure needs to search each of the channel queues to find the message. When the receive is posted first, this strategy is even worse, since all the queues are searched in their entirety before it can be determined that a matching message has yet to be received. The basic design goal was to augment the existing channel to permit faster matching of Any_Source, while at the same time not unduly impacting the performance of send/receive. Of course, any new structures also need to maintain the MPI message-passing semantics; in particular, in-order delivery of messages with respect to a given message queue.

MPI's collective communication operations are also obvious candidates to be supported on the NIC. MPI-NP II's broadcast did not support large messages, and the scalability of broadcast can be further improved by using a multi-level broadcast tree on the NIC. Moreover, there is no support in MPI-NP II for other collective communication operations.

The widespread use of symmetric multiprocessors as compute nodes raises problems as well. LAM over TCP/IP does support having multiple processes on a processor, and thus can take advantage of SMP nodes in the cluster.
However, having migrated functionality onto the NIC, it becomes important that the communication layer support multiple processes on a single node. The channel structure in MPI-NP II was designed to support multiple processes, but the interface to the host only supported having one process connected to the NIC. Moreover, since the NIC is mapped into the virtual memory of every process, and some data structures are logically shared by many processes, synchronization needs to be added to the host interface so that multiple processes can access the NIC cooperatively.

Another problem in supporting multiple local processes is the communication between them. One possible solution to this problem is to use the IPC shared memory mechanism on Unix/Linux, but this method requires operating system kernel calls, and the communication overhead has to be borne entirely by the host processor. Moreover, to use IPC shared memory, a new communication interface has to be provided to the upper layer, and since this interface would differ from the existing communication interface of MPI-NP II, the overall interface would become more complicated. Therefore, a new communication interface is needed to deliver data between two local processes. This new interface should be similar to the loopback interface of TCP/IP, so that it fits the existing protocol stack. More specifically, the interface should be well suited to the Host Interface Library and the link-level characteristics of MPI-NP II, and should be integrated into the existing channel architecture.

One set of MPI operations that has not been investigated in the literature for support on the NIC is the group and communicator synchronization primitives. Groups in MPI require a synchronization between the group members in order to choose a unique communicator. Obviously, once again, the underlying synchronization primitive could be supported on the NIC. Groups provide an important structuring and scoping mechanism for applications, and we are interested in supporting those applications that make extensive use of the MPI group operations. MPI-NP II only supported the single MPI_COMM_WORLD group, with no support for multiple groups. Any modestly interesting application is likely to use more than the one group, so it is important to extend MPI-NP II to support multiple groups.

3.2 Any_Source Matching

Every time a receive request is posted to the NIC, the MCP immediately tries to find a matching message in the unmatched message lists, according to the MPI communication semantics. If no matching message can be found, the receive request is moved onto an unmatched request list for further matching. Similarly, every time a new message arrives at the NIC, the MCP immediately tries to find a matching request in the unmatched request lists, according to the MPI communication semantics. If no matching request can be found, the message is added to an unmatched message list for further matching. In summary, there are two types of matches:

• Type 1. Matching a new receive request to the unmatched messages in the channels.

• Type 2. Matching a new message to the unmatched receive requests.
In the implementation, every message is transported through a specific channel, and there is a receive slot for each message in the channel; therefore, a successful match between a message Msg and a request Req can be expressed as follows:

• If Req is a one-to-one receive request, the matching condition is

  Msg.channel = Req.channel
  Msg.context = Req.context
  Msg.tag = Req.tag (or Req.tag = MPI_Any_Tag)

• If Req is an Any_Source receive request, the matching condition is

  Msg.channel.l_rank = Req.process
  Msg.context = Req.context
  Msg.tag = Req.tag (or Req.tag = MPI_Any_Tag)

Here, Msg.channel.l_rank denotes the local process that owns channel Msg.channel, and Req.process denotes the process that posted request Req. Since a one-to-one request is always attached to a specific channel, it is only matched against the unmatched messages of that channel. However, an Any_Source request may match messages from any process, so a message from any channel belonging to the process that posted the request may be successfully matched.

Figure 3.1 shows the data structures added to support Any_Source matching. The channel structures are still the same as those in MPI-NP II. Each process is allocated n-1 channel structures to build channels with all other n-1 processes. Each channel structure still has a fixed number of send slots, receive slots and request slots. With each channel there is still an unmatched message list and an unmatched request list. However, for every process, a hash table Hash_Messages[] is added to organize all the messages sent to it. Hash_Messages[] is indexed by the hash value of Msg.tag. Each entry in this hash table contains head and tail pointers for a doubly-linked queue, and all the messages in this queue have the same hash value of Msg.tag. Thus, each message Msg is linked into two queues: the unmatched message queue of its channel and the hash queue pointed to by Hash_Messages[hash(Msg.tag)].

Figure 3.1: The Organization of Messages and Requests

Also different from MPI-NP II, the unmatched message queue is doubly linked. Using doubly-linked lists makes it possible to remove matched messages from both queues in O(1) time. All Any_Source requests posted by a process are linked into a singly-linked queue. Request_Flags[] is a hash table indexed by the hash value of Req.tag, using the same hash method as Hash_Messages[]. Entry Request_Flags[i] indicates the number of Any_Source requests whose hashed tag value equals i. The unmatched request list of each channel is still the same as in MPI-NP II (not shown in Figure 3.1); all the requests on this list are one-to-one requests.

With the new data structures designed to support Any_Source matching, the Any_Source matching procedure is as follows.

Matching type 1. To match a one-to-one request, the MCP scans through the unmatched message list of the channel and compares the tags and contexts. The maximum matching time is O(m), where m is the channel width. To match an Any_Source request, the MCP uses the hash value of Msg.tag as the index into table Hash_Messages[], and searches through the hash queue of entry Hash_Messages[hash(Msg.tag)] to find a match. The hash function can be as simple as hash(tag) = tag % Table_length; here, we use the constant 20 as Table_length. If a message is matched, it is removed from both its channel's unmatched message queue and the corresponding hash queue. A sketch of this lookup follows below.
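The fragment below is a minimal C sketch of the type 1 lookup for an Any_Source request. The structure and field names (msg, req, hash_next, l_rank, and so on) are illustrative assumptions rather than the actual MCP code.

#define TABLE_LENGTH 20                 /* hash table size used in the text  */

struct msg {                            /* an unmatched message on the NIC   */
    int l_rank;                         /* local (destination) process       */
    int context;                        /* communicator context id           */
    int tag;
    struct msg *hash_next;              /* next message with same hashed tag */
};

struct req {                            /* a newly posted Any_Source request */
    int process;                        /* process that posted the request   */
    int context;
    int tag;
};

static int hash(int tag) { return tag % TABLE_LENGTH; }

/* Matching type 1, Any_Source case: walk only the hash chain for this tag
   instead of scanning every channel's unmatched message list.
   (A request using the Any_Tag wildcard cannot be hashed this way and would
   have to scan all buckets; that case is omitted from the sketch.) */
struct msg *match_any_source(struct msg *hash_messages[TABLE_LENGTH],
                             const struct req *r)
{
    struct msg *m;
    for (m = hash_messages[hash(r->tag)]; m != NULL; m = m->hash_next) {
        if (m->l_rank == r->process &&
            m->context == r->context &&
            m->tag == r->tag)
            return m;                   /* caller unlinks m from both queues */
    }
    return NULL;                        /* no match: queue r for later       */
}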
Matching type 2. The MCP first checks Request_Flags[hash(Msg.tag)] to see whether there are Any_Source requests with hashed tag value hash(Msg.tag). If not, only the unmatched request list of the channel needs to be searched. If there are Any_Source requests with that hashed tag value, both the Any_Source request queue of the destination process and the unmatched request list of the channel are searched. If a matching request is found in both queues, the one with the earliest time value is selected. The time value of a request is the value of a global counter variable at the moment the request was posted. In MPI-NP II+, the operation of posting a request is serialized; therefore, we can maintain a clock by incrementing the counter variable by 1 each time a request is posted. If the message is matched to an Any_Source request, the request is simply removed from its queue, and the value of the corresponding Request_Flags[] entry is decremented by 1.

When no match is found, the message or request is added to the waiting lists for further matching against later requests or messages. The message Msg is added to the unmatched message list and to the hash queue of entry Hash_Messages[i], where i = hash(Msg.tag). The Any_Source request Req is added to the Any_Source request list of its process, and when it is added, the value of the corresponding Request_Flags[] entry is incremented by 1. The message or request is added to the tail of the queues; thus the queues are maintained in FIFO order and MPI semantics are guaranteed.

From the preceding description it follows that, by organizing all the messages into separate queues divided by their tags, matching type 1 need not check all messages, and by using Request_Flags[], matching type 2 avoids unnecessary comparisons against all unmatched Any_Source requests.

3.3 NIC-based Collective Communication

3.3.1 Host-based and NIC-based

MPI-NP II+ implements several collective operations at the NIC level. These operations can also be done over Myrinet by executing LAM's algorithms on the host, which call MPI-NP II's unicast communication primitives. However, it should be possible to improve upon the unicast host-based algorithms by implementing the operation at the NIC level. Figure 3.2 shows a message broadcast using a binary tree forwarding structure, which represents a common situation in collective communication. As shown in Figure 3.2, with the host-based method the messages are always delivered to the host before they are forwarded to their next destinations, whereas with a NIC-based method, messages are forwarded without having to be uploaded to the host. Another difference occurs at the root node and the intermediate nodes: the NIC-based implementation only needs to copy the message from the host to the NIC once, whereas the host-based implementation copies the message for each branch.

Figure 3.2: Host-based and NIC-based Message Forwarding

Therefore, by implementing a collective operation on the NIC, many expensive host-NIC interactions can be avoided, and the performance of the operation can be improved. A more formal description of the difference between the two methods can be found in Article [3].
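As a rough back-of-the-envelope comparison (our own simplified model, not the formal one in Article [3]): let T_hn be the cost of a host-NIC crossing, T_net the NIC-to-NIC network time per hop, and d the depth of the forwarding tree. Ignoring fan-out and overlap, the time for the deepest leaf to receive the message is approximately

  host-based:  T ≈ d * (T_net + 2 * T_hn)
  NIC-based:   T ≈ d * T_net + 2 * T_hn

since the host-based scheme pays the host-NIC crossing at every level of the tree, while the NIC-based scheme pays it only once at the root and once at the receiving leaf.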
3.3.2 Data Structure

MPI-NP II+ uses the same data structures as MPI-NP II to implement point-to-point communication. In particular, the data structures representing every channel remain the same, and MPI-NP II+ also uses the concept of a channel as a basic flow-control unit. In MPI-NP II+, collective messages are also delivered through channels. At the same time, for ease of control, MPI-NP II+ separates the flow control of collective communication from that of unicast communication. To support collective messaging, MPI-NP II+ adds two data structures to MPI-NP II: the Collective Message Receive Structure (CMRS) and the Collective Operation Control Entry (COCE).

There is a CMRS on each channel structure. The CMRS receives collective messages from the other side of the channel and stores the message envelope and, for small messages, the payload. The CMRS is similar to the receive slots of an MPI-NP II channel, which are used to hold the envelopes of unicast messages. However, unlike receive slots, there is only one CMRS per channel structure.

The COCE is attached to each process and is used to control the current collective operation of the process. There is only one such structure per process. The COCE includes the following information:

• Envelope and buffer: holds the envelope and payload of a small message to be sent or forwarded.

• Status information: records the progress of the current collective operation.

• Tree structure information: the spanning tree structure used for forwarding collective messages. It stores the local part of the tree for this node, the channels connecting to the parent node and child nodes, and the number of children.

• Buffer address: the physical address and its starting position in the page table of the send/receive buffer for a large MPI_Bcast message from/to this process.

The information in the COCE is basically a combination of the information in a send slot and a request slot in MPI-NP II. There is no separate send slot attached to each channel because, for a collective operation, the same message is sent to multiple destinations, so multiple copies are not needed to do the transfer. Similarly, there is no request slot attached to each channel, because during one operation a single request slot is sufficient to control the receipt of messages from multiple branches.

Collective communication messages are always sent from the COCE of the sender to the CMRS on the receiver side of the channel. A CMRS uniquely identifies a path for a collective message packet. We call this path, consisting of a COCE and a CMRS, a collective microchannel, a special type of microchannel. As in MPI-NP II, messages in different collective microchannels do not interfere with each other. In addition, a collective microchannel is only used for collective messages.

In MPI-NP II, multiple send, receive and request slots in each channel structure are necessary to support asynchronous communication. In MPI-NP II+, there is only one CMRS in each channel structure for the following reasons:

• Among the three collective operations supported by MPI-NP II+, two are synchronous, which means the caller process returns only when the global operation is done, or close to done, on every process. When one synchronous operation is finished, all the involved CMRSs are reusable for a subsequent collective operation.

• MPI_Bcast, as defined by the MPI standard, is not a synchronous operation. The caller process returns immediately after it sends the data to all of its children.
MPI-NP 11+ implementation of MPLBcast is loosely synchronous. As described in Section 3.3.5, by loose synchronization we mean that only after the MPLBcast is finished on the NIC-level of one node, the CMRSs of the children are reusable for a subsequent collective operation. • The loose synchronization of MPLBcast does not necessarily cause more host overhead. The host can return from MPLBcast before the operation is finished at the NIC-level. Especially when collective operations are interwoven with point-to-point operations, the host overhead of MPLBcast can be reduced. This is discussed in Section 4.4. • One CMRS per channel has the advantage of simplified control. The receive 34  side N I C does not need to make a decision about which C M R S to use.  3.3.3  MPI_Barrier  Overview M P L B a r r i e r is an important operation in M P I . The Barrier synchronization itself is used not only by the message passing system, it is supported in almost all parallel and distributed environments. The MPI_Barrier operation in an M P I application is called by all processes in the specified communicator. The MPI_Barrier() returns only when all the other processes have reached the function invocation point. Thus a synchronization line exists across all the processes. No process can advance until all processes have reached the barrier. Therefore the efficiency of MPI_Barrier() directly affects the performance of an M P I application, since no further computation can be performed until after the barrier. Moreover, the efficiency of MPI_Barrier() affects the granularity of M P I applications. The lower latency of MPLBarrier() will better support finer grained applications. Considering the implementation of MPI_Barrier(), it is obvious that messages need be exchanged among all participating processes to reach the barrier point. Different algorithms can be used to control the exchange of messages. Pairwise Exchange (PE) and Gather-and-Broadcast (GB) [3] are two frequently used algorithms. In a system with a programmable network processor, the algorithm can be executed by the host or the N I C . For the host-based implementation, as shown in Figure 3.2, the messages are delivered to the host process and then transferred to the N I C to be sent out or forwarded to other nodes.  35  However, for the N I C -  based implementation, the intermediate messages do not need to be uploaded and downloaded between the host and the NIC. In L A M / M P I 6.1, MPI_Barrier() is implemented using the GB algorithm with a binomial tree structure. L A M uses standard MPI point-to-point communication operations to forward messages. The MPLBarrier of MPI-NP II is the same as L A M / M P I 6.1, except each point-to-point communication is executed over Myrinet. Algorithm  MPI-NP 11+ implements MPI_Barrier() on the NIC level. The MCP controls the exchange of messages and the messages do not need to be received by the host level in order to forward them. The Gather-and-Broadcast method with a spanning tree structure is used. The default spanning tree used is a binary tree, but any spanning tree structure is supported by the MCP. Each MCP only stores a local view of the tree, which makes it possible to support arbitrary tree structures. A n example of Gather and Broadcast using 8 nodes is shown in Figure 3.3.  Phase 2 : Broadcast  Phase 1 : Gather  Figure 3.3: Gather and Broadcast in a Binary Structure Messages are exchanged in two phases. 
In the Gather phase, the messages 36  start off from the leaf nodes and are merged when they are passed up the branches. In the Broadcast phase, messages start off from the root node and are replicated and sent to each of the child nodes. The shift from Gather to Broadcast occurs when the root node has collected the messages from all its child nodes. From the viewpoint of each single node, the operation can also be divided into two phases. This is recorded and controlled by the value of the status field in the C O C E . For an intermediate node in phase 1, it waits for the messages from its children, and then merges them and forwards a new message to its parent node; in phase 2, it waits for the message from its parent node, and then forwards the message to each of its children. The root node is different in phase 2 since it has no parent. The leaf node is different in phase 1 since it has no child. The processing for MPI_Barrier() on the intermediate node is simple because no computation is involved, and the messages sent are all zero byte messages.  H o s t - N I C Interface M P I - N P 11+ provides a new R P I function myr_barrier_bin() to support the N I C based MPI_Barrier. The myr_barrier_bin() employs a binary tree structure for collecting and broadcasting messages. Function myr_barrierJbin() builds the tree structure and determines the position of each process in the tree. In m y r _ b a r r i e r . b i n the tree is built according to the ranks of all the processes in the communicator. The process ranked 0 is the root of the tree. The process ranked k has parent [(k — l ) / 2 j , first child 2k + 1 and second child 2k + 2, only if these numbers do not exceed the range 0..size — 1, where size is the size of the communicator. If the rank of a process is known, the channel connecting to this process can be obtained from its _proc structure.  37  Once the spanning tree is constructed, the NIC then executes the barrier algorithm. The barrier operation interface between the host and NIC includes two functions: MPINP_Coll_CommStart () and MPINP_Coll_CommCheck(). The two functions are not only used by MPIJBarrier(), they are the host-NIC interfaces for all three collective operations implemented by MPI-NP II+. MPINP_Coll_CommStart () is used to start the collective operation on the NIC. MPINP_Coll_CommCheck() is used to wait for the completion of the collective operation on the NIC. Although myr_barrier_bin() uses a binary tree, any spanning tree structure is supported by the two functions and the MCP. In order to support other forwarding trees, only myr_barrier_bin() needs to be changed. The definition of MPINP_Coll_CommStart () and MPINP_Coll_CommCheck()  are as follows: int MPINP_Coll_CommStart(int c i d , int type, int parent_chan, int children_chan[], int num_children,  int length, char *data)  int MPINP_Coll_CommCheck(int length, char *data)  Here cid is the context id of the user communicator. The parameter type is the type of the collective operation. For the three collective operation, there are three different constant values representing them. The parent_chan and children_chan contain the indexes of the channels connecting to the parent and child nodes. The num_children indicates the number of children. The length and data parameters of MPI_Coll_CommStart () are the length and address of the data to be transferred. The length and data of MPINP_Coll_CommCheck() are the length and address of  the receive buffer. For MPI_Barrier, length and data are always zero. 
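As an illustration of how the host side drives this interface, the sketch below builds the binary tree from the ranks exactly as described above and then starts and completes the operation through the two functions. It is only a sketch: chan_of(), the COLL_BARRIER constant, and the value -1 as a "no parent" marker for the root are assumptions standing in for the real channel lookup through the _proc structure and the actual constants; the retry loops follow the behaviour described in the next paragraph.

    /* Sketch of a host-side barrier over the interface above. */
    #include <stddef.h>

    extern int MPINP_Coll_CommStart(int cid, int type, int parent_chan,
                                    int children_chan[], int num_children,
                                    int length, char *data);
    extern int MPINP_Coll_CommCheck(int length, char *data);
    extern int chan_of(int rank);        /* assumed: channel index to a rank */
    #define COLL_BARRIER 1               /* assumed operation-type value     */

    int barrier_sketch(int cid, int rank, int size)
    {
        int children[2];
        int num_children = 0;
        int parent_chan = (rank == 0) ? -1 : chan_of((rank - 1) / 2);

        if (2 * rank + 1 < size) children[num_children++] = chan_of(2 * rank + 1);
        if (2 * rank + 2 < size) children[num_children++] = chan_of(2 * rank + 2);

        /* Start the NIC-level gather-and-broadcast; retry while the previous
         * collective operation of this process is still active on the NIC. */
        while (MPINP_Coll_CommStart(cid, COLL_BARRIER, parent_chan,
                                    children, num_children, 0, NULL) <= 0)
            ;

        /* Poll until the MCP reports that phase 2 has finished on this node. */
        while (MPINP_Coll_CommCheck(0, NULL) <= 0)
            ;
        return 0;
    }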
MPINP_Coll_CommStart executes its function by writing all the given information to the COCE structure of the caller process and setting the status field of COCE 38  to COLL JNIT. The MCP control loop checks the statusfieldsin all of the processes' COCEs. When a COLL JNIT is. found, the MCP starts the Gather-and-Broadcast process described in Section 3.3.3. Before setting up the COCE information, the MPINP_Coll_Commstart  always checks the statusfieldof COCE. If the status is not  MCPJJONE, MPINP_Coll_CommStart MCPJJONE,  simply returns with a value -1. If status is not  then the last collective operation by this process is still not finished, so  the operation has to be blocked temporarily. The myr_barrier_bin() repeatedly calls MPINP_Coll_CommStart () until it returns with a positive value. After that, the MPINP_Coll_CommCheck() is called repeatedly until a positive value is returned. What MPINP_Coll_CommCheck() does is similar to the other receive functions in the Host Interface Library. It polls on a state flag on the DMA or NIC memory to check if the operation is finished. The state flag is set when the MCP has finished phase 2 for the node. As an optimization, in MPI-NP 11+, for a small message, the flag is set on the NIC memory (shared by host); for a large message, the flag is set on host DMA memory. Of course, for MPLBarrier, the flag is set on the NIC memory because all messages in MPLBarrier have zero byte data. For a small message, if the data length is not zero, MPINP_Coll_CommCheck also copies the received data from the host D M A memory to the user's receive buffer. For a large message, when the state is set, the data has already been uploaded to the user's buffer by the MCP's DMA operation.  39  3.3.4  Multiple Communicators  Generating Communicator Context Group communication is an important function in parallel and distributed systems. In order to separate the communication in one group from the communication in another group, MPI uses the concept of a communicator. An MPI communicator is composed of a group of processes and a context id (MPI defines two types of communications, communication within a group and a restricted form of inter-group communication. In this thesis, we only consider intra-group communication). Each process in the group is uniquely assigned a rank, which is an integer in the range of 0..n — 1, where n is the number of total processes in the group. The context id is an integer that identifies the communicator. In MPI, MPI_Comm_create() is used to create a new communicator. This function takes an original communicator and some subset of the processes belonging to the original communicator as input and outputs a new communicator consisting of those processes. A basic operation inside MPLComm_create() is to generate a context id for the new communicator. This operation is global to the group as it requires the agreement of all processes in the defined group. To do this, L A M maintains a static variable topid in every process. To create this unique group context id, L A M uses the maximum value cid of the topids of all processes in the group as the context id. On return, after the new communicator is created, the topid values of all the processes are incremented to cid+1. As a result of this method, we are assured that whenever two communicators contain the same process, the context ids of them are different. This is necessary to guarantee the semantics of the MPI communicator. 
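In outline, the rule just described amounts to the small sketch below; group_max() is a placeholder for whatever reduction computes the group-wide maximum, and the initial topid value is assumed. The mechanisms actually used for the reduction are discussed next.

    /* Sketch of the context-id rule; group_max() stands in for the
     * group-wide maximum reduction (MPI_Allreduce in LAM, or the NIC-level
     * myr_getcid_bin() described below). */
    extern int group_max(int value);

    static int topid = 1;            /* per-process counter; initial value assumed */

    int next_context_id(void)
    {
        int cid = group_max(topid);  /* all members agree on the largest topid */
        topid = cid + 1;             /* every member moves past the new cid    */
        return cid;
    }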
Currently, in L A M and MPI-NP II, MPI_Allreduce is called to generate the maximum value  40  of the topids. MPI_Allreduce() works by calling the point-to-point communication routines on the host level. As with MPLBarrier, there is the potential to improve the performance of this operation. Generating a new context id is essentially the same operation as MPI_Barrier. NPI-NP 11+ uses a NIC-level distributed algorithm to get the maximum topid. A new interface function myr_getcid_bin is used to replace MPI_Allreduce(). The other parts of MPI_Comm_Create() are preserved since they involve local operations. Each process when calling myr_getcid_bin() uses its topid as input, and the returned value is the maximum topid from all processes in the group. The implementation of myr_getcid_bin is similar to myr_barrier_bin(). It builds the binary tree using the same method and calls MPINP_Coll_CommStart () to start the collective operation on the MCP. After which, MPINP_Coll_CommCheck() is called to wait for the completion of the operation. The only difference is that myr_barrier_bin() has no data input. The execution of the algorithm on the NIC is also similar to that of MPLBarrier. A small difference is that when an intermediate node get the messages from all of its children, the maximum value of the topids will be selected from the payload of these messages, and it is passed to the parent node as the new message data. When the Gathering phase is finished, the root broadcasts the maximum value down the tree to the other nodes.  M u l t i p l e Communicators Support Multiple communicator communication also needs to be supported by the channel structures.  This is orthogonal to the issue of whether or not there is collective  communication. The following changes on MPI-NP II were necessary to support  41  multiple groups. • A context field is added to the MPLenvelope. Every message to be transmitted or received by the N I C now corresponds to some communicator. • The M C P message matching process is modified so that it also includes the comparison between the context of the message and the context of the receive request. • The channel has no correspondence to a communicator. It connects two processes from a global view. A l l messages between the two processes will use this channel, no matter what communicator they belong to. The rank ids of the two processes used to identify a channel are from the M P I _ C O M M _ W O R L D communicator. According to M P I , all processes are members of M P I _ C O M M _ W O R L D .  3.3.5  MPI_Bcast  Problem M P I - N P 11+ uses the C M R S and the C O C E to buffer and control collective messaging on the N I C . There is only one such C M R S per channel structure and as a result on a particular set of channels only one collective communication can be active. A process has only one C O C E , so the C O C E can only be used by one collective operation at any time. The MPI_Barrier and myr_getcid_bin are by themselves synchronous operations. For these two operations, when a host process returns, its C O C E and all used C M R S s are freed. This is possible because of their synchronous characteristics. For all nodes, when the host process returns from the operation, processes at other nodes either have returned or will return very shortly from the operation.  
42  More specifically: when a call on a node is returned, the node must have already been in phase 2, so the NIC of this node must have already received the message from its parent node, and have sent this message out to all its child nodes. Since Myrinet's point-to-point transfer is assumed to be reliable, and for the above two operations, the message receiving and forwarding are done in one single step by the MCP, the nodes in the sub-tree of this node will return from the operation shortly. The synchronous characteristics of the operation ensure that, even when using the same collective microchannel, the message from a new operation will not overtake the message from an old operation. The situation with MPI_Bcast is not as simple.  In systems like L A M ,  MPIJBcast is implemented as an asynchronous operation. The host call of MPLBcast at the root returns once the message is sent to all its children. An intermediate node returns once the message is received and forwarded to its children. In the context of MPI-NP II+, implementing MPLBcast in this way may cause problems when two consecutive MPLBcasts involve the same channels. The reasons are as follows: • If two consecutive MPI_Bcasts use the same channel, it is possible that when a message is sent by the second MPLBcast from the source node, the CMRS at the destination node is still occupied by a message of the first MPI_Bcast. Therefore, the second MPI_Bcast may overtake the first MPIJ3cast. • In MPI-NP 11+, the transfer of a large message has two phases, first the envelope is delivered, and the data is only sent after the envelope is matched on the receiver side. As a result, at an intermediate node of a forwarding tree, large messages cannot be received and sent in one step. If messages by the first MPLBcast are large messages, when a message from the second MPLBcast arrives at the destination node, the destination node has no way to empty the 43  CMRS because it is blocked waiting to deliver the data from the first message. We could implement MPLBcast as a completely synchronous operation as MPLBarrier and myr_getcid_bin, but this increases its latency. Therefore, for an asynchronous MPLBcast, we must take some measures to prevent one broadcast from overtaking a later broadcast. Solution  The binary tree structure is still used to forward messages for the NIC-based MPLBcast. The host interface is similar to MPLBarrier, and an RPI function myr_bcast_bin() is added. The MPINP_Coll_CommStart () and MPINP_Coll_CommCheck() are called  by this function to start the operation on the NIC and wait for it to complete. myr_bcast_bin() builds the binary tree using the same method as myr_barrier_bin().  One small difference is that the rank of the root node is pre-assigned. Therefore to determine the positions of nodes in the tree, relative rank is used. The relative rank is the modulo n of the rank of the process subtracted by the rank of the root node, where n is the size of the communicator. For all nodes, the NIC-level operation can still be divided into two phases: in phase 1, the node sends/forwards broadcast data to its child nodes; and in phase 2, the node collects acknowledgements from its children. More specifically, the process is as follows: • For the root node. In phase 1, the MCP transfers the broadcast message to all its children. 
When both the envelope and payload of the message have been sent to all the destinations, the MCP sets a status flag to make the host process return from myr_bcast_bin(), and the root shifts to phase 2. In phase 2, the MCP waits for the acknowledgements from its child nodes. When the 44  acknowledgements are collected from all of its children, the node frees the C O C E , and the operation at the root is finished. • For intermediate nodes. In phase 1, the M C P of an intermediate node first waits for the message from its parent, and then forwards the message to all its children. When the message has been sent to all of its children, the node sends an acknowledgement to its parent and set the status flag to make the host process return from myr_bcast_bin(), and the node enters phase 2. In phase 2, the M C P waits for acknowledgements from its children. When the acknowledgements are collected from all of its children. The node frees the C O C E , and the operation at the node is finished. • For leaf nodes. In phase 1, the M C P simply waits for the message from its parent.  When the whole message is received, the M C P sends an acknowl-  edgement to its parent and sets a status flag to make the host return from myr_bcast_bin().  Then the node frees C O C E and the operation is finished  at this node. Leaf nodes have only one phase.  • When a node has received the broadcast data from its parent and has delivered it to the host buffer, the node clears the flag in the C M R S of this channel, and deallocates the collective microchannel. The  single transfer for a small or large message during the collective operation  is similar to that of point-to-point communication. The only difference is that collective operations use the special collective micro channels. The  time between the host returns from myr_bcast_bin() and when the  C O C E is freed is important. The host process returns from myr_bcast_bin() only when the node has sent the data to all of its children. The C O C E is freed when the  45  node has received the acknowledgements from all of its children at the end of phase 2. When a C O C E is freed, the C M R S s at all of the children are reusable. The release of the C O C E occurs later than the host's return from myr_bcast_bin The early return from myr_bcast_bin() can reduce the overhead of M P L B c a s t , especially when a subsequent operation is a point-to-point operation. This is shown later in Section 4.4.  The late release of the C O C E can prevent the subsequent  collective operation from being started on the N I C level. The overtaking problem is avoided. The late release of the C O C E is the loose synchronization aspect of MPLBcast.  3.3.6  Concurrent Operations and Deadlock  Deadlock is an important problem that must be addressed by a number of systems (e.g, [1]) in implementing NIC-level collective communication. Deadlock can occur when there are multiple collective operations executing concurrently on the NIC and there exists collision in using the receive buffer. In M P I - N P 11+, concurrent collective operations can occur on the N I C since multiple processes are supported on each N I C and each process can belong to multiple communicators at the same time. M P I - N P 11+ avoids deadlock for the following reasons: • The channel architecture. The design of M P I - N P II channel structure fixed the division of buffer space. As a result, messages in different channels have no affect on each other. 
Messages from point-to-point operations and collective operations always use separate microchannels, so they cannot affect each other. A channel is uniquely specified by a pair of processes, so messages passed between different process pairs also do not affect each other.

• Serialization of collective operations. From Section 3.3.3, we can see that by serializing access to the COCE structure in MPINP_Coll_CommStart(), collective operations are serialized at the NIC level, and each process can execute only one collective operation at any given time.

• Avoiding overtaking of operations. From Section 3.3.5, we can see that once collective operations are serialized, the problem of one message overtaking another is avoided by imposing loose synchronization on MPI_Bcast.

To better understand how the above design avoids deadlock, consider the situation in which two MPI_Bcast operations (namely, B1 and B2) execute concurrently from two communicators, and some processes belong to both groups. Two messages passing through different branches of the spanning trees of B1 and B2 cannot collide, because they use different channels. Two messages passing through the same branch of the two spanning trees also cannot collide, because the second message is sent from the source node (process) only after the first MPI_Bcast has completed on that node, and by then the receive buffer (the CMRS) at the destination node must be available. Therefore, there is no collision in the use of the receive buffer. Both B1 and B2 can progress independently and no deadlock will occur.

3.4 Multiple Local Processes

3.4.1 Process Synchronization

The communication in MPI-NP II+ is user-mode communication [2] [6]. Each user process maps the NIC memory into its private address space and accesses the NIC area directly in the Host Interface Library. Therefore, when multiple processes are to be supported on one machine, the NIC memory is shared by at least three running threads, including the MCP, and some kind of protection is necessary to mediate access to the shared data structures. Semaphores and spin-locks can both be used for synchronization among several host processes. However, semaphores are not used since they require a kernel operation, which should be avoided in the critical communication path of user-mode communication. As a result, MPI-NP II+ uses spin-locks. The situation on the NIC is more complicated since it is not possible to use a spin-lock there. In MPI-NP II, where there is only one host process and the MCP, the following synchronization method using shared variables can be used between the host process and the MCP. The method uses the four P()/V() functions shown in Figure 3.4.

    HOST_P()
    {
        lock_host = 1;
        lock_used = 0;
        while (lock_nic && !lock_used)
            ;
    }

    NIC_P()
    {
        lock_nic = 1;
        lock_used = 1;
        while (lock_host && lock_used)
            ;
    }

    HOST_V()
    {
        lock_host = 0;
    }

    NIC_V()
    {
        lock_nic = 0;
    }

Figure 3.4: P/V Lock Method using Shared Variables

There are P() and V() functions for both the host process and the MCP. The functions operate on three shared variables: lock_host, lock_used and lock_nic. This locking mechanism can only be used by two threads. Therefore, in MPI-NP II+, to support multiple host processes, the spin-lock and the P/V lock method are combined to perform the synchronization. Another problem to consider is the granularity of a parallel operation.
In MPI-NP II+, different processes may only access one specific part of the shared memory, which is a result of the channel and microchannel design. Instead of locking the entire structure, it would be better to individually lock only those memory locations that need to be synchronized. This increases the amount of concurrency and minimizes unnecessary synchronizations. The locking strategy is described in the following: • Process attached data structure: on the NIC, each MPI process has its own data structure, and it is accessed only by that particular process. Therefore when this process accesses data, other host processes should not be prevented from accessing their own regions. MPI-NP 11+ uses a P / V lock array called proc_locks []. Each entry proc_lock[i] has three variables: proc_lock[z] . lockjiost, proc_lock [?] . lock_nic and proc_lock [i] .lock_used. The proc_lockH is used for the synchronization between process i and the MCP through the P ( ) / V ( ) operations. • Channel attached data structure: each channel is used by only one MPI process on the machine and again channel accesses should be able to be concurrent. MPI-NP 11+ uses a P / V lock structure array called chan_locks [] for this purpose. The chan_locks [i] is the lock of channel i, and it is used to synchronize the access to channel i between the M C P and a host process. • Globally shared data structures: at any time, these data structures may be accessed by any process or the MCP. The control message queue is an example  49  of such a structure. The spin-lock array glob_spin_locks [] and P / V lock array glob_pv_locks [] are used for these structures. The glob_spin_locks U] and glob_pv_locks [i] stand for ith structure. Whenever a process wants to access global structure i, it first uses spin lock glob_spin_locks [«] to prevent other processes from entering the critical region, then it uses P() operation on glob_pv_lock[i] to synchronize with the MCP. • In some special situations, the access of a global data structure is inherently synchronized between the host process and the M C P because of the control logic between the Host interface library and the MCP. For these data structures, only a spin lock is needed to synchronize different host processes. 3.4.2  Loopback Communication  MPI-NP 11+ calls the communication between two local processes loopback communication because the implementation is similar to TCP/IP's loopback interface. As in T C P / I P loopback, the message from the source process goes down the protocol stack to the link level and then backs up the protocol stack to the destination process. The communication between the two local processes emulates the communication between two remote processes. The host interface library is identical as communication between remote processes. The difference occurs at the NIC level. A channel is still used and it is set up in the initialization phase. Except in this case both sides of the channel are located on the same NIC. Every channel structure has an attribute remote indicating the id of the machine at the other side of the channel. At the NIC, Whenever a message envelope is to be sent through a channel, remote is compared with the id of the local machine. If they are same, then the  50  destination process is a local process, and local send and receive operations are called. 
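The dispatch just described might be sketched as follows; the structure fields other than remote, the local_machine_id variable, and the two delivery helpers are illustrative names rather than the actual MCP code.

    /* Sketch of the local-versus-remote dispatch on a channel's remote field. */
    struct channel {
        int remote;              /* id of the machine at the other end */
        /* ... send/receive slots, CMRS, and other per-channel state ... */
    };

    extern int  local_machine_id;
    extern void deliver_locally(struct channel *ch, void *envelope);
    extern void send_over_network(struct channel *ch, void *envelope);

    void mcp_send_envelope(struct channel *ch, void *envelope)
    {
        if (ch->remote == local_machine_id) {
            /* Loopback: both endpoints sit on this NIC; copy slot-to-slot for
             * small messages or stage large ones through NIC memory with host
             * DMA, as described below. */
            deliver_locally(ch, envelope);
        } else {
            /* Remote destination: inject the envelope into the Myrinet. */
            send_over_network(ch, envelope);
        }
    }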
For small messages, the data is directly copied together with the envelope from the send slot in the sender side channel structure to the receive slot in the receiver side channel structure. For large messages, the host DMA send/receive operations are needed and a buffer on the NIC is used as a staging area. Unlike the case of a remote send, the host-NIC DMA engine now has to perform both the download and upload. If the size of the message is larger than the NIC stage, multiple download and upload DMAs are required.  51  Chapter 4  Evaluation MPI-NP 11+ was tested on an eight node cluster of PC's connected by a Myrinet network. Four of these nodes are 266MHz Pentium lis with 256MB R A M , and the remaining four nodes are 552MHz Pentium Ills with 512MB R A M . All of the Myrinet network interface cards have an LANAI 4.x processor. The operating systems for all nodes are Linux kernel 2.4.x (some 2.4.2 and some 2.4.9). Time was measured using the RDTSC [9] instruction, which is defined in /usr/include/asm/msr .h. MPI-NP 11+ added several new functions. Each of these functions were tested in isolation, using appropriate techniques which show the benefit of that particular function. In addition, one important test was to measure the impact of the changes on the original performance of MPI-NP II. We measured the Bandwidth and End-to-End latency of unicast, and compared it to the original result. Also we measured the overhead of Any_Source message matching to evaluate the computation overload on the network processor and to evaluate whether or not it would scale with the number of nodes. Third, we tested the two synchronous collective operations, MPLBarrier() and MPI_Comm_Create(). We measured the average host overhead, including all the 52  nodes in the operation, and compared the NIC-based implementation with the hostbased implementation. Note that in the host-based implementation, the messages are still sent over the Myrinet. The difference is that, for host-based implementation, the collective message forwarding is started by the host level. In order to give a reasonable comparison, for MPLBarrier, MPLComm_Create, and MPI_Bcast, the host-based implementations all use the same binary tree structure as their NICbased implementations to forward messages. The tree is different from the original L A M code, which uses a binomial tree forwarding structure. Finally, the latency and computation node overhead of MPI_Bcast were measured at the root node, and were also compared with the host-based implementation. For MPI_Bcast, the host-based control process is different from the NIC-based implementation because our NIC-based implementation uses loose synchronization.  4.1  Bandwidth and End-to-End Latency of Unicast  Table 4.1 shows the end-to-end latency and bandwidth of unicast. The result is obtained by using a ping-pong program identical to that used to evaluate MPINP II. We tested various message sizes and the reported time in microseconds is the median of 256 measurements. The minimum latency for MPI-NP 11+ is 42 usee, and the maximum bandwidth is 89Mbytes/sec. The same measurements for MPI-NP II are 22 usee and 94Mbytes/sec, respectively. The result shows that our addition of new functions does not unduly increase the overhead of unicast message passing. The conclusion is more correct with larger messages, because with the larger message the time of delivering the data bytes of the message takes a larger portion of the total end-to-end latency than with the smaller message. 
This is also the reason that with a larger size, the end-to-end throughput is bigger. 53  Data (bytes) 0 4 16 64 128 256 512 516 640 768  Latency (us)  42 44 46 49 50 59 72 89 90 92  Throughput (bytes/sec) 0 89888 344086 1306122 2560000 4302521 7111111 5765363 7111111 8347826  Data (bytes) lk 2k 4k 8k 16k 32k 64k 128k 256k 384k  Latency (us) 95 114 133 180 269 449 809 1529 2970 4414  Throughput (bytes/sec) 10722513 17964912 30681648 45511111 60907063 72898776 81008653 85695979 88263973 89073734  Table 4.1: End-to-End Performance of Unicast The larger latency by MPI-NP 11+ is caused by some extra operations added to the critical path for unicast. These extra operations include: channel locking by the sender host when allocating a free send slot, checking if the message destination is local or remote by the sender NIC, determining if there are Any .Source requests at the receiver NIC, channel locking by the receiver NIC when searching the unmatched message list, and determining if the message is matched by an Any .Source request. The channel locking at the receiver NIC when searching the unmatched message list can be removed from the critical path because currently only the M C P operates on the unmatched message list. However, the other operations are all necessary in our MPI-NP 11+ implementation. Table 4.1 shows that the throughput for 516 bytes is worse than that of 512. In our implementation, 512 is the maximum size of a small message. For a small message, the message is downloaded from the user buffer using an PIO (Programmed I/O) copy; for a large message, host-to-NIC DMA is used. A message size of 512 bytes is not optimal for deciding between using PIO and DMA. We can increase the  54  maximum size of small messages to a point modestly larger than 512 bytes. The throughput of the messages, which have a length between 512 bytes and this size, can be modestly improved.  4.2  Performance with Any-Source Matching  It is difficult to measure the performance with Any .Source matching because in general, it depends on the application program and the extent to which Any-Source is used. Any-Source was evaluated by considering a specific case and measuring the overhead of MPLRecv with Any_Source. In the test program, n processes are started, and the last n — 1 processes iteratively send zero byte messages with tag Recv_tag to process 0. Process 0 receives these messages using MPI_Recv with Any_Source. We measured the overhead of MPLRecv at process 0. In total, we iteratively executed 512 experiments and the median value is used as a final result. After each Any .Source matching on the receiver side, we synchronized the system to ensure that each matching was independent. Thus, every time process 0 posts an Any_Source receive request using MPLRecv, the matching message has actually already arrived at its side. Figure 4.1 shows the communication and synchronization between process 0 and process n — 1. Moreover, before the 512 (TIMES=512) iteration is executed, all other processes (from 1 to n — 2) send chan_width messages with tag Other_tag to process 0, and process n — 1 sends chan_width-2 messages to process 0. Variable chan_width is the number of receive slots available in every channel. As a result of sending these messages, all channels between process 0 and processes 1 to n — 2 are filled with messages, and the channel between process 0 and process n — 1 are left with only two receive slots. 
These two slots are used for holding the control message 55  and the zero byte data message in the test. process_0() { for (i=0; i < TIMES; i++) { // receive control message from process n-1 MPI_Recv(recv_buf, 0 , MPIJNT, n-1, Ctrl_tag, MPI_Comm_World, &status); rdtscl(start_time); // receive zero byte data from Any_Source MPI_Recv(recv_buf, 0, MPI_INT, MPI_ANY_SOURCE, Recv_tag, MPI_Comm_World, &status); rdtscl(end_time); // send back a control message to process n-1 MPI_Send(Send_buf, 0, MPIJNT, n-1, Ctrl_tag, MPI_Comm_World); time = (end_time - start_time) /266; records[i] = time; }  process_n-l() for (i=0; i< TIMES; i++) { // send zero byte data to process 0 MPI_Send(send_buf, 0, MPIJNT, 0, Recv_tag, MPLComm_World); // send control message to process 0 MPI_Send(send_buf, 0, MPIJNT, 0, Ctrl_tag, MPI_Comm_World); // receive control message from process 0 MPLRecv(recvj5uf, 0, MPIJNT, 0, Ctrl_tag, MPLCommJJVorld, &status);  Figure 4.1: Communication and Synchronization Between Process 0 and n — 1 The final test result is that, for any n = 2..8, and for chan_width=8 or 16, the overhead of MPLRecv using Any-Source is 44 usee. This shows that our Any .Source matching is reasonable in that the increase of message sources and the channel queue length does not increase the overhead of Any-Source matching. If the Any-Source matching was done by simply searching the messages on all the channels, the overhead is not acceptable. The overhead can be determined by another experiment: if process 0 receives the zero byte data message explicitly indi-  56  eating the source as n — 1, the overhead of MPLRecv is 38 usee for chan_width=8, and 46 usee for chan_width=16. The result implies that the time for checking 8 additional messages is 8 usee, a large overhead.  4.3  4.3.1  P e r f o r m a n c e of M P I J B a r r i e r and  MPI_Comm_Create  MPI_Barrier  The performance of our NIC-based barrier routine was evaluated by using a test program to perform 1000 consecutive MPLBarrier calls. We reported the average host overhead of each MPLbarrier at all nodes. The experiments were performed for both NIC-based and host-based implementation over Myrinet. Table 4.2 shows the overhead data for both NIC-based and host-based implementation with 2 to 8 nodes. Notice that since MPLBarrier is a synchronous operation, the host overhead is an indication of the time required to synchronize all the nodes of the cluster. On 8 nodes, the host overhead of a NIC-based barrier is 123 usee compared to 506 usee for the host-based barrier. Number of nodes 2 3 4 5 6 7 8  NIC-based (us) 31 45 50 78 90 114 123  Host-based (us) 89 147 209 327 366 426 506  Table 4.2: Average Host Overhead of MPLBarrier Our results show that the NIC-based implementation is uniformly better and, for any number of nodes, the overhead of the host-based barrier is approxi57  mately 3 to 4 times worse than that of the NIC-based barrier. Figure 4.2 shows the improvement factor for the NIC-based barrier over host-based implementation with various numbers of nodes. The improvement factor increases in general with the depth of the tree, showing that the NIC-based barrier scales better than the host-based barrier. In addition, with our 8 nodes, the MPI level barrier shows an improvement of 4.1, which is far better than the same MPI level result obtained by [5], which was 2.2. 
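For reference, the per-node measurement loop behind these numbers might look roughly like the sketch below. It follows the timing style of Figure 4.1, with rdtscl() from asm/msr.h and the 266 MHz clock used to convert cycles to microseconds; it reports a simple per-call average on one node rather than the exact per-node accounting used above, so it is a sketch of the method, not the benchmark code itself.

    /* Sketch of the barrier benchmark loop (1000 consecutive MPI_Barrier calls). */
    #include <mpi.h>
    #include <asm/msr.h>          /* rdtscl(), as noted at the start of Chapter 4 */

    #define ITERS 1000

    double barrier_overhead_us(void)
    {
        unsigned long start, end;
        double total = 0.0;

        for (int i = 0; i < ITERS; i++) {
            rdtscl(start);
            MPI_Barrier(MPI_COMM_WORLD);
            rdtscl(end);
            total += (double)(end - start) / 266.0;   /* cycles -> us at 266 MHz */
        }
        return total / ITERS;     /* average host overhead of one MPI_Barrier */
    }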
5  I  1  1  1  1  1  1  2.5 -  2  I 2  1  I  3  4  l 5 Number of Nodes  l  l  6  7  8  Figure 4.2: Improvement Factor of MPLBarrier with Different Number of Nodes  4.3.2  MPI_Comm_Create  The method used to evaluate the performance of MPLComm_Create was identical to that of MPLBarrier. Table 4.3 shows the overhead data of MPLComm.Create for both NIC-based and host-based implementation for 2 to 8 nodes. Figure 4.3 shows the improvement factors. Similarly, NIC-based MPLComm.Create performs better 58  than the host-based implementation. For 8 nodes, the overhead of the NIC-based implementation is 404 usee, whereas the overhead of the host-based implementation is 817 usee. Also the improvement factor increases with the depth of the binary tree and therefore the NIC-based scheme scales better than the host-based scheme. In comparison with Table 4.2, for the same number of nodes, MPI_Comm_Create has a larger overhead than MPLBarrier. The main reason for this overhead is because in addition to the same synchronization process as MPLBarrier, MPI_Comm_Create has to perform more local computations such as dynamic memory allocation for the communicator structure. Number of nodes 2 3 4 5 6 7 8  NIC-based (us) 244 256 263 322 359 384 404  Host-based (us) 344 396 468 604 670 724 817  Table 4.3: Average Host Overhead of MPLComm.Create  4.4 4.4.1  Performance with MPI_Bcast() Broadcast Latency  The latency for a tree broadcast is the time required for a message originated at the root to be propagated to every leaf in the broadcast tree. We measured the time it takes from the start of the broadcast to the time until node n — 1 received the message. For this benchmark, in every iteration of our test routine, node n — 1 would send back a zero byte message to the root node after receiving the message. 59  3  2.5 h o o  2h o  Q. E 1.5 h  2  3  5  4  6  7  8  Number of Nodes  Figure 4.3: Improvement Factor of MPLComm_Create with Different Number of Nodes The root node measures the duration from starting MPLBcast to the time when it receives the zero byte acknowledgement from node n — 1. The broadcast latency is obtained by subtracting the minimum unicast latency from this measured time value. Note that node n-1 is not necessarily the node that receives the message last, and it is difficult to determine which leaf actually is the last one to receive the message. However it is reasonable to measure the broadcast latency in this way because this is a comparison-based evaluation. Table 4.4 shows the result of a 8 node system comparing the NIC-based implementation with the host-based implementation. The times reported in Table 4.4 are the median of 512 repeated measurements. The results are given for various sized messages. Table 4.4 shows that for all the message sizes, the NIC-based implementation always performs better than the host-based implementation. For all the small messages (less than 512 bytes), the NIC-based implementation is 2 times 60  better than the host-based one. For large messages, the performance gain by the NIC-based implementation decreases gradually, however, for message sizes less than 16k, there is still a obvious benefit to use the NIC-based implementation. These results are explained by considering the characteristics of our implementation: • For small messages, more PIO data copying is executed by the host-based method. 
• For large messages, there are more envelopes copied and more pinning operations executed by the host-based method; however, the same number of D M A s are done to copy the message to the N I C . For large messages it is this D M A time that dominates the overall time of the operation.  Message Size (bytes)  NIC-based (us)  Host-based  4 32 128 256 512 lk 2k  50 51 64 80 92 164 219  140 143 158 177 217 195 347  («0  Message Size (bytes)  NIC-based (us)  Host-based (us)  4k 8k 16k 32k 64k 128k 256k  292 423 761 1440 2808 5502 10927  407 545 864 1446 2812 5597 11216  Table 4.4: Broadcast Latency of M P L B c a s t with 8 Nodes Figure 4.4 shows the Improvement Factor of the broadcast latency of NICbased MPI_Bcast over host-based M P L B c a s t with various system sizes (2-8 nodes). Two sizes of small messages and two sizes of large messages were tested. As Figure 4.4 shows, for all four message sizes, the improvement factor increases with the size of the system. As a result, we conclude that NIC-based M P L B c a s t scales better than host-based MPI_Bcast.  61  3  64 512 2k 8k  bytes bytes —X— bytes — bytes E J -  2.5  o o  (0  •B-  2  3  4  5  B'  6  •a.  •4 7  8  Number of Nodes  Figure 4.4: Improvement Factor of MPLBcast with Different Number of Nodes 4.4.2  Host Overhead  The host overhead of MPLBcast on a node is the duration from starting MPLBcast to the return from this call. In Chapter 3, we mentioned that the MPLBcast is serial on the NIC level. As a result, the operation on the host has to wait until the former MPLBcast is completed on the NIC. This design decision has an affect on the host overhead of MPLBcast. To evaluate the affect, in this section we measured the overhead of MPLBcast at the root node in two situations: • In the first case, several MPI_Bcasts are called consecutively. • In the second case, MPLBcast is also executed repeatedly, but a unicast operation is inserted between the two calls. Table 4.5 shows the result of the experiment comparing the NIC-based im62  plementation with the host-based one. For this experiment, the system size is 7 nodes. We choose 7 because there is no difference in the root's MPLBcast overhead between 7 or more nodes, since 7 nodes is sufficient for a full binary tree of depth 3, and a larger tree depth has no effect on the overhead of the root node. The data reported in Table 4.5 is the median of 512 measurements. Message Size (bytes) 8 32 128 512 lk 2k 4k 8k 16k  NIC-based (case 1) (us) 60 61 68 97 150 186 285 465 797  NIC-based (case 2) («s) 18 18 24 45 86 103 155 250 454  Host-based (us) 19 20 26 46 178 195 251 343 520  Table 4.5: Host Overhead at Root Node of MPLBcast With a 7 Nodes System In case one, Table 4.5 shows that for most message sizes there is more overhead with the NIC-based implementation in comparison to the host-based one. However, in the second case, Table 4.5 shows that for most message sizes, the NIC-based implementation performs better than the host-based one. This result shows that although the NIC-based implementation adds an additional synchronization to MPI_Bcast, it can still result in less overhead when there are not consecutive collective operations. The case of non-consecutive collective operations is more frequent because one could, if necessary, package the data into a single operation. Thus, for the more interesting case, our implementation results in lower host overhead. 
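The two measurement patterns compared in Table 4.5 can be sketched as follows; the use of MPI_Send as the inserted unicast operation, rank 0 as the root, and MPI_CHAR as the datatype are assumptions made for illustration, and only the MPI_Bcast calls are timed.

    /* Sketch of the two host-overhead cases at the root (512 timed iterations). */
    #include <mpi.h>

    #define REPS 512

    void case1_consecutive(char *buf, int len)
    {
        for (int i = 0; i < REPS; i++)
            MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);       /* timed */
    }

    void case2_interleaved(char *buf, int len, int peer)
    {
        for (int i = 0; i < REPS; i++) {
            MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);       /* timed */
            /* untimed point-to-point operation between the two broadcasts */
            MPI_Send(buf, len, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }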
In addition, Table 4.5 shows that with message sizes of lk and 2k, the over63  head of NIC-based implementation is comparable to the host-based implementation. For the two message sizes, the overhead of case 2 is only about half that of hostbased, and even the overhead of case 1 is less than that of host-based implementation. This result occurs because for large messages, the host-based implementation has more memory-pinning operations than the NIC-based implementation, and with a smaller large message, the overhead of memory-pinning dominates the overheads.  4.5  Performance of Loopback Communication  We made an attempt to measure the performance of communication between two local MPI processes. However, since there is not an SMP machine available, we can not make accurate measurements of the end-to-end communication latency and the bandwidth between two local processes. On a single processor machine, we ran the same ping-pong program used to test the latency and bandwidth of unicast in Section 4.1. We obtained a latency of 110005 usee for sending 256k bytes. This result is far worse than that of using TCP/IP, which is 21638 usee. However, when nice(40) was used to decrease the schedule priority of the sender and receiver processes, we obtained a latency of 19992 usee for 256k bytes. Clearly the difference between the two results corresponds to the scheduling of processes by the operating system. For one process to communicate with the second, there has to be a context switch. The context switch may not occur immediately for a non-blocking send operation. As well, both the sender and receiver have to poll memory waiting for the completion of its operation. Thus when the receiver and sender run on the same processor, the one side has to wait a long time to be scheduled and respond to the arriving of the message from other side, so the 64  message latency is high. On a single-processor machine, scheduling will be a problem for our loopback communication. However, although we were not able to evaluate it, to support the communication between local processes, our loopback implementation may be a reasonable solution for a SMP machine where the scheduling problem can be avoided.  65  Chapter 5  Conclusions and Future Work 5.1  Conclusions  In this thesis, we described the design, implementation, and performance of MPI-NP II+, an enhanced version of MPI-NP II. MPI-NP 11+ added new functionality to MPI-NP II while maintaining the channel-based communication architecture of the original design. It further extends the idea of a MPI-aware M C P on the network interface card. There were several essential MPI functionalities that were not supported in MPI-NP. These included support for Any_Source messages, communicators other than M P L C O M M . W O R L D , and multiple local processes in the same compute node. The performance results indicate that these functionalities can be added without unduly affecting the latency and end-to-end bandwidth of the system. The successful integration of these operations further illustrates the benefits of supporting MPI directly on the network interface card. MPI-NP 11+ also extended support on the NIC to three collection communication operations; MPI_Bcast(), MPI_Barrier(), and MPI_Comm_Create(). The im-  66  plementation clearly illustrated the advantages of performing these synchronization primitives on the N I C rather than the host. 
It simplified the interactions between the host and N I C and made it possible to exploit the relatively tightly-coupled collection of network interface cards. In all cases the NIC-based versions of the collective communication routines performed better than the host-based versions. Operators MPI_Bcast(), MPI_Barrier(), and MPI_Comm_Create() were supported by extending the concept of a microchannel by introducing a special microchannel for collective communication messages. In the case of MPI_Bcast only minor restrictions had to be imposed, while preserving the simplicity of the original design. The current design is also flexible in the sense that it can support any tree forwarding topology.  5.2  Future Work  Although M P I - N P 11+ added several of the most essential features of M P I there remain several issues. Zero-copy messaging, i.e. D M A i n g messages directly from the N I C to the user buffer requires that the user buffer be pinned to a physical memory location. In M P I - N P 11+, this is achieved by calling a pin and release kernel primitive when a message is sent or received. The added system call needed to pin pages is a source of extra overhead and increases the latency for large messages. There are a number of possible approaches to this problem. Some systems have avoided the kernel calls by only pinning and not releasing memory.  This  approach works as long as the number of user communication buffers do not consume too large a portion of the main memory. However, there are programs such as 67  numerical codes operating on large matrices which can easily result in too much of memory being pinned. An obvious solution to this problem was explored by [18] who proposed a pin-down cache to reduce the number of pin/release operations. This technique postpones the release of physical pages and can avoid re-pinning of user communication buffers while also ensuring that only a fixed proportion of memory is pinned. The challenge for any implementation of a pin-down cache is to minimize the cost of keeping it consistent. If the cache is maintained on the NIC, then there is the overhead of additional NIC/host synchronizations, which can be costly. To my knowledge, MPI-NP 11+ is the only Myrinet-based MPI implementation that considered the underlying synchronization overhead of group communication. However, our current implementation only supports the creation of intra-group communicators and not the creation of inter-group communicators.  Inter-group  communication is more restricted in MPI, for example collective communication such as anycast is not defined, and requires the creation of two communicators for send and receive in each direction. Supporting inter-group communication would require further extending the channel structure on the NIC to allow message matching with respect to the two communicators. One problem not addressed in this thesis is the structure of the optimal forwarding tree to be used for collective communication. The design does support the use of any pre-defined tree topology. But in our case, given the small size of the cluster, the topology of the tree did not have a major impact on performance. Any small depth tree topology would achieve similar performance. However, the choice of topology is itself a research topic and is important for larger or more irregular topologies. There are other opportunities for exploiting the network processor for col-  68  lective communication. Two operations that could benefit are MPI_Gather() and MPIJScatter ().. 
There is no computation required for either of these operations, but unlike broadcast, the data sizes vary as one is gathering or scattering the data. The choice of topology and how to aggregate or segregate messages is an interesting issue to explore. For small messages, it would be easy to implement these operators within the current channel structures, however, for larger messages managing the available buffer space on the N I C would be a problem. Operations such as MPI_Combine () which involve both communication and message combining based on a predefined or user-defined operator could also be supported. However, since there is computation involved it will depend to larger degree on the speed of the network processor and has the potential to slow-down other communications. User-defined combining operations are difficult to support on the N I C since it would require static or dynamic loading of code for the network processor and runs the additional risk of introducing programming errors on the N I C . Finally, it would be interesting to study the performance of M P I - N P 11+ using a collection of benchmark programs to test the overall performance of the operations and their potential interaction with each other. Our current performance evaluation only tested the functions independently from each other. Research in this direction could better determine the characteristics of applications that could benefit most from M P I - N P 11+.  69  Bibliography [1] R. Bhoedjang, T. Riihl, and H. Bal. Efficient multicast on myrinet using linklevel flow control. In Proceedings of the 1998 International  Conference on Par-  allel Processing (ICPP '98), pages 381-391, Washington - Brussels - Tokyo, August 1998. IEEE USA. [2] Raoul A. F. Bhoedjang, Tim Riihl, and Henri E. Bal. User-level network interface protocols. Computer, 31(11):53—60, 1998. [3] D. Buntinas, D. K . Panda, and P. Sadayappan. Fast NIC-level Barrier over Myrinet/GM. In Proceedings of Int'I Parallel and Distributed Processing Symposium (IPDPS), 2001.  [4] Darius Buntinas, Dhabaleswar K. Panda, Jose Duato, and P. Sadayappan. Broadcast/multicast over myrinet using NIC-assisted multidestination messages. In Communication,  Architecture,  and Applications for  Network-Based  Parallel Computing, pages 115-129, 2000. [5] Darius Buntinas, Dhabaleswar K. Panda, and P. Sadayappan. Performance benefits of NIC-Based barrier on Myrinet/GM. In Proceedings of the 15th International  Parallel & Distributed Processing Symposium (IPDPS-01), pages  166-167, Los Alamitos, CA, April 23-27 2001. IEEE Computer Society. [6] C. Dubnicki, A. Bilas, Y . Chen, S. Damianakis and K . Li. VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication. In Proceedings of Hot Interconnects, August 1997. [7] Cosimo Anglano Claudio. Network interface multicast protocols for wormholebased networks of workstations. Technical report, DISTA, Universita' del Piemonte Orientale, February 2000. [8] Cisco Systems Corporation. Introduction to Gigabit Ethernet. http://www.cisco.com/warp/public/cc/techno/media/lan/gig/tech/gigbt_tc.htm 1997. 70  [9] Intel Corporation. Using the RDTSC Instruction for performance monitoring. http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPMl.HTM, 1997. [10] D. W. Walker.  An Introduction to Message Passing Paradigms. In C. E.  Vandoni, editor, Proceedings of the 1995 CERN School of Computing,  CERN  95-05, pages 165-184, Aries, France, August 1995. [11] Message Passing Interface Forum. 
MPI: A Message-Passing Interface Standard, http://www.mpi-forum.org/docs/mpi-ll-html/mpi-report.html, June 1995. [12] Ian T. Foster and K. Mani Chandy. FORTRAN M : A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(l):24-35,  1995. [13] Frederique Chaussumier, Frederic Desprez, Loic Prylli. Asynchronous Communication in MPI - The BIP/Myrinet Approach. In Jack Dongarra and Emilio Luque and Tomas Margalef, editor, Recent advances in parallel virtual machine and message passing interface: 6th European PVM/MPI Users' Group Meeting, Barcelona, Spain, September 26-29, 1999: proceedings, volume 1697 of  Lecture Notes in Computer Science, pages 485-492, Berlin, Germany / Heidelberg, Germany / London, U K / etc., 1999. Springer-Verlag. [14] G. Burns, R. Daoud and J. Vaigl. L A M : An Open Cluster Environment for MPI. In Supercomputing Symposium '94, Toronto, Canada, June 1994. [15] Mario Gerla. Multicasting in myrinet — a high-speed, wormhole-routing network. In IEEE Global Communications  Conference (GLOBECOM'96),  London,  UK, Nov 1996. [16] Mario Gerla, Prasasth Palnati, and Simon Walton. Multicasting protocols for high-speed, wormhole-routing local area networks. In SIGCOMM, pages 184193, 1996. [17] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. Highperformance, portable implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789-828, 1996. [18] H. Tezuka, F. O'Carroll and A. Hori. Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication. In 12th International Parallel Processing Symposium and 9th Symposium on Parallel and Distributed  Processing, Orlando, USA, March 1998. 71  [19] C . Keppitiyagama. A Network Processors based Message Manager for MPI. Master's thesis, University of British Columbia, Vancouver, Canada, 2000. [20] Chamath Keppitiyagama and Alan Wagner. on myrinet.  Asynchronous M P I messaging  In Proceedings of the 15th International Parallel & Distributed  Processing Symposium (IPDPS-01),  pages 50-50, Los Alamitos, C A , April 23-  27 2001. I E E E Computer Society. [21] R. Kesavan and D. K . Panda. Optimal multicast with packetization and network interface support. In Proceedings of The 1997 International Conference on Parallel Processing, pages 370-377, Bloomingdale, IL, August 1997. [22] L A M Team. Porting the L A M 6.3 communication layer. Technical Report T R 00-01, University of Notre Dame, August 1999. [23] M . Lauria and A . A . Chien. M P I - F M : High Performance M P I on Workstation Clusters. Journal of Parallel and Distributed Computing, 40(1):4 - 18, January 1997. [24] N . J . Boden, D. Cohen, R . E . Felderman, A . E . Kulawik, C . L . Seitz, J . N . Seizovic and W . Su. Myrinet - A Gigabit-per-Second Local-Area Network. IEEE  Micro,  15(1):29 - 36, February 1995. [25] Peter Pacheco. Parallel Programming with MPI. Morgan Kaufmann, San Francisco, C A , 1997.  http://www.usfca.edu/mpi  (source programs available) and  http://www.mkp.com. [26] Dhabaleswar K . Panda. Issues in designing efficient and practical algorithms for collective communication on wormhole-routed systems. In Proceeding of the ICPP  Workshop on Challenges for Parallel Processing, pages 8-15,  [27] Loic Prylli, Bernard Tourancheau, and Roland Westrelin.  1995.  The design for a  high-performance M P I implementation on the myrinet network. In  PVM/MPI,  pages 223-230, 1999. [28] R. Bhoedjang, T . Ruhl and H . E . Bal. Design Issues for User-Level Network Interface Protocols. 
IEEE  Computer, 31(11):53 - 60, November 1998.  [29] S. Pakin, V . Karamcheti and A . A . Chien.  Fast Messages (FM): Efficient,  Portable Communication for Workstation Clusters and Massively-Parallel Processors. IEEE  Concurrency, 5(2):60 - 73, June 1997.  72  [30] Scott Pakin, Mario Lauria, Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Proc. of the Supercomputing'95, New Orleans, December 1995. [31] R. Sivaram, R. Kesavan, D. Panda, and C. Stunkel. Where to provide support for efficient multicasting in irregular networks: Network interface or switch? In Proceedings of the 27th Int'l Conf. on Parallel Processing (ICPP '98), pages  452-459, August 1998. [32] Craig B. Stunkel, Rajeev Sivaram, and Dhabaleswar K . Panda. Implementing multidestination worms in switch-based parallel systems: Architectural alternatives and their impact. In ISCA, pages 50-61, 1997. [33] Toshiyuki Takahashi, Shinji Sumimoto, Atsushi Hori, Hiroshi Harada, and Yutaka Ishikawa. PM2: A high performance communication middleware for heterogeneous network environments. In Supercomputing, 2000. [34] K . Verstoep, K . Langendoen, and H. Bal. Efficient reliable multicast on MYRINET. In Proceedings of the 25th International  Conference on Parallel  Processing, volume III, Algorithms and Applications, pages 111:156-165, Boca Raton, FL, August 1996. CRC Press. Vrije University, The Netherlands. [35] A. Wijeyeratnam. An MPI Messaging Layer for Network Processors. Master's thesis, University of British Columbia, Vancouver, Canada, 1999.  73  
