UBC Theses and Dissertations


Towards MPI progression layer elimination with TCP and SCTP. Penoff, Bradley Thomas. 2006.


Full Text

Towards MPI Progression Layer Elimination with TCP and SCTP

by

Bradley Thomas Penoff

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia

January 2006

© Bradley Thomas Penoff, 2006

Abstract

MPI middleware glues together the components necessary for execution. Almost all implementations have a communication component, also called a message progression layer, that progresses outstanding messages and maintains their state. The goal of this work is to thin or eliminate this communication component by pushing its functionality down onto the standard IP stack in order to take advantage of potential advances in commodity networking. We introduce a TCP-based design that successfully eliminates the communication component. We discuss why this TCP-based design does not scale and show a more scalable design, based on the Stream Control Transmission Protocol (SCTP), that has a thinned communication component. We compare the designs, showing why SCTP one-to-many sockets in their current form can only thin, and not completely eliminate, the communication component. We show what additional features would be required of SCTP to enable a practical design with a fully eliminated communication component.
Contents

Abstract
Contents
List of Tables
List of Figures
1 Introduction
2 MPI Middleware
  2.1 Middleware
  2.2 MPI Runtime Execution Environment
  2.3 MPI Library
  2.4 Message Progression Layer
    2.4.1 Message Matching
    2.4.2 Expected/Unexpected Messages
    2.4.3 Short/Long Messages
3 TCP Socket-based Implementations
  3.1 The LAM-TCP Message Progression Layer Implementation
    3.1.1 Message Matching
    3.1.2 Expected/Unexpected Messages
    3.1.3 Short/Long Messages
  3.2 Socket per Message Stream MPI Implementation
    3.2.1 Message Matching
    3.2.2 Expected/Unexpected Messages
    3.2.3 Short/Long Messages
  3.3 Critique of the Socket per Stream Implementation
    3.3.1 Number of Sockets
    3.3.2 Number of System Calls
    3.3.3 Select System Call
    3.3.4 Flow Control
4 SCTP-based MPI Middleware
  4.1 SCTP Overview
  4.2 SCTP-based Implementation of the Message Progression Layer
    4.2.1 Message Matching
    4.2.2 Expected/Unexpected Messages
    4.2.3 Short/Long Messages
5 Related Work
6 Conclusions
Bibliography

List of Tables

6.1 Summary of presented MPI implementations

List of Figures

1.1 Communication component of MPI middleware
1.2 IP-based implementation of the middleware
2.1 Overview of an executing MPI program
2.2 Splitting the communication middleware between the library and the transport protocol
3.1 MPI and LAM envelope format
3.2 Comparison of MPI send/recv and socket library send/recv
3.3 Motivation for expected queues
4.1 SCTP association versus TCP connection
4.2 One-to-many sockets contain associations which contain streams
4.3 Mapping of SCTP streams to MPI tag, rank and context

Chapter 1

Introduction

Performance has always played a vital role in the analysis of computer applications. Advances in processor speeds have resulted in a corresponding increase in application performance.
In order to gain further performance improvements, computer scientists have researched optimization techniques other than increasing the processor's clock speed. One area of research has involved the use of parallelism. Initially, systems were developed that could themselves automatically exploit parallelism through, for example, the concurrent use of multiple functional units. In addition to systems-based parallel approaches, it has become apparent that further performance improvements could come through user-level parallelism. An example of this is the use of parallel programming models. Such models let applications be designed so that they can exploit multiple resources at once in a distributed manner.

One parallel programming model is to have independently executing processes communicating by explicitly passing messages between them. Within the message passing programming model, the Message Passing Interface (MPI) is one API defined to implement parallel applications. It has been in existence for over a decade now. In that time, MPI has become a de-facto standard for message-passing programming in scientific and high-performance computing, which has resulted in the creation of a large body of parallel code that can execute unchanged on a variety of systems.

Middleware is an essential part of any MPI system. MPI was designed from the onset to be independent of the execution environment. Its communication primitives were designed to be general enough that implementations could allow for all possible interconnect technologies. Since no assumptions can be made about the execution environment or the properties of the interconnect, MPI has needed middleware for mediating amongst the operating system and the interconnect hardware.
Overall, the middleware is responsible for coordinating between the application and the underlying components; components exist for starting the application execution, for runtime support in managing processes, and for communicating within the application across different interconnect technologies. The focus of the work in this thesis is primarily on the communication middleware component of the MPI middleware.

The public domain versions of MPI (MPICH [54], LAM [9], and OpenMPI [15]) implement communication components for several interconnects. For example, each has communication component implementations that work on shared memory for SMPs. Also, components exist for specialized interconnects common in dedicated clusters; examples of such interconnects include technologies like Infiniband and Myrinet.

In addition to specialized interconnects, the public domain versions have always supported the IP-based transport protocol TCP in their available communication components. TCP/IP is used to allow MPI to operate in local area networks. More recently, the use of MPI with TCP has been extended to computing grids [29], wide area networks, the Internet, and meta-computing environments that link together diverse, geographically distributed computing resources. The main advantage of using an IP-based protocol (i.e., TCP/UDP) for MPI is portability and the ease with which it can be used to execute MPI programs in diverse network environments.

One well-known problem with using TCP, or UDP, for MPI is the large latencies and the difficulty in exploiting all of the available bandwidth. Although applications sensitive to latency suffer when run over TCP or UDP, there are latency tolerant programs, such as those that are embarrassingly parallel or almost so, that can use an IP-based transport protocol to execute in environments like the Internet.
In addition, the dynamics of TCP are an active area of research where there is interest in better models [11] and tools for instrumenting and tuning TCP connections [33]. As well, TCP itself continues to evolve, especially for high performance links, with research into new variants like TCP Vegas [14, 35]. Finally, latency hiding techniques and exploiting trade-offs between bandwidth and latency can further expand the range of MPI applications that may be suitable to execute over IP in both local and wide area networks. In the end, the ability of MPI programs to execute unchanged in almost any environment is a strong motivation for continued research in IP-based transport protocol support for MPI.

TCP and UDP have been the two main IP protocols available for widespread use in IP networks. Recently, however, a new transport protocol called SCTP (Stream Control Transmission Protocol) has been standardized [49]. SCTP is message oriented like UDP but has TCP-like connection management, congestion and flow control mechanisms. In SCTP, there is an ability to define streams that allow multiple independent message subflows inside a single association. This eliminates the head-of-line blocking that can occur in TCP-based middleware for MPI. In addition, SCTP associations and streams closely match the message-ordering semantics of MPI when messages with the same context, tag and source are used to define a stream (tag) within an association (source).

SCTP includes several other mechanisms that make it an attractive target for MPI in open network environments where secure connection management and congestion control are important. It makes it possible to offload some MPI middleware functionality onto a standardized protocol that will hopefully become universally available.
Although new, SCTP is currently available for all major operating systems and is part of the standard Linux kernel distribution, Mac OS X, FreeBSD, as well as Sun Solaris 10.

In previous work [26], we investigated the effect latency and loss had on benchmark programs using TCP. After this initial work, we designed a communication component for MPI within LAM atop SCTP [27] and later implemented and evaluated it, comparing the design and performance to TCP [28]. Our initial experiments were conducted using standard benchmark programs as well as a synthetic task farm program. The task farm was designed to take advantage of our component's specific features, which in turn leverage the features offered by SCTP. Our tests introduced loss through the use of Dummynet. Loss in Dummynet resulted in out-of-order packet delivery, a characteristic that could result in real wide area networks during congestion or when intermediate routes change during a flow. Generally, our module's performance behaved similarly to TCP under no loss, but as loss increased, the strengths of SCTP versus TCP became apparent, showing large performance benefits. In later work [25], we further studied a latency tolerant task farm template and incorporated real applications into its framework so that they could take advantage of the additional features SCTP provides. Using this template, we modified two real applications: the popular protein alignment search tool mpiBLAST [13], as well as a robust correlation matrix computation. This demonstrated the benefits our SCTP-based communication component can have for real applications under loss and latency.

Using SCTP for MPI has performance benefits [1], but use of SCTP required a different design for the communication component.
One repeated theme when concentrating on the design of SCTP-based middleware is how one could push functionality previously within the communication component down onto the protocol stack, effectively thinning the communication component within the MPI middleware. Traditionally, middleware has had its communication component designed as in the hourglass-type network architecture shown in Figure 1.1. In this thesis, we focus on the alternative network architecture for MPI's communication middleware shown in Figure 1.2; such a design attempts to not only thin the communication component, but to eliminate it altogether.

[1] In her master's thesis, my partner Humaira Kamal focused on the performance results of our SCTP-based module. This thesis focuses on the design of the module and comparison to other MPI designs.

[Figure 1.1: Communication component of MPI middleware]

[Figure 1.2: IP-based implementation of the middleware]

Rather than including extensive support for different interconnects in the middleware, we investigated middleware designs that could take advantage of standard transport protocols that use IP for interoperability. One advantage of the network architecture shown in Figure 1.2 is that it simplifies the middleware, which could lead to a more standardized MPI library that could improve the interoperability between different MPI implementations. MPI middleware has never been standardized and each implementation uses its own design. For example, MPICH's ADI [53, 52] and LAM's RPI [44] describe their layering of the middleware.
OpenMPI [16] is a more recent project with the goal of modularizing the middleware to support mixing and matching modules according to the underlying system and interconnect. The use of IP substantially reduces the complexity of designing MPI middleware in comparison to designs that try to exploit the particular features of an interconnect, which is one source of incompatibilities between implementations [2].

The advantage of the usual implementation of the communication middleware component is that it can fully exploit the performance of specialized interconnects. However, these interconnects all have implementations of IP, and the affordability and performance of Ethernet make TCP/IP widely used for MPI. With the increased use of commodity processors and commodity OSes, an interesting research topic becomes how much the use of commodity networking can be exploited if this commoditization trend is to continue. One initial concern is performance. If one compares microbenchmarks, TCP/IP achieves less bandwidth and significantly higher latency than customized stacks and interconnects. But microbenchmarks can be misleading, so they should be carefully interpreted; improved bandwidth and latency do not always produce the same gains at the application level. For example, the performance of TCP/IP over Ethernet is within 5% of the performance of Myrinet on the NAS benchmarks, and even closer with more efficient TCP implementations [32]. The advantage of using commodity networking (i.e., TCP/IP) is that it can continue to leverage the advances in mainstream uses of networking that share many of the same performance demands as MPI programs.

[2] Recognizing this is an important problem, other attempts to standardize the interoperability of MPI implementations have also been made, such as the industry-led IMPI effort [55].
Given the network architecture shown in Figure 1.2, it is interesting to investigate re-designs of the middleware that can leverage standardized transport protocols over IP with a view to simplifying and eliminating the communication middleware. This approach differs from the design of current public domain implementations of MPI in that it attempts to push functionality down into the transport layer rather than pull it up into the middleware. For example, OpenMPI sequences all its data within the middleware and stripes it across all available interconnects, managing message assembly and connection failures within the communication component. If one were to limit the communications to those only atop IP, it could be possible to avoid such management in the middleware if networking research such as Concurrent Multipath Transfer [24] were exploited, where the data is sequenced and striped within the transport layer instead.

We compare and contrast three different designs of the MPI middleware. We briefly describe an existing TCP-based MPI, we then consider a TCP socket per message stream design, and finally, after a brief SCTP introduction, we consider an SCTP-based design. We will compare and contrast the SCTP-based design with the original TCP-based and socket per message stream designs described in the first part of the thesis.

The contribution of this work is that it identifies the extent to which we can take advantage of standardized transport protocols to simplify and eliminate MPI middleware functionality.
We discuss the requirements of MPI messaging in terms of demultiplexing, flow control, and the communication management properties of the underlying transport protocol. The socket per message stream design leads to a simple implementation of MPI that completely eliminates the communication middleware layer; however, it does not scale. We show how using SCTP thins the middleware and avoids the scalability problems; however, SCTP one-to-many sockets in their current form present limitations that make it difficult to completely eliminate the communication middleware [3]. We discuss what would be required of SCTP one-to-many sockets for full elimination [4].

In Chapter 2, we look at the MPI library and its requirements for correct implementation. In Chapter 3, we introduce a standard TCP-based implementation and later look at a socket per message stream design, comparing and contrasting it with the standard TCP-based design. In Chapter 4, we look at our SCTP-based design of the middleware to see how this brings us closer to the goal of eliminating the communication middleware, and what would be required of SCTP for full elimination. In the end, after surveying related work in Chapter 5, we finish by offering some conclusions in Chapter 6.

[3] Elimination is through the use of the more scalable one-to-many socket API described in Section 4.1. SCTP one-to-one style elimination is possible but would have the same problems presented in Section 3.3, so it is not examined in this thesis.

[4] A conference paper [39] describing these results will be presented in April 2006 at HIPS (High-Level Parallel Programming Models and Supportive Environments).
Chapter 2

MPI Middleware

2.1 Middleware

Middleware is an essential piece of MPI. As with any parallel programming model, an MPI application requires a variety of services in order to properly run. These services are each glued together by the middleware [10]; they can be conveniently split into three interacting components:

• a job scheduler to determine which resources will be used to run the MPI job,

• a process manager for process initialization/shutdown, signal delivery, error detection as well as process stdin/stdout/stderr redirection, and finally

• a parallel library such as MPI to provide a mechanism for parallel processes to effectively exchange messages.

Although the focus will be on the MPI library, we briefly discuss the remaining pieces.

2.2 MPI Runtime Execution Environment

As shown in Figure 2.1, an MPI message-passing program is a collection of processes executing on one or more processors, exchanging messages using an underlying communication medium. The MPI standard was purposely designed to avoid assumptions about the runtime environment.

[Figure 2.1: Overview of an executing MPI program]

As a result, MPI applications themselves need an environment to run in, and thus MPI releases typically provide middleware to manage the interaction of three components: job scheduler, process manager, and the parallel library (MPI itself). Some MPI releases implement the components directly themselves; others make use of externally developed versions of the components and tie their functionalities into their own framework. For example, LAM has implementations of many components, including all three of these, yet alternatively leverages external components by interfacing with job schedulers and process managers like Globus [19], OpenPBS [36], and BProc [5].
For the communications component, LAM implements it as a library linked to by applications, as pictured in Figure 2.2; this library makes calls into the other two aforementioned components as well as to another communication component within its middleware that manages all MPI messages. MPICH and OpenMPI also leverage external components, yet similarly always manage the state of MPI messages within communication components in their middleware. All of these MPI releases can run over a variety of interconnects and follow the network architecture that was shown in Figure 1.1. As mentioned, the focus of this work is on the MPI library and the communication middleware that is used to support it.

2.3 MPI Library

The MPI library offers a variety of routines to help application programmers effectively develop parallel programs. The middleware exports the API as defined by the MPI standard and the application links to this library. The MPI API routines are typically implemented with the help of other functions which in turn call the appropriate lower layer system and device routines. Often MPI implementations further modularize these other functions into components to allow for maximum flexibility of hardware and algorithm choice. For example, amongst other things, OpenMPI allows you to select at runtime which hardware to run on, which datatype implementation to use, and which reduction operation algorithm to perform.

As shown in Figure 2.2, one typically obtains a layered system where the MPI library code calls into lower level modules for job scheduling, process management and communication.

[Figure 2.2: Splitting the communication middleware between the library and the transport protocol. MPI library code calls into lower level modules for job scheduling, process management and communication (represented by the dashed line).]
The communication middleware consists of one or more modules that call the operating system in the case of TCP, and possibly lower level network layer functions in the case of specialized interconnects like Infiniband. Our intent is to remove the direct access to the network layer and remove the state maintained in the communication middleware by dividing its functionality between the MPI library and an IP transport protocol. The management of active MPI messages and their progress will then be handled by the transport protocol.

The message progression layer within the communication middleware is responsible for progressing messages to ensure that all messages sent are eventually received. The existence of non-blocking communication, in addition to other types of messaging, and the need to match sends and receives make it difficult to completely execute an MPI communication routine as a simple library function. Typically, message progression state has to be maintained in between MPI calls. One specific question then becomes what state has to be maintained and to what extent message progression can be accomplished by the library versus a middleware layer versus the transport protocol. Another question is when exactly message progression should occur. These questions are answered in the next section. We later eliminate this communication component by storing some state in the arguments passed into the MPI communication routine and then relying on the transport layer protocol to actively progress messages.

2.4 Message Progression Layer

The message progression layer copies the message, allocates and frees memory when necessary, enforces the MPI semantics for the different message types, performs message matching, executes the lower level communication protocol, constructs the appropriate message header, and finally sends and receives data to and from the underlying transport layer.
How the message progression layer is invoked depends on an implementation's interpretation of the Progress Rule for MPI [7]. This rule is defined in the MPI Standard for asynchronous communication operations. Unfortunately, it is ambiguous, so its exact meaning is left open to interpretation. When a weak interpretation is taken, messages are advanced only during an MPI call, implying that this layer is only invoked then. However, with a strict interpretation, the message progression layer can be entered at any point in the application's execution. In this thesis, we focus on implementations that assume a weak message progression rule, where messages are progressed only during MPI calls and there is not a separate thread, process, or processor that progresses messages. Additionally, we also do not discuss interrupt driven schemes [51] that use signals or some other type of callback mechanism.

The various public domain versions of MPI define their interfaces for message progression in a variety of ways. In LAM, there is one layer for this purpose called the request progression interface (RPI), while in MPICH it is a combination of two layers: an abstract device handling MPI semantics and a lower level channel that acts as a byte mover. Similarly, OpenMPI splits the MPI semantics off into its own layer called the PML (point-to-point management layer). However, underneath this layer it has two other layers for message progression: a simple byte transfer layer (BTL) as well as a BTL management layer (BML) that attempts to stripe data across all underlying interconnects [56, 43].

A communication is initiated in MPI when a send or receive communication routine is called (i.e., posted). The actual completion of the call depends on the semantics of the communication primitive. In the case of non-blocking communication, the call returns immediately and the user must test or wait for the completion of the communication.
The send call completes once the user send buffer can be reused; the receive call completes once the message has been copied into the user receive buffer.

In general, the message progression layer manages the life-cycle of a message. The three main tasks we will focus on in the message progression layer are message matching, the managing of expected and unexpected messages, and the long/short message protocols that are typically used. We will describe the MPI calls in terms of actions required by the transport layer.

2.4.1 Message Matching

Message matching in MPI is based on three values specified in the send and receive call: (i) context, (ii) source/destination rank, and (iii) tag. The context identifies a group of processes that can communicate with each other. These values all appear inside the message envelope, which is part of the header sent from source to destination. Within a context, each process has a unique identification called its rank, which is used to specify the source and destination process for the message. Finally, there is a user defined tag value that can be used to further distinguish messages. A receive call matches a message when the tag, source rank and context (TRC) specified by the call match the corresponding values in a message envelope. MPI semantics dictate that messages belonging to the same TRC must be received (i.e., completed) in the order in which the sends were posted.

Message matching is a form of demultiplexing that identifies the particular receive call that is the target of the sent message. The destination rank identifies the target process, whereas the combination of context, source rank and tag identifies the particular receive call. Because the TRC is provided dynamically at runtime, a receive call can potentially receive any message sent to that destination matching the TRC. There are also wildcards that can be used in a receive call to match messages from any source (MPI_ANY_SOURCE) and/or any tag (MPI_ANY_TAG). Wildcards provide a form of non-deterministic choice where a process can receive from some collection of messages sent to the process.

2.4.2 Expected/Unexpected Messages

One way to view messaging in MPI is in terms of send and receive requests that need to be matched. We ignore the payload for now and also consider non-blocking communication, where there can be any number of send and receive requests posted. Some protocol must be used to ensure that two matching send and receive requests, after being posted, eventually synchronize and complete the communication. Most implementations of MPI do this by managing two structures: one for expected messages (local receive requests) and one for unexpected messages (remote send requests). When the message progression layer processes a receive request, it can search the unexpected queue for a matching request and, if one is not found, add itself to the expected message structure. In the case of a send request, the message progression layer sends the request to the remote process, where the message progression layer on the remote process checks its expected message structure for a match and, if one is not found, adds the request to the unexpected message structure.

The management of the expected and unexpected message structures is simple in principle, but there are several subtle issues that complicate it. First, in order to avoid potential deadlock, the message progression layer must be able to accept all requests. The MPI standard defines a "safe program" as a program that correctly finishes when every communication is replaced by a synchronous communication, which effectively implies that the progression layer needs to be able to accept only a single request. The guaranteed envelope resource protocol (GER) [8] extends this guarantee to a bounded number of requests.
Second, because of weak message progression, each MPI call needs to progress messages by updating the state of the expected and unexpected message structures and by sending and receiving data to and from the transport layer below. The message progression layer needs to maintain state between the MPI calls since it will need to progress messages for other calls; for example, it needs to keep checking the transport layer for remote send requests from other processes to be matched or added to the unexpected message structure.

2.4.3 Short/Long Messages

In terms of network level functions, long and short messages in MPI introduce problems associated with message framing and flow control.

The previous discussion only considers requests and assumes that once requests have been matched the transfer can be completed. In MPI this is complicated by the fact that messages can in principle be arbitrarily long. Since the system does not have an unlimited amount of buffer space, at some point in the transfer, parts of the message must use the user level buffer space given as a parameter to the receive call. As a result, for a sufficiently large message, the communication becomes synchronous, where the send call cannot complete before the receive has been posted. Furthermore, depending on the transport layer used, large messages may need to be fragmented and reassembled on the receive side. Fragmentation may also be necessary for fair message delivery, to ensure that the message progression layer doesn't block for long periods of time while engaged in the transfer of one large message, especially given that the transfer could occur unexpectedly (i.e., advanced by the message progression layer) on any MPI call.

Given that the message progression layer must accept requests and also manage the transfer of large messages, a natural solution is to handle short and long messages differently.
Short messages are bundled with the request and thus can be immediately copied into the user's memory when matched. Long messages, which may require a synchronous communication, use a rendezvous protocol that first matches the two requests and then arranges a transfer of the message from the user's send buffer on the remote side to the user's receive buffer. Thus, short messages are sent eagerly in most MPI implementations, whereas long messages use a rendezvous [6]. There is an additional performance benefit to this approach. As Gropp describes for MPICH [53], if one considers memory copying and the cost of the rendezvous, then when message copying costs exceed rendezvous costs it is more efficient to use rendezvous, assuming it can avoid the extra copying, rather than eager send.

Chapter 3

TCP Socket-based Implementations

In the first part of this chapter, Section 3.1, we focus on the LAM-TCP RPI in order to illustrate a typical implementation of the message progression layer. In the second part, Section 3.2, we describe a TCP-based implementation of the MPI library that eliminates the message progression layer altogether, following Figure 2.2. By elimination, we mean that such a layer is no longer necessary to maintain the hidden state required to progress messages. Since messages still need to be advanced, we eliminate the message progression layer by moving some functionality into the MPI library routines and the remainder down into TCP, as illustrated in Figure 2.2. Although this design is not very scalable, for reasons to be discussed, it serves to illustrate the features that are needed in the library and transport layer when having an MPI implementation where the message progression layer is eliminated. After sketching this design, in Section 3.3 we discuss problems and possible techniques to alleviate some of them.
3.1 The LAM-TCP Message Progression Layer Implementation

LAM-MPI is a popular open source implementation of MPI. It supplies message progression layers for various interconnects, one of which is TCP. Our work in [27] and [28] involved carefully analyzing this particular TCP message progression layer (referred to as LAM-TCP from here forward); we summarize LAM-TCP here to demonstrate a typical message progression layer implementation.

When an MPI application begins, each MPI process executes MPI_Init(). Within this call for LAM-TCP, each process initializes a struct that contains the state between itself and each other process. A given process will know its own rank and how large its MPI_COMM_WORLD is. Other than setting initial states between itself and other processes, it also connects to each other process by way of a socket. This is accomplished by opening a listener port and then making this port number available by way of the out-of-band daemons offered by LAM¹. If a given process is creating a connection with a process whose global rank is less than its own, then it acts as a client, issuing a connect() call to the other process's listening port. On the other hand, if the other global rank is greater, then the lesser ranked process will issue an accept(), creating a new connection on the resulting socket and port once the client connects.

At the end of MPI_Init(), each of the N processes has opened N-1 additional TCP sockets; the total topology is fully connected from the beginning of the MPI application. The listening socket is then closed. While performing connection establishment, a global map is created going from each socket's file descriptor value to the other process's global rank value. This map is an array of size FD_SETSIZE, which is typically defined to be on the order of 1024 for standard users on most systems.
¹The LAM daemons serve the role of the job scheduler and process manager components introduced in Chapter 2.

Having outlined how connections amongst processes are established in LAM-TCP, next we consider the message progression layer functionality with respect to its main tasks described in Section 2.4.

3.1.1 Message Matching

When messages are passed from one process to another, they are sent together with the appropriate control information, or envelope, containing information about the associated message. As mentioned earlier in Section 2.4.1, message matching in MPI is based on three values specified in the send and receive call: (i) context, (ii) source/destination rank, and (iii) tag. In LAM, as shown in Figure 3.1, these are included in the envelope in addition to the message length, a sequence number, and a flag field that indicates what type of message body follows the envelope. Each of these fields is 32 bits in LAM. Not all MPI implementations use the same envelope structure. This is in part because different interconnects can specify more or less information. The Interoperable MPI (IMPI) project [55] attempts to standardize the envelope packet format so that MPI implementations can interact.

[Figure 3.1: MPI and LAM envelope format. An MPI message consists of context, rank, tag, and payload; the LAM envelope contains message length, tag, flags, rank, context, and sequence number fields.]

Since there is a connection per MPI process pair, for a standard receive call it is clear where the progression layer should read a message from: the appropriate socket for the source rank. The information read from the envelope can be compared to the tag, rank, and context (TRC) of the receive call. The presence of a wildcard for the tag (MPI_ANY_TAG) doesn't differ from the standard case except that the arrived message's tag value is not important.
However, the presence of a wildcard for the source (MPI_ANY_SOURCE) complicates matters. With such a wildcard, messages from any of the ranks in MPI_COMM_WORLD are legitimate. The implementation handles this by adding all sockets to a set and then calling select() on that set.

3.1.2 Expected/Unexpected Messages

As introduced in Section 2.4.2 for general message progression layer implementations, the expected and unexpected queues are also vital within LAM-TCP. For LAM-TCP, the queues are each implemented as a linked list within the message progression layer. When a request is posted, it is added to the end of a linked list of expected messages which have not arrived yet. When the progression layer is entered, the expected message queue is traversed and the socket associated with each request's rank is inserted into one of two file descriptor sets, depending on whether the request is for a send or receive operation. In the end, if the sum of the sets' sizes is greater than one, the sets are passed in to a select() call and afterwards the appropriate read() and write() calls are made based on its results. If the sum of the sets' sizes is one, then no select() call is necessary, so the one call can operate directly on the socket. When the progress layer is entered, setsockopt is used prior to these I/O calls to set the socket to blocking or non-blocking, depending on whether the MPI call was blocking or non-blocking.

In LAM-TCP, all tags are sent over the same connection. The ordering semantics of MPI could result in a scenario where a socket is being read for a message with one tag but the obtained message envelope shows that it contains a different tag that has yet to be pre-posted. Here, this message is considered unexpected. When a message is unexpected, it is buffered and stored inside an internal hash table.
Later, whenever a new request is posted, before being added to the expected queue, it is first checked against all of the buffered unexpected messages for a possible match, in case the message had already arrived. Unexpected messages can arrive in other ways too. For example, collective operations share the same connection but have a different context than point-to-point messages; when expecting one context, a process could receive another. Similarly, if MPI_ANY_SOURCE is used for a specific tag and it is the only outstanding request, data that has arrived on any connection for any tag will be read, since the select() call will return their sockets as ready to be read. Messages read for tags other than the specified tag will be considered unexpected.

3.1.3 Short/Long Messages

LAM-TCP splits the messages into three types internally: asynchronous short messages, which are of size 64K bytes² or less, synchronous short messages, and long messages, which are greater than 64K. The type of message is specified in the flag field in the LAM envelope (see Figure 3.1). Short messages are passed using eager-send, where the message body immediately follows the envelope. When received, they are handled as mentioned in the previous section, depending on whether the message is expected or unexpected. However, while synchronous short messages (i.e., MPI_Ssend) also communicate using eager-send, the send is not complete until the sender receives an acknowledgment (ACK) from the receiver. The state of these short protocols is maintained within the message progression layer. Long messages are sent using a rendezvous scheme to avoid potential resource exhaustion problems and also in order to have the potential of a zero-copy, as mentioned in Section 2.4.3. Initially only the envelope of the long message is sent to the receiver.
If the receiver has posted a matching receive request, then it sends back an ACK to the sender to indicate that it is ready to receive the message body. Once the ACK is received, the long message is sent. If no matching receive request was posted at the time the initial long envelope was received, it is treated as an unexpected message. Later, when a matching receive request is posted, it sends back an ACK and the rendezvous proceeds. Again, the state of the rendezvous is stored within the message progression layer.

²64K bytes is the default cutoff point; however, this can be specified by the user at runtime.

3.2 Socket per Message Stream MPI Implementation

Now that we've introduced a typical implementation of a message progression layer, we'll describe a TCP-based approach for the MPI library that attempts to eliminate the message progression layer altogether. The standard socket library provides routines that can be used to implement the MPI communication primitives. Figure 3.2 compares the standard MPI send and receive calls to the corresponding socket library calls.

int MPI_Send(void *msg, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *rcvBuf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

int send(int socket, const void *msg, int msgLen, int flags)
int recv(int socket, void *rcvBuf, int bufLen, int flags)

Figure 3.2: Comparison of MPI send/recv and socket library send/recv

Some of the differences can be easily accommodated. For example, message length is expressed in terms of MPI types, which within the MPI library can be converted into an actual byte length using packing information contained in the MPI_Datatype. The other parts of the message are the communicator, source/destination, and tag fields. The design attempts to map these into a socket descriptor.
As a result, every TRC will have its own socket and TCP connection, which is consistent with the message ordering semantics of MPI. In general, since tags are user defined variables, sockets will need to be created dynamically when communication occurs. Although this may require a large number of descriptors, it is possible to mitigate this problem by closing connections that are not being used. Typically, depending on the program, there are few TRC combinations active at any one time.

The MPI_Init() routine can be used to exchange the information necessary to initialize the system. In particular, at a minimum, at runtime we need the IP addresses of all machines running MPI processes. We assume there is a pre-assigned control port for each MPI process. This is used for the accept() socket call when creating new connections initiated by a connect() call to that port from another MPI process. When a new connection is created on this control port using accept(), a new socket using its own port is created and the control port can continue to be reused as a control port to initiate other connections. The control IP and port are made available for each rank by way of the LAM out-of-band daemons.

The MPI_Init() routine creates the MPI_COMM_WORLD communicator, an opaque data structure, which for each process contains the table of the machines in MPI_COMM_WORLD. More specifically, the table gives a mapping from MPI ranks to IP addresses and control ports. Since the communicator object is an argument to all communication calls, every call has access to this table. As well as the rank-IP table, communicators also maintain a separate "connections" table that is a mapping from a TRC to a socket descriptor. Each MPI send or receive uses the TRC value as a key to find the corresponding socket for the send and receive socket call.
If there is no entry for that key, then the library uses the rank to determine the IP address and port. The control port is then used to connect to the remote machine and create a new connection for that TRC. To create a new connection, one end must execute accept() whereas the other end must execute connect(). Because of wildcards, the receive side executes the accept() and the send side executes a corresponding connect(). Note, this is only done for the first connection.

Having sketched out the basic scheme, we will now consider how the message progression layer functionality requirements described in Section 2.4 will be handled in this socket per stream implementation of the MPI library that has eliminated the message progression layer.

3.2.1 Message Matching

The basic scheme works for the standard send and receive since, in this implementation, a TRC defines a message stream in MPI and a TCP socket per TRC provides a full ordering between endpoints. However, matching is complicated by the existence of wildcards. Wildcards are only used in MPI receive calls and are MPI_ANY_SOURCE and/or MPI_ANY_TAG. The select() socket call allows us to create a set of socket descriptors that can be used to block while waiting for data on one of the connections. The MPI library code scans the connections table to create a set of sockets that can match the receive and can then be used in a select() call. It is possible that there is a matching TRC for which a connection does not yet exist, so a new connection will need to be established. New connections can be handled by adding the socket associated with the control port to the select() call. However, there may be connect requests from one or more processes not associated with the receive call.
As a result, once a new connection is accepted, the TRC information from the remote send side needs to be sent to the receive side in order for it to process the connection request and update the connections table associated with the communicator. This was the rationale for having the receive side do the accept(): receive calls can receive from multiple processes and need to use select(), whereas sends do not use wildcards and either already have a pre-assigned socket or need to create a new connection using connect(). The receive call that executes the select() may be notified of data on an existing connection or of a request for a new connection, which must be added to the connection table. Depending on the implementation, it is possible that the request is for a different communicator, which implies that the tables for MPI_COMM_WORLD and all communicators derived from it must be accessible. An alternative approach exists that avoids requiring access to all communicators. When creating a new communicator, an old communicator is provided to the MPI_Comm_create call; the new communicator is some subset of the old communicator. Creating a new communicator is a global operation over the ranks in the old communicator, even if they are not going to be within the resulting new communicator. As a result, it is during this global operation that it is possible to negotiate a new control port to be stored for each communicator. This eliminates the possibility of receiving a connection request for a communicator different from the one specified in the receive call, since every communicator will have its own control port and communicators can never be wildcarded.

3.2.2 Expected/Unexpected Messages

The global expected and unexpected message structures are no longer needed under this design.
Because each MPI message stream (TRC) now has its own connection, for the unexpected queue the implementation relies instead on the socket buffers and the TCP flow control mechanism to manage the flow of messages on a connection once it has been created. Non-blocking communication is possible with sockets by setting the non-blocking option on the socket (O_NONBLOCK). For most socket libraries, it is possible to set the option on a per call basis, which corresponds to MPI, rather than having it as part of the connection [46]. Non-blocking communication in MPI returns an opaque message request object (MPI_Request) that is then passed to all calls that check for completion. The request objects can be used to maintain the information about what is needed to complete the MPI call. For example, in the case of a synchronous communication, the request object stores whether the message corresponding to the send or receive has been posted.

The presence of blocking and non-blocking calls in MPI creates potential scenarios where the state of a particular message's progress will be dependent on the progress of other previously posted messages. As a result, a local form of the expected message queues must be kept. To motivate the need for these queues, let's assume we had no expected message queues and the scenario illustrated in Figure 3.3 occurred. Say messages A and B were successfully put on the network from process rank 0 to process rank 1 on tag 45 using blocking sends. Process rank 1 has a non-blocking receive for message A and a blocking call for message B.

[Figure 3.3: Motivation for expected queues. Process 0 issues MPI_Send(msgA, proc1, tag45) followed by MPI_Send(msgB, proc1, tag45). Process 1 posts MPI_Irecv(msgA, proc0, tag45), whose underlying read() returns EAGAIN because nothing is in the socket buffer yet, and then MPI_Recv(msgB, ...); when msgA arrives in the socket buffer, the MPI call needs to know which buffer to put it in.]
When process rank 1 pre-posts a request for message A, if that message isn't in the socket buffer, an EAGAIN would be returned and the application would proceed. Next, during the blocking receive call for message B, process rank 1 is going to have to be aware of the request for message A, since message A is going to arrive at the socket first due to TCP's full sequencing of a connection. The library must know which buffer to put it in since, in this design, no global expected queue exists³.

³This example may seem contrived, but the same scenario could be illustrated with any combination of non-blocking calls together with MPI_Wait.

In order to process requests correctly, another table is maintained within the opaque communicator object. This table maps a TRC value to the head of a linked list of request objects. LAM-TCP uses an expected message queue with messages from all TRCs in it. In this socket per TRC design, the table effectively contains many more local expected message queues. Each linked list holds the outstanding requests on a given TRC; due to MPI semantics, these requests must be completed in order. For receive requests, access to this linked list is required so that data obtained from the socket buffer can be placed in the correct user-space buffer as specified in the posted MPI receive call. For send requests, messages need to be sent in the order in which they were posted, so this linked list is also required to be available. This table is essentially an expected queue within the opaque object, the only difference being that instead of a linked list of requests for all TRCs as in standard message progression layers, it is a table indexed by the TRC returning a smaller, more local linked list.

3.2.3 Short/Long Messages

In this design, a connection corresponds to a particular TRC.
TCP's in-order semantics ensure that messages are completed in the order they are sent. As well, TCP's flow control mechanism ensures that the eager sending of messages will not over-run the receiver's buffer and exhaust memory resources. This eliminates one of the motivations for distinguishing short and long messages and using a separate protocol for each kind. The important point is that we can take advantage of TCP's flow control mechanisms. This design simplifies the middleware since it is no longer necessary to have a short and long protocol that restricts the eager sending of larger messages. In the message progression layer, flow control is handled by complicating the protocol, restricting the eager sending of large messages. Memory resources in our case are allocated on a per TRC basis in the socket buffer and not on a per message basis in the middleware.

It is still necessary to do message framing for TCP streams to determine message boundaries, since otherwise one receive may obtain part of a message destined for some other user level receive buffer with the same TRC. This is accomplished by passing the envelope first, which contains the message's length.

3.3 Critique of the Socket per Stream Implementation

There are several issues that arise in using a socket for each TRC. These issues generally arise in any TCP implementation of the middleware, including the standard TCP implementation used in public domain implementations of MPI ([34, 28]). The major issues are the following:

- limitations on the number of sockets,
- large number of system calls,
- the select() system call, and
- TCP flow control.

3.3.1 Number of Sockets

The number of allowed file descriptors is determined by FD_MAX (typically 1024 on Linux). On larger systems, even the standard TCP implementation can exhaust the supply of file descriptors when it opens one socket for each MPI process in the system.
For this reason, both MPICH and Open MPI open connections as they are needed rather than opening all connections during a call to MPI_Init(), as is the case for LAM-TCP. The socket per TRC design makes this problem even worse. The design may be acceptable for small clusters, but has difficulty scaling to a large number of processes or with applications that may use a large number of tags and contexts. For example, consider a simple farm program with a master and N-1 workers. The master would need N-1 sockets to communicate with the workers. Typically, three tags are used with task farms: task requests, tasks, and results. Thus, a simple farm would require 3(N-1) sockets. The number of file descriptors can be reconfigured in the kernel; however, there is also a significant memory cost associated with each TCP connection, which again prohibits having a large number of connections.

Recent work by Gilfeather and Maccabe [18] addresses the problem of the scalability of connection management in TCP and introduces some techniques for alleviating these problems in clusters. The origin of their work was from similar problems that arise in web servers, which also have to manage a large number of connections and where techniques have been developed to handle these connections efficiently. This work is a good example of an advantage of using commodity transport protocols like TCP for MPI, since one can take advantage of state-of-the-art research on other applications (like web servers) that place similar demands on the underlying operating system. Even with scalable connection management, there is no easy way to know when a connection is no longer needed. A simple approach would be to time out connections, or to release them shortly after a new context is detected. But again, tags that are used rarely or for a short amount of time would result in extremely large communication times waiting to create connections.
For example, if tags are used as job numbers in a task farm, then it is quite possible that every message in the system results in a new connection.

3.3.2 Number of System Calls

The second issue is a performance issue that arises because of the way in which the middleware must poll the sockets, which results in a large number of system calls. As described in [34], the message progression layer usually needs to execute a select() call to determine the next socket with available data and then needs one or more read() calls to obtain the message. The same is true of the implementations outlined above. This results in several system calls for each MPI call, which not only requires a context switch, but also extra processing because each system call is a call into the kernel [46]. Matsudo et al. [34] address this problem by managing the expected and unexpected message structures inside the kernel.

3.3.3 Select System Call

A third issue is the performance of the select() call, which is known to scale linearly with the number of sockets. For example, as shown in Matsudo et al. [34] on a 3.06GHz dual Xeon machine, the cost of select() increases linearly from under 100 microseconds for less than 200 sockets to 900 microseconds for 1000 sockets. Newer system calls such as epoll() for Linux, and similar calls in other operating systems, have tried to improve the event handling performance of web servers [17]. The socket per TRC design still requires frequent select() calls, particularly with MPI calls having wildcards.

3.3.4 Flow Control

The final issue is flow control in TCP, which is a concern in any TCP-based MPI implementation. Since buffering is being done by the transport layer, it is possible that when MPI receive calls are delayed, the socket buffers fill and as a result trigger TCP flow control to close the receive window.
Closing and opening the advertised receive window in TCP restricts bandwidth because of TCP's built-in timers and slow-start mechanisms. This is particularly serious for high bandwidth connections where only a small fraction of the bandwidth could be utilized. There is a mismatch between the event driven operation of the transport layer and the sequential control operation of the MPI program. Flow control at the transport layer attempts to match the sender's data rate to that of the receiver. Messages in MPI are bursty and can easily trigger TCP's flow control mechanism. Flow control is an end-to-end mechanism and thus not only reduces bandwidth but also significantly adds to message latency. The standard TCP implementation of the progression layer constantly empties the socket buffers to reduce the chances of them filling. Also, since all traffic between two processes is aggregated on a single connection, it tends to smooth out some of the burstiness. The socket per TRC design does not aggregate traffic and is far more likely to trigger TCP flow control. On the other hand, our socket per TRC design does make it possible for the MPI program to take advantage of TCP's flow control mechanism, since it is possible to push back on an individual flow by delaying receiving data on a particular socket. Under this design, flow control allows the user to configure and throttle the data rates on a particular TRC. However, the end-to-end nature of TCP's flow control mechanism makes such fine-grain flow control expensive in high latency network environments. For LAM-TCP, research has been conducted on flow control mechanisms within the message progression layer itself rather than only using the flow control of the underlying transport protocol [8]. Such schemes attempt to do a user level version of push back on message flows in order to guarantee resources and to prevent exhaustion.
By pushing this functionality down into TCP, the socket per TRC design spares the MPI implementor from having to design a user level flow control mechanism and instead lets them leverage the transport layer's flow control mechanism, including potential advances that may occur there through research in the networking community.

Chapter 4

SCTP-based MPI Middleware

SCTP is a general purpose unicast transport protocol for IP network data communications; it has been recently standardized by the IETF [49, 57, 48]. SCTP is well suited as a transport layer for MPI because of some of its distinct features not present in other transport protocols like TCP or UDP. In this chapter, we first give an overview of the SCTP protocol in Section 4.1. After this, we present the implementation of an SCTP-based message progression layer in Section 4.2. We show that the use of SCTP one-to-many sockets thins the message progression layer, taking advantage of SCTP's additional features. We show that in order to completely eliminate the progression layer's functionality using a one-to-many socket, additional features of SCTP would be required.

4.1 SCTP Overview

SCTP is a general purpose unicast transport protocol for IP network data communications [49]. It was initially introduced as a means to transport telephony signaling messages in commercial systems, but has since evolved for more general use to satisfy the needs of applications that require a message-oriented protocol with all the necessary TCP-like mechanisms. SCTP provides sequencing, flow control, reliability and full-duplex data transfer like TCP; however, it provides an enhanced set of capabilities not in TCP that make applications less susceptible to loss.
[Figure 4.1: SCTP association versus TCP connection. A single SCTP association between multihomed endpoints X and Y spans all of their NICs, whereas each TCP connection binds exactly one pair of IP addresses.]

Like UDP, SCTP is message oriented and supports the framing of application data. However, more like TCP, SCTP establishes a reliable session between a pair of endpoints. SCTP calls its sessions associations, whereas TCP terms them connections. There is, however, a significant difference between a TCP connection and an SCTP association. As shown in Figure 4.1, an SCTP association can simultaneously be between endpoints with multiple IP addresses, i.e., multihomed hosts. On the other hand, a TCP connection is only capable of being between exactly one pair of IP addresses. In SCTP, if a peer is multihomed, then an endpoint will select one of the peer's destination addresses as a primary address, and all other addresses of the peer become alternate addresses. During normal operation, all data is sent to the primary address, with the exception of retransmissions, for which one of the active alternate addresses is selected. Congestion control variables are path specific. When the primary destination address of an association is determined to be unreachable, the multihoming feature can transparently switch data transmission to an alternate address. Currently SCTP does not support simultaneous transfer of data across interfaces, but this will likely change in future. Researchers at the University of Delaware are investigating the use of SCTP's multihoming feature to provide simultaneous transfer of data between two endpoints through two or more end-to-end paths [24, 23], and this functionality may become part of the SCTP protocol. In addition to the multihoming feature SCTP offers, there is also a multistreaming feature where it is possible to have multiple logical streams within an association. Each stream is an independent flow of messages which are delivered in-order.
If one of the streams gets delayed, it does not affect the other streams; in other words, multistreaming avoids head-of-line blocking. In contrast, when a TCP source sends independent messages to the same receiver at the same time, it has to open multiple independent TCP connections that operate in parallel. Each connection would then be analogous to an SCTP stream. Having parallel TCP connections can also improve throughput on both congested and uncongested links. However, in a congested network, parallel connections claim more than their fair share of the bandwidth, thereby affecting the cross-traffic. One approach to making parallel connections TCP-friendly is to couple them all to a single connection [20]. SCTP does precisely this by ensuring that all the streams within a single SCTP association share a common set of congestion control and flow control parameters. It obtains the benefits of parallel TCP connections while keeping the protocol TCP-friendly [41]. SCTP supports both one-to-one style and one-to-many style sockets. A one-to-one socket corresponds to a single SCTP association and was developed to allow porting of existing TCP applications to SCTP with little effort. In the one-to-many style, a single socket can communicate with multiple SCTP associations, similar to a UDP socket that can receive datagrams from different UDP endpoints, with the added benefit of consisting of reliable associations.

Figure 4.2: One-to-many sockets contain associations which contain streams

A one-to-many socket with two associations is depicted in Figure 4.2. Figure 4.2 shows how multiple associations and their contained streams are all funneled into a single one-to-many socket. The SCTP API provides a sctp_recvmsg() method that obtains data from the one-to-many socket.
It returns a message and a populated sctp_sndrcvinfo struct that contains information about which association and stream the message came in on. The API does not provide a means of specifying which association or stream to obtain a message from. So for applications that previously used multiple TCP sockets, porting them to use a single one-to-many socket requires more effort: the application needs to be more reactive, since the sctp_recvmsg() call itself does not let you specify exactly where to get the next message from.

4.2 SCTP-based Implementation of the Message Progression Layer

We have implemented a version of the message progression layer, as a new RPI for LAM, that takes advantage of SCTP's new features [28]. Here, we briefly describe our implementation and compare it to the middleware implementations described previously. In SCTP, each association between endpoints can have multiple streams. We take advantage of streams in our SCTP-based implementation of MPI by mapping different MPI context and tag combinations onto different streams (as shown in Figure 4.3).

Figure 4.3: Mapping of SCTP streams to MPI tag, rank and context

The use of an association for each rank and a stream for each context and tag combination results in a mapping of each TRC to its own SCTP stream. This uses only a single one-to-many socket. The mapping of each TRC to its own stream is semantically equivalent to the mapping of TRCs to TCP connections discussed in Section 3.2. Both satisfy the MPI semantics of in-order message delivery; however, SCTP avoids many of the disadvantages discussed in Section 3.3.
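As a rough sketch of this mapping, the rank selects the association and the context and tag together select the stream within it. The helper names and the stream count below are illustrative assumptions, not the actual RPI code:

```python
# Hypothetical sketch of mapping a TRC (tag, rank, context) onto an
# SCTP association and stream; names are illustrative, not LAM's code.
MAX_STREAMS = 10   # streams negotiated per association (assumed value)

def trc_to_assoc_stream(tag, rank, context):
    # One association per peer rank; the context and tag hash onto one
    # of the association's streams, so messages with the same TRC
    # always share a stream and therefore stay in order.
    assoc = rank
    stream = (context + tag) % MAX_STREAMS
    return assoc, stream

# Messages with the same TRC map to the same stream (in-order),
# while different tags can progress on different streams.
assert trc_to_assoc_stream(5, 2, 0) == trc_to_assoc_stream(5, 2, 0)
a1 = trc_to_assoc_stream(5, 2, 0)
a2 = trc_to_assoc_stream(6, 2, 0)
assert a1[0] == a2[0] and a1[1] != a2[1]
```

Because in-order delivery is only guaranteed within a stream, any such mapping preserves MPI ordering per TRC while letting unrelated TRCs progress independently.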
In the remainder of this section, we discuss our implementation of the message progression layer with respect to message matching, expected and unexpected messages, and the short/long message protocol outlined in Section 2.4. Effectively, under this design some of the traditional features of a message progression layer are pushed down into the transport layer. We show what features would be required of SCTP if one-to-many sockets were to be used to fully eliminate the message progression layer, splitting its traditional functionalities between the library and the transport protocol.

4.2.1 Message Matching

In SCTP, we used a one-to-many socket that is similar to a UDP socket. Using socket calls, it is possible to specify the association (destination machine) and stream on which to send messages. Again, just as in the case of one socket per TRC, we can store this information in the communicator that is a parameter to each send call. Ideally, on the receive side we would like to receive the next message on a given association and stream; this is because TRCs are mapped to streams, which maintain the order in which messages were sent, adhering to MPI messaging semantics. However, SCTP does not support this functionality for streams or associations in the one-to-many style and simply returns the next available message on any association and any stream. Semantically, the provided sctp_recvmsg() socket call is equivalent to an MPI_Recv with MPI_ANY_SOURCE and MPI_ANY_TAG. As a result, for MPI receive calls that do not use the wildcards, it is necessary to do message matching. It is thus not possible to eliminate message matching from the design without maintaining global expected and unexpected message queues.
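A minimal sketch of this matching logic, using simplified envelopes and queue structures rather than LAM's actual data structures:

```python
# Minimal sketch (not LAM's data structures) of MPI message matching
# for a receive-anything transport like an SCTP one-to-many socket.
from collections import deque

unexpected = deque()   # arrived messages with no posted receive yet

ANY = object()         # stands in for MPI_ANY_SOURCE / MPI_ANY_TAG

def matches(env, want_rank, want_tag, want_ctx):
    rank, tag, ctx = env
    return (ctx == want_ctx
            and (want_rank is ANY or rank == want_rank)
            and (want_tag is ANY or tag == want_tag))

def post_receive(want_rank, want_tag, want_ctx):
    # First probe the unexpected queue; a match there is consumed.
    for i, (env, data) in enumerate(unexpected):
        if matches(env, want_rank, want_tag, want_ctx):
            del unexpected[i]
            return data
    return None        # caller would add to the expected queue instead

def incoming(env, data):
    # sctp_recvmsg() returned a message on *some* stream; if no posted
    # receive matches it, the message is queued as unexpected.
    unexpected.append((env, data))

incoming((1, 7, 0), "hello")                 # env = (rank, tag, context)
assert post_receive(ANY, 7, 0) == "hello"    # wildcard rank matches
assert post_receive(1, 7, 0) is None         # queue now empty
```

The queues here are global, mirroring the constraint above: since the socket does not say where the next message comes from, every arrival must be matched against every posted receive.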
The SCTP one-to-many style does simplify the socket calls needed to receive messages, since no select() call is required when only one socket is used; implementing it using a single one-to-many socket makes the message progression layer thinner. Each sctp_recvmsg() socket call returns a full message that either has to be delivered or added to the unexpected message structure. In comparison to the standard TCP implementation introduced in Section 3.1, it avoids the multiple system calls needed to read a message. In comparison to the socket per TRC design in Section 3.2, it avoids the costly select() call.

4.2.2 Expected/Unexpected Messages

Since we need to do message matching, our message progression layer still needs to maintain expected and unexpected message structures. Each of these queues needs to be global across all TRCs because the SCTP API does not let one specify the stream and association on the sctp_recvmsg() call. In the socket per TRC design, these queues were eliminated from within the progression layer; their functionalities were effectively split. For the expected message queue, a more local form of the queue was pulled up into the MPI library, as a hash of queues keyed by the TRC. The ability to fully specify the TRC within the recv() call was a trait of the transport protocol, so expected message queues were split between the MPI library and the transport protocol. Unexpected message queues in the socket per TRC design were pushed fully down onto the transport protocol. Traditional designs such as LAM-TCP place the queues in the message progression layer.
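The socket per TRC split just described can be sketched with a hash of queues keyed by TRC. The structures and names below are hypothetical, not LAM's code; unexpected messages are not modeled because in that design they simply remain buffered in each TRC's own socket:

```python
# Sketch of the socket per TRC split (hypothetical structures): the
# library keeps a hash of expected-receive queues keyed by TRC, while
# unexpected messages stay buffered in each TRC's dedicated socket.
from collections import defaultdict, deque

expected = defaultdict(deque)   # (tag, rank, context) -> posted recvs

def post_recv(tag, rank, context, request):
    # No global queue walk: the full TRC names the exact socket to
    # read from, so the queue is purely local to that flow.
    expected[(tag, rank, context)].append(request)

def progress(tag, rank, context):
    # A read on the TRC's socket satisfies the oldest posted receive.
    q = expected[(tag, rank, context)]
    return q.popleft() if q else None

post_recv(7, 1, 0, "req-A")
post_recv(7, 1, 0, "req-B")
assert progress(7, 1, 0) == "req-A"   # FIFO per TRC preserves MPI order
assert progress(7, 1, 0) == "req-B"
assert progress(9, 1, 0) is None      # other TRCs are unaffected
```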
In any design, while it is possible to maintain these queues in the MPI library and attach them to a communicator, their mere existence and necessity shows that progression layer elimination with SCTP using a one-to-many socket cannot push as much functionality down onto the transport as is possible in the socket per TRC design unless the SCTP API provided additional capabilities; these capabilities include the ability to receive messages on a particular association and stream, as well as a means to ask the socket which streams currently have data that can be read without blocking (i.e., something similar to the functionality of select()). Although every sctp_recvmsg() call returns a message, we continue reading from the socket if there are outstanding requests in order to obtain and deliver as many messages as possible. Like TCP, SCTP has a socket option that can be set so that I/O functions called on that socket return immediately when the socket call would block (returning errno EWOULDBLOCK); for non-blocking MPI calls, we can exit the message progression layer when this errno is returned. In addition, the SCTP receive socket buffer size can be set by the user, but all streams and associations share the same buffer, congestion control settings, and flow control settings. To avoid triggering the flow control mechanisms, it is better to empty the buffer as often as possible; this is why we read as much as we can from the socket when we enter the message progression layer. Given no other user input, this design decision is likely the best choice since it progresses all messages that are already sitting at the receiver.

4.2.3 Short/Long Messages

The need to manage unexpected messages implies that we will need a short and long message protocol to avoid resource exhaustion. The first issue that arises is handling messages which exceed SCTP's socket send buffer size.
Messages that exceed this size need to be fragmented and reassembled by the message progression layer. We use a rendezvous mechanism where, once the rendezvous has occurred, the sender divides the MPI message into fragments and the receiver assembles these fragments in the user's receive buffer as specified in the MPI receive call. The sender could loop sending each fragment; however, this does not allow other messages to advance. As a result, we used round robin scheduling of all outgoing SCTP messages and recorded their progress in a structure pointed to from the MPI_Request. This allows MPI messages with different TRCs to progress concurrently on an association between two machines. Special care had to be taken to eliminate race conditions that could occur when MPI messages had the same TRC [28]. The second issue is flow control. As previously mentioned, it is not possible in SCTP to request the next message on a particular stream and association. A reason for not providing this functionality is that streams share the receive socket buffer inside the transport layer. Providing a mechanism for obtaining messages on a particular stream introduces the possibility of messages not being selected and eventually exhausting the socket buffer, resulting in deadlock. The potential for deadlock would be difficult to detect because it depends on the possible orderings of data segments received by the transport layer. The socket per TRC design was able to eliminate the short/long message protocol because it could rely on TCP's flow control mechanism. Although the consequences of closing the advertised receive window make the use of TCP's flow control costly (as discussed in Section 3.3.4), a push back mechanism would give the MPI program control over which message to advance. This would allow users the ability to advance messages on their critical path and potentially improve the overall execution time of the program.
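The round-robin fragment scheduling described above can be sketched as a toy model. The fragment size and the bookkeeping structures are assumed values for illustration; the real RPI records this progress in a structure hung off the MPI_Request:

```python
# Sketch of round-robin progression of fragmented long messages
# (hypothetical structures and a toy fragment size, not the real RPI).
# Each pass emits one fragment per message, so messages with different
# TRCs advance concurrently instead of one long send monopolizing
# the association.
from collections import deque

FRAG = 4  # bytes per fragment (toy value; really tied to buffer size)

def round_robin_send(messages):
    """messages: dict name -> payload. Returns the wire order of
    (name, fragment) pairs."""
    pending = deque((name, 0) for name in messages)
    wire = []
    while pending:
        name, off = pending.popleft()
        data = messages[name]
        wire.append((name, data[off:off + FRAG]))
        if off + FRAG < len(data):
            pending.append((name, off + FRAG))  # rejoin the rotation
    return wire

wire = round_robin_send({"A": "aaaaaaaa", "B": "bbbb"})
# Fragments of A and B interleave instead of A finishing first.
assert wire == [("A", "aaaa"), ("B", "bbbb"), ("A", "aaaa")]
```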
SCTP cannot provide the level of fine-grain flow control that would be necessary to allow the user to push back on a particular stream. SCTP streams, and thus TRCs, share flow control values. As a result, one TRC can affect the performance of another. To try to combat this, it is possible in the SCTP one-to-many socket style to allocate socket buffer space on a per association basis, which would give more control over the flow from a particular machine, but there still is no call provided in the API to tell which associations or streams have data ready to be read.

Chapter 5

Related Work

Computing systems in the Top 500 list [50] increasingly are clusters which use commodity parts such as operating systems and processor hardware. In the past, the list was occupied more by massively parallel processing (MPP) systems, each with a set of customized parts. Using components from various vendors, clusters present a cost effective alternative. For example, the share of clusters in November 2000 was a mere 5%, but it was well over 40% by November 2003. In the same span of time, the share of MPP systems dropped from 69% to 33%. Generally this shows a trend towards commoditization. This thesis explores the scenario where advances in commodity networking are to be made, and what MPI implementation re-designs could be made to better leverage commodity networking. Use of transport protocols atop IP for MPI is quite common; however, each design and implementation is unique to the MPI distribution. For example, the public domain versions of MPI each have their own components and interfaces. LAM defines its main component interface through its System Services Interface (SSI) [45]. The component performing message progression within SSI is the Request Progression Interface (RPI) [44]. A TCP implementation comes with the distribution and, in previous work, we have developed an RPI for SCTP [28].
For MPICH, a TCP component exists within its defined component interface called ADI, or Abstract Device Interface [53]. Under this framework, message progression is achieved through a combination of two layers, an abstract device handling MPI semantics and a lower level channel that acts as a byte mover. Similarly, Open MPI [16], under its MPI Component Architecture (MCA), splits the MPI semantics off into its own layer called the PML (point-to-point management layer). However, underneath this layer it has two other layers for message progression, a simple byte transfer layer (BTL) as well as a BTL management layer (BML) that attempts to stripe data across all underlying interconnects [56, 43]. A TCP implementation of the BTL comes standard with the distribution. Each distribution has made independent component interface design decisions, and the costs of those decisions have been analyzed. In [21], the authors implement MPI over Infiniband at various layers within MPICH2. They found substantial performance benefits when eliminating layers. On the other hand, in [3] the authors look at the component costs of the MCA within Open MPI. In this work, they found that eliminating layers did not result in any substantial performance gain. The reasons why these two papers appear contradictory are yet to be explained and remain a topic for future work. The question remains where features should definitively be placed within a given MPI implementation. One feature that is placed at a variety of levels is the ability to stripe data across multiple networks at once. Within Open MPI, this is handled within the middleware [56] for all interconnects. An MPICH variation called MPICH-VMI [38] discusses the ability to stripe data using TCP when the OS and network both support channel bonding [42].
In our work with MPI over SCTP [28], data striping could occur solely within the transport protocol by utilizing Concurrent Multipath Transfer [24, 23]. Another feature that can be placed at different levels is flow control. In this thesis, we showed a design with a TCP socket per message flow, i.e., per set of messages that share the same tag, rank, and context, or TRC. This effectively means that TCP flow control can throttle an aggressive sender by lowering the advertised window from the receiver. Similar flow control schemes have been implemented within the message progression layer within the middleware. For LAM, the guaranteed envelope resources protocol [8] is implemented in the middleware. This protocol is a flow control mechanism within the middleware that guarantees that only a bounded number of unmatched envelopes can be successfully sent and delivered. Since message ordering is only mandated on a per TRC basis, messages aggregated over a common link can arrive in a variety of orderings. It is the role of expected and unexpected message queues to match and store messages no matter which ordering presents itself. Where expected and unexpected message queues are implemented varies across MPI implementations. In typical implementations [54, 9, 15], these queues are within the middleware. In [34], these queues are implemented within a customized kernel. In Section 3.2, the socket per TRC design pushed the role of the unexpected queue down onto the transport layer socket buffer and pulled a more local expected queue up into the MPI library. The effectiveness of SCTP has been explored for several protocols in high latency, high loss environments such as the Internet or a WAN. Researchers have investigated the use of SCTP in FTP [31], HTTP [40], and also over wireless [12] and satellite networks [1]. Prior to our work in [27] and [28], SCTP for MPI had not been explored.
Given the strengths of SCTP under high loss and high latency, our work with MPI over SCTP would be most relevant to projects operating over WANs that aim to link clusters together into a Cluster-of-Clusters (CoC). The approaches researchers have taken to CoCs can be classified into two paradigms: the MetaMPI approach and the unified stack approach. The MetaMPI approach to CoCs relies on an MPI implementation to be operational for intra-cluster communication and then, for inter-cluster communication, another layer is used as glue. One MetaMPI example is MPICH-G2 [29]. MPICH-G2 is a multi-protocol implementation of MPI for Globus [22]. Globus provides a basic infrastructure for Grid computing and addresses many of the issues with respect to the allocation, access and coordination of resources in a Grid environment. The communication middleware in MPICH-G2 is able to switch between the use of vendor-specific MPI implementations and TCP, which is the protocol used within the glue layer. MPICH-G2 maintains a notion of locality and uses this to optimize for both intra-cluster and inter-cluster communications. For inter-cluster communications, the developers report that MPICH-G2 optimizes TCP, but there is no description of these optimizations and how they might work in general when all the nodes are non-local. Similar to MPICH-G2, PACX-MPI [30] also takes a MetaMPI approach to CoCs, utilizing vendor MPI implementations and using a glue layer between clusters. The unified stack approach is yet another direction researchers have taken for CoCs. Defining a unified stack eliminates the redundancy and potentially high overheads that a MetaMPI glue layer presents. MPICH-VMI [38] takes this approach. MPICH-VMI utilizes the Virtual Machine Interface (VMI) [37], a middleware optimized for communication within heterogeneous networks.
In addition, MPICH-VMI provides profile-guided optimizations where the communication characteristics of an application can be automatically captured and used to intelligently allocate nodes. MPICH-VMI also provides optimizations for communication hierarchies by helping collectives become topologically aware. A similar project that presents a unified stack for CoCs is MPICH/MADIII [2], although it provides neither profile-guided optimizations nor topologically aware collectives. In terms of MPI applications executing over WANs, our SCTP RPI is most similar to the unified stack approach for CoCs because there is no meta-layer required to join heterogeneous systems. Instead of implementing our stack for various interconnects, we rely on IP to provide support for heterogeneous networks, since most interconnects provide implementations of IP. By using SCTP atop IP, we take advantage of SCTP's loss and latency resilience, keeping the MPI implementation thin and to a minimum. One disadvantage of using SCTP for WANs might be the presence of middleboxes like firewalls or boxes performing network address translation (NAT). Even though SCTP is atop IP, some middleboxes are transport (i.e., L4 layer) aware and typically only work with TCP. However, some firewalls, like the IPTables firewall within Linux, support SCTP. Even with full support in middleboxes, the fact that SCTP is multihomed presents some unique challenges. The IETF behave working group [4] is currently researching what SCTP NAT traversal considerations are required; an initial draft [47] has been proposed by Randall Stewart, but at the time of writing of this thesis, it is still in the process of becoming a standard solution.

Chapter 6

Conclusions

In this thesis, we discussed and compared various MPI designs using either TCP or SCTP. The various implementations are summarized in Table 6.1.
The standard LAM-TCP implementation has a message progression layer, whereas the other two designs thin or eliminate this communication component from the MPI middleware. This is done by pushing functionality commonly present in this component down onto a transport protocol. As a result, MPI implementations are simplified and, in addition, they can leverage advances in networking protocol research instead of having to implement certain functionalities in the middleware. Our designs illustrate some limitations of TCP in terms of scalability and of SCTP in terms of missing features. For SCTP, because of the lack of stream-level flow control and the ability to select on a particular stream, it is not possible to completely eliminate the message progression layer with something more scalable than a socket per TRC. However, SCTP scales better than TCP, it avoids the head of line blocking that can occur in standard TCP implementations, and it provides more opportunity for fairer concurrent message transfer in the case of longer messages. MPI designs that make extensive use of standard protocol stacks may provide a solution to interoperability among interconnects and can take advantage of commodity networking as it continues to gain momentum.
Table 6.1: Summary of presented MPI implementations

LAM-TCP:
- Message progression layer: standard; maintains full message progression state
- Message matching and queues: required due to TRC link aggregation and the presence of wildcards
- Long/short protocol: required to avoid resource exhaustion; has user-level flow control
- Scalability: requires a large number of sockets, using costly select()

TCP socket per TRC:
- Message progression layer: eliminated; functionality split between library and transport layer
- Message matching and queues: eliminated; pushes the unexpected queue into transport, but a more local expected queue is required in the library
- Long/short protocol: eliminated; relies on TCP flow control
- Scalability: requires the largest number of sockets, using costly select() for non-blocking calls

SCTP:
- Message progression layer: thinned; requires additional features of the SCTP one-to-many API for full elimination
- Message matching and queues: required; cannot receive from a specific stream in SCTP; avoids the head of line blocking in TCP implementations
- Long/short protocol: required; SCTP has flow control per association and not per stream
- Scalability: requires only a single one-to-many socket, avoiding costly select()

Bibliography

[1] Rumana Alamgir, Mohammed Atiquzzaman, and William Ivancic. Effect of congestion control on the performance of TCP and SCTP over satellite networks. In NASA Earth Science Technology Conference, Pasadena, CA, June 2002.

[2] Olivier Aumage and Guillaume Mercier. MPICH/MADIII: a cluster of clusters enabled MPI implementation. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'03), 2003.

[3] B. Barrett, J. M. Squyres, A. Lumsdaine, R. L. Graham, and G. Bosilca. Analysis of the component architecture overhead in Open MPI. In Proceedings, 12th European PVM/MPI Users' Group Meeting, Sorrento, Italy, September 2005.

[4] IETF behave working group. Working group charter. Available from http://www.ietf.org/html.charters/behave-charter.html.

[5] BProc. Beowulf distributed process space.
Available from http://bproc.sourceforge.net.

[6] Ron Brightwell and Keith Underwood. Evaluation of an eager protocol optimization for MPI. In PVM/MPI, pages 327-334, 2003.

[7] Ron Brightwell, Keith Underwood, and Rolf Riesen. An initial analysis of the impact of overlap and independent progress for MPI. In Proceedings of the 11th European PVM/MPI Users' Group Meeting, September 2004.

[8] Greg Burns and Raja Daoud. Robust message delivery with guaranteed resources. In Proceedings of Message Passing Interface Developer's and User's Conference (MPIDC), May 1995.

[9] Greg Burns, Raja Daoud, and James Vaigl. LAM: An open cluster environment for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, 1994.

[10] Ralph Butler, William Gropp, and Ewing Lusk. Components and interfaces of a process management system for parallel programs. Parallel Computing, 27(11):1417-1429, 2001.

[11] Neal Cardwell, Stefan Savage, and Thomas Anderson. Modeling TCP latency. In INFOCOM, pages 1742-1751, 2000.

[12] Yoonsuk Choi, Kyungshik Lim, Hyun-Kook Kahng, and I. Chong. An experimental performance evaluation of the stream control transmission protocol for transaction processing in wireless networks. In ICOIN, pages 595-603, 2003.

[13] Aaron E. Darling, Lucas Carey, and Wu-chun Feng. The design, implementation, and evaluation of mpiBLAST. In ClusterWorld Conference & Expo and the 4th International Conference on Linux Clusters, 2003.

[14] W. Feng and P. Tinnakornsrisuphap. The failure of TCP in high-performance computational grids. In Supercomputing '00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), page 37, Washington, DC, USA, 2000. IEEE Computer Society.

[15] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L.
Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.

[16] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.

[17] Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag. Comparing and evaluating epoll, select, and poll event mechanisms. In Proceedings of the Linux Symposium, July 2004.

[18] Patricia Gilfeather and Arthur B. Maccabe. Connection-less TCP. In IPDPS, 2005.

[19] Globus. Grid middleware. Available from http://www.globus.org.

[20] Thomas J. Hacker, Brian D. Noble, and Brian D. Athey. Improving throughput and maintaining fairness using parallel TCP. In IEEE INFOCOM, 2004.

[21] Wei Huang, Gopalakrishnan Santhanaraman, Hyun-Wook Jin, and Dhabaleswar K. Panda. Design alternatives and performance trade-offs for implementing MPI-2 over Infiniband. In PVM/MPI, pages 191-199, 2005.

[22] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl Journal of Supercomputer Applications, 11(2):115-128, 1997.

[23] Janardhan R. Iyengar, Paul D. Amer, and Randall Stewart. Retransmission policies for concurrent multipath transfer using SCTP multihoming. In ICON 2004, Singapore, November 2004.

[24] Janardhan R. Iyengar, Keyur C. Shah, Paul D. Amer, and Randall Stewart. Concurrent multipath transfer using SCTP multihoming. In SPECTS 2004, San Jose, July 2004.

[25] Humaira Kamal, Brad Penoff, Mike Tsai, Edith Vong, and Alan Wagner.
Using SCTP to hide latency in MPI programs. In Heterogeneous Computing Workshop: Proceedings of the 2006 IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2006.

[26] Humaira Kamal, Brad Penoff, and Alan Wagner. Evaluating transport level protocols for MPI in the Internet. In International Conference on Communications in Computing (CIC 2005), June 2005.

[27] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP-based middleware for MPI in wide-area networks. In 3rd Annual Conf. on Communication Networks and Services Research (CNSR 2005), pages 157-162, Halifax, May 2005. IEEE Computer Society.

[28] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP versus TCP for MPI. In Supercomputing '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005. IEEE Computer Society.

[29] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2: A grid-enabled implementation of the message passing interface. CoRR, cs.DC/0206040, 2002.

[30] Rainer Keller, Edgar Gabriel, Bettina Krammer, Matthias S. Mueller, and Michael M. Resch. Towards efficient execution of MPI applications on the Grid: Porting and optimization issues. Journal of Grid Computing, 1:133-149, June 2003.

[31] Sourabh Ladha and Paul Amer. Improving multiple file transfers using SCTP multistreaming. In Proceedings IPCCC, April 2004.

[32] Supratik Majumder and Scott Rixner. Comparing Ethernet and Myrinet for MPI communication. In LCR '04: Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, pages 1-7, New York, NY, USA, 2004. ACM Press.

[33] Matt Mathis, John Heffner, and Raghu Reddy. Web100: Extended TCP instrumentation for research, education and diagnosis. SIGCOMM Comput. Commun. Rev., 33(3):69-79, 2003.

[34] M. Matsuda, T. Kudoh, H. Tazuka, and Y. Ishikawa.
The design and implementation of an asynchronous communication mechanism for the MPI communication model. In IEEE Intl. Conf. on Cluster Computing, pages 13-22, Dana Point, CA, Sept 2004.

[35] Alberto Medina, Mark Allman, and Sally Floyd. Measuring the evolution of transport protocols in the Internet, April 2005. To appear in ACM CCR.

[36] OpenPBS. Open-source portable batch system implementation. Available from http://www.openpbs.org.

[37] Scott Pakin and Avneesh Pant. VMI 2.0: A dynamically reconfigurable messaging layer for availability, usability, and management. In The 8th International Symposium on High Performance Computer Architecture (HPCA-8), Workshop on Novel Uses of System Area Networks (SAN-1), Cambridge, Massachusetts, February 2, 2002.

[38] Avneesh Pant and Hassan Jafri. Communicating efficiently on cluster based grids with MPICH-VMI. In Proceedings of 2004 IEEE International Conference on Cluster Computing, San Diego, CA, September 2004. IEEE.

[39] Brad Penoff and Alan Wagner. Towards MPI progression layer elimination with TCP and SCTP. In 11th International Workshop on High-Level Programming Models and Supportive Environments (HIPS 2006). IEEE Computer Society, April 25, 2006.

[40] R. Rajamani, S. Kumar, and N. Gupta. SCTP versus TCP: Comparing the performance of transport protocols for web traffic. Technical report, University of Wisconsin-Madison, May 2002.

[41] S. Fu, M. Atiquzzaman, and W. Ivancic. SCTP over satellite networks. In IEEE Computer Communications Workshop (CCW 2003), pages 112-116, Dana Point, CA, October 2003.

[42] Sourceforge. Linux channel bonding project website. http://sourceforge.net/projects/bonding.

[43] Jeff Squyres. Lead developer of Open MPI, private communication, 2005.

[44] Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Request progression interface (RPI) system services interface (SSI) modules for LAM/MPI.
Technical Report TR579, Indiana University, Computer Science Department, 2003. [45] Jeffrey M . Squyres, Br ian Barrett, and Andrew Lumsdaine. The system services interface (SSI) to L A M / M P I . Technical Report TR575, Indiana University, Computer Science Department, 2003. [46] W . Richard Stevens, B i l l Fenner, and Andrew M . Rudoff. UNIX Network Programming, Vol. 1, Third Edition. Pearson Education, 2003. [47] Randal l Steward and Michael Tuexen. S C T P network address translation In-ternet draft. Available from http://www.ietf.org/internet-drafts/draft-stewart-behave-sctpnat-01.txt. [48] R. Stewart, Q. Xie , K . Morneault, C . Sharp, H . Schwarzbauer, T . Taylor, M . K a l l a I. Ryt ina , L . Zhang, and V . Paxson. The Stream Control Transmission Protocol ( S C T P ) . Available from http://www.ietf.org/rfc/rfc2960.txt, October 2000. [49] Randal l R. Stewart and Qiaobing X ie . Stream control transmission protocol (SCTP): a reference guide. Addison-Wesley Longman Publishing Co. , Inc., 2002. [50] Top500. Top 500 Supercomputing Sites. Available from http://www.top500.org. [51] Dave Turner, Shoba Selvarajan, Xuehua Chen, and Weiyi Chen. The M P _ L i t e message-passing library. In 14th IASTED International Conference on Parallel and Distributed Computing and Systems, Cambridge, Massachusetts, November 2002. 52 [52] W . Gropp and E . Lusk. M P I C H working note: Creating a new M P I C H device using the channel interface. Technical Report A N L / M C S - T M - 2 1 3 , Argonne National Laboratory, July 1996. [53] W . Gropp and E . Lusk. M P I C H working note: The implementation of the second generation M P I C H A D I . Technical Report A N L / M C S - T M - n u m b e r , Argonne National Laboratory, 2005. [54] W . Gropp, E . Lusk, N . Doss and A . Skjellum. High-performance, portable implementation of the M P I message passing interface standard. Parallel Com-puting, 22(6):789-828, September 1996. [55] Interoperable Message Passing Interface website. 
I M P I . ht tp: / / impi.nist .gov/ . [56] T .S . Woodall , R . L . Graham, R . H . Castain, D . J . Daniel, M . W . Sukalski, G . E . Fagg, E . Gabriel, G . Bosilca, T . Angskun, J . J . Dongarra, J . M . Squyres, V . Sa-hay, P. Kambadur, B . Barrett, and A . Lumsdaine. T E G : A high-performance, scalable, multi-network point-to-point communications methodology. In Pro-ceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004. [57] J . Yoakum and L . Ong. A n introduction to the Stream Control Transmission Protocol ( S C T P ) . Available from http://www.ietf.org/rfc/rfc3286.txt, M a y 2002. 53 