UBC Theses and Dissertations


SCTP-based middleware for MPI Kamal, Humaira 2005



SCTP-Based Middleware for MPI

by Humaira Kamal

B.S.E.E. University of Engineering and Technology, Lahore, Pakistan, 1994
M.Sc. Computer Science, Lahore Univ. of Management Sciences, Pakistan, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia
August 2005
© Humaira Kamal, 2005

Abstract

SCTP (Stream Control Transmission Protocol) is a recently standardized transport level protocol with several features that better support the communication requirements of parallel applications; these features are not present in traditional TCP (Transmission Control Protocol). These features make SCTP a good candidate as a transport level protocol for MPI (Message Passing Interface). MPI is a message passing middleware that is widely used to parallelize scientific and compute intensive applications. TCP is often used as the transport protocol for MPI in both local area and wide-area networks. Prior to this work, SCTP has not been used for MPI. In this thesis, we compared and evaluated the benefits of using SCTP instead of TCP as the underlying transport protocol for MPI. We redesigned LAM-MPI, a public domain version of MPI, to use SCTP. We describe the advantages and disadvantages of using SCTP, the necessary modifications to the MPI middleware to use SCTP, and the performance of SCTP as compared to the stock implementation that uses TCP.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Overview
  2.1 SCTP Overview
  2.2 Overview of MPI Middleware
    2.2.1 Message Progression Layer
    2.2.2 Message Delivery Protocol
    2.2.3 Overview of NAS Parallel Benchmarks
3 Evaluating TCP for MPI in WANs
  3.1 TCP Performance Issues
  3.2 Investigating the Use of TCP for MPI in WANs
  3.3 MPI in an Emulated WAN Environment
    3.3.1 Communication Characteristics of the NAS Benchmarks
  3.4 Experiments
    3.4.1 NAS Benchmarks, Loss
    3.4.2 NAS Benchmarks, Latency
    3.4.3 NAS Benchmarks, Nagle's Algorithm and Delayed Acknowledgments
  3.5 Summary of Results
4 SCTP-based Middleware: Design and Implementation
  4.1 Overview of using SCTP for MPI
  4.2 Message Demultiplexing
  4.3 Concurrency and SCTP streams
    4.3.1 Assigning Messages to Streams
    4.3.2 Head-of-Line Blocking
    4.3.3 Example of Head-of-Line Blocking
    4.3.4 Maintaining State
  4.4 Resource Management
  4.5 Race Conditions
    4.5.1 Race Condition in Long Message Protocol
    4.5.2 Race Condition in MPI_Init
  4.6 Reliability
    4.6.1 Multihoming Feature
    4.6.2 Added Protection
    4.6.3 Use of a Single Underlying Protocol
  4.7 Limitations
5 Experimental Evaluation
  5.1 Evaluation Using Benchmark Programs
    5.1.1 MPBench Ping-Pong Test
    5.1.2 Using NAS Benchmarks
  5.2 Evaluation Using a Latency Tolerant Program
    5.2.1 Comparison Using a Real-World Program
    5.2.2 Investigating Head-of-line Blocking
6 Related Work
7 Conclusions and Future Work
Bibliography

List of Tables

3.1 Breakdown of message types for LU classes S, W, A and B using the default LAM settings for TCP
3.2 Breakdown of message types for SP classes S, W, A and B using the default LAM settings for TCP
5.1 The performance of SCTP and TCP using a standard ping-pong test under loss

List of Figures

2.1 Single association with multiple streams
2.2 MPI and LAM envelope format
2.3 Examples of valid and invalid message orderings in MPI
2.4 Message progression in MPI middleware
3.1 NAS benchmarks performance for dataset B under different loss rates using TCP SACK
3.2 NAS benchmarks performance for dataset B with different latencies using TCP SACK
3.3 Performance of LU using different combinations of Nagle's algorithm and delayed acknowledgments
3.4 Comparison of LU and SP benchmarks for different datasets
4.1 Similarities between the message protocol of MPI and SCTP
4.2 Example of head-of-line blocking in LAM TCP module
4.3 Example of multiple streams between two processes using non-blocking communication
4.4 Example of multiple streams between two processes using blocking communication
4.5 Example of long message race condition
4.6 SCTP's four-way handshake
5.1 The performance of SCTP using a standard ping-pong test normalized to TCP under no loss
5.2 TCP versus SCTP for the NAS benchmarks
5.3 TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application
5.4 TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application using Fanout of 10
5.5 Effect of head-of-line Blocking in the Bulk Processor Farm program for (a) short and (b) long messages

Acknowledgements

I would like to acknowledge the invaluable contribution of my supervisor Dr. Alan Wagner who has been a constant source of inspiration for me. He has always been generous with his time and discussions with him have added a great deal to my knowledge. One of the best things about working under his supervision is that he has always encouraged dialogue and provided constructive feedback on our work. I also want to thank Dr. Norm Hutchinson for reading my thesis and providing useful feedback.
I would also like to acknowledge the many useful discussions I have had with my group partner, Brad Penoff, over the course of our project that contributed to our mutual learning. Many thanks to Randall Stewart, one of the primary designers of SCTP, for answering our questions and providing timely fixes to the protocol stack that made it possible to provide the performance evaluation. Thanks to Jeff Squyres, one of the developers of LAM-MPI, for readily answering our many questions about the LAM-MPI implementation.

HUMAIRA KAMAL
The University of British Columbia
August 2005

To my parents, sisters and brother, without whose encouragement and support this would not have been possible.

Chapter 1 Introduction

Increases in network bandwidth and connectivity have spurred interest in distributed systems that can utilize the network and computational resources available in wide-area networks (WANs). The challenge is how to harness the power of the hundreds and thousands of processors that may be available across the network to solve problems faster by executing them in parallel. Introduced about a decade ago, the Message Passing Interface (MPI) [46, 24] has become the de facto standard for constructing parallel applications in distributed environments. The MPI forum defines a library of functions that can be used to create parallel applications which communicate by message passing.
The primary goals of MPI were to provide portability of source code and allow efficient implementations across a range of architectures. MPI facilitates creation of efficient scientific and compute-intensive parallel programs and implementations are available for a wide variety of platforms. Since its inception, MPI has been very popular and virtually every vendor of parallel systems has implemented their own variant [49]. There are also open source, public domain implementations available for execution of MPI programs in local area networks. Notable among them are Local Area Multicomputer (LAM) [22] and MPICH [62].

Initial implementations of MPI were targeted towards tightly coupled cluster-like machines, where reliability and security were not an issue. Implementations of MPI for local area networks widely use TCP as the underlying transport protocol, to provide reliable in-order delivery of messages. TCP was available in the first public domain versions of MPI such as LAM and MPICH, and more recently the use of MPI with TCP has been extended to computing grids [37], wide-area networks, the Internet and meta-computing environments that link together diverse, geographically distributed, computing resources. The main advantage to using an IP-based protocol (i.e., TCP/UDP) for MPI is portability and ease with which it can be used to execute MPI programs in diverse network environments.
One well-known problem with using TCP in wide-area environments is the presence of large latencies and the difficulty in utilizing all of the available bandwidth. MPI processes, in an application, tend to be loosely synchronized, even when they are not directly communicating with each other. If delays occur in one process, they can potentially have a ripple effect on other processes, which can get delayed as well. As a result, performance can be poor especially in the presence of network congestion and the resulting increased latency which can cause processes to stall waiting for the delivery or receipt of a message. Moreover, MPI processes typically establish a large number of TCP connections for communicating with the other processes in the application. This imposes a requirement on the operating system to maintain a large number of socket descriptors.

Although applications sensitive to latency suffer when run over TCP or UDP, there are latency tolerant programs such as those that are embarrassingly parallel, or almost so, that can use an IP-based transport protocol to execute in environments like the Internet. In addition, the dynamics of TCP is an active area of research and there is interest in better models [7] and tools for instrumenting and tuning TCP connections [43]. As well, TCP itself continues to evolve, especially for high performance links, with research into new variants like TCP Vegas [21, 45]. Finally, latency hiding techniques and exploiting trade-offs between bandwidth and latency can further expand the range of MPI applications that may be suitable to execute over IP in both local and wide-area networks. In the end, the ability for MPI programs to execute unchanged in almost any environment is a strong motivation for continued research in IP-based transport protocol support for MPI.

Currently, TCP is widely used as an IP-based transport protocol for MPI.
The TCP protocol, however, is not well matched for MPI applications, especially in wide-area networks, for a number of reasons, which are discussed below:

1. TCP is a stream-based protocol, whereas MPI is message-oriented. This imposes a requirement on the implementation of the MPI library to carry out message framing.

2. The TCP protocol provides strict in-order delivery of all user data transmitted on a connection. MPI applications need reliable message transfer, but not a globally strict sequenced delivery. In Chapter 2 we discuss the details of MPI message matching and how messages with the same tag, rank and context should be delivered in order, while those with different values for these parameters are allowed to overtake each other. The strict in-order delivery of messages by TCP can result in unnecessary head-of-line blocking in case of network loss or delays.

3. The ability to recover quickly from loss in a network is especially desirable in wide-area environments. TCP SACK has been shown to be more robust under loss than other variants of TCP, like TCP Reno, however, SACK is not an integral part of the TCP protocol and is provided as an option in some implementations [19]. The user, thus, has to make sure that the proper TCP SACK implementation is available on all the nodes, in order to take advantage of better performance under loss.

4. TCP is known to be vulnerable to certain denial-of-service attacks like SYN flooding. In open environments, like wide-area networks or the Internet, resilience against such attacks is a concern.

5. TCP has no built-in support for multihomed hosts and in case of network failures, there is no provision for it to failover to an alternate path. Multihomed hosts are increasingly becoming common and the ability of being able to transparently switchover to another path can be especially useful in a wide-area network.
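The message framing requirement in item 1 can be illustrated with a minimal Python sketch. It assumes a 4-byte length prefix, which is a choice made for this illustration only and is not the envelope format used by LAM; the names `frame` and `deframe` are hypothetical.

```python
import struct

# Assumed wire format for this sketch: 4-byte big-endian length, then the body.
HEADER = struct.Struct("!I")

def frame(message: bytes) -> bytes:
    """Prefix a message with its length so it can be recovered from a byte stream."""
    return HEADER.pack(len(message)) + message

def deframe(buffer: bytearray) -> list:
    """Remove and return every complete message at the front of buffer.

    A partially received message (or partial header) is left in the buffer
    until more bytes arrive, which is what a byte stream forces on the reader.
    """
    messages = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer)
        end = HEADER.size + length
        if len(buffer) < end:
            break  # body not fully received yet
        messages.append(bytes(buffer[HEADER.size:end]))
        del buffer[:end]
    return messages
```

A message-oriented transport delivers each message whole, so bookkeeping of this kind would not have to live in the middleware.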
Recently, a new transport protocol called SCTP (Stream Control Transmission Protocol) [59] has been standardized by the IETF. SCTP alleviates many of the problems with TCP. SCTP is message oriented like UDP but has TCP-like connection management, congestion and flow control mechanisms. In SCTP, there is an ability to define streams that allow multiple independent and reliable message subflows inside a single association. This eliminates the head-of-line blocking that can occur in TCP-based middleware for MPI. In addition, SCTP associations and streams closely match the message-ordering semantics of MPI when messages with the same tag, source rank and context are used to define a stream (tag) within an association (source rank).

SCTP includes several other mechanisms that make it an attractive target for MPI in open network environments where secure connection management and congestion control are important. It makes it possible to offload some MPI middleware functionality onto a standardized protocol that will hopefully become universally available. Although new, SCTP is currently available for all major operating systems and is part of the standard Linux kernel distribution.

In this thesis we propose the use of SCTP as a robust transport protocol for MPI in wide-area networks. We first investigate the problems that arise from the use of TCP for MPI in an emulated wide-area network. The objective is to gain a better understanding of the effect TCP has on MPI performance and to learn about what kind of applications are suitable for such environments. We use LAM as the MPI middleware for our experiments and also extensively instrument the LAM TCP module to gain an insight into its working. Our second step is to redesign the LAM-MPI middleware to take advantage of the features SCTP provides. During our design, we investigate the shortcomings of TCP, discussed above, and introduce features using SCTP that alleviate those shortcomings.
One novel advantage to using SCTP that we investigate is the elimination of the head-of-line blocking present in TCP-based MPI middleware. Next, we carry out an extensive evaluation of our implementation and compare its performance with the LAM TCP module. We use standard benchmarks as well as real-world programs to evaluate the performance of SCTP. Our experiments show that SCTP outperforms TCP under loss, when latency tolerant applications are used. The benefits of using SCTP were substantial even when a single stream was used between two endpoints. Use of SCTP's multistreaming feature adds to the overall performance and we evaluate the effects of head-of-line blocking in our applications.

A second indirect contribution of our work is with respect to SCTP. Our MPI middleware makes very aggressive use of SCTP. In using the NAS parallel benchmarks [9] along with programs of our own design, we were able to uncover problems in the FreeBSD implementation of the protocol that led to improvements in the stack by the SCTP developers.

Overall, the use of SCTP makes for a more resilient implementation of MPI that avoids many of the problems present in using TCP, especially in an open wide-area network environment, such as the Internet. Of course, it doesn't eliminate the performance issues of operating in that environment, nor does it eliminate the need for tuning connections. Many of the same issues remain and as mentioned, this is an active area of research in the networking community where numerous variants of TCP have been proposed. Because of the similarity between the two, SCTP will be able to take advantage of improvements to TCP. Although the advantages of SCTP have been investigated in other contexts, this is the first use of it for MPI. SCTP is an attractive replacement for TCP in a wide-area network environment, and adds to the set of IP transport protocols that can be used for MPI.
The release of Open MPI [23], and its ability to mix and match transport mechanisms, is an ideal target of our work where an SCTP module can further extend MPI across networks that require a robust TCP-like connection.

The remainder of this thesis is organized as follows. In Chapter 2, we give a background overview of SCTP, MPI middleware and the NAS benchmarks used in our experiments. Chapter 3 investigates the use of TCP in wide-area environments where bandwidth and latency can vary and discusses the challenges of using TCP for MPI. In Chapter 4, we discuss our motivation for using SCTP for MPI and discuss the design and implementation details. In Chapter 5, we report the results of various experiments carried out to evaluate our SCTP module and provide comparisons with TCP. Research and related work on wide-area computing is discussed in Chapter 6. Chapter 7 presents our conclusions and directions for future work.

This thesis is part of work done jointly with Brad Penoff, a fellow graduate student at UBC. This thesis focuses on the higher level design and implementation issues of our SCTP module. There were several other implementation issues, which are likely to be detailed in Brad Penoff's thesis. Details about design and evaluation of the SCTP module in Chapters 4 and 5 are part of our paper due to appear at the Supercomputing conference in November 2005 [35].

Chapter 2 Overview

2.1 SCTP Overview

SCTP is a general purpose unicast transport protocol for IP network data communications, which has been recently standardized by the IETF [59, 66, 56]. It was initially introduced by a working group, SIGTRAN, of the IETF as a mechanism for transportation of call control signaling over the Internet. Telephony signaling has rigid timing and reliability requirements that must be met by a phone service provider. Investigations by SIGTRAN led them to conclude that TCP is inadequate for transport of telephony signaling due to the following reasons.
Independent messages over a TCP connection can encounter head-of-line blocking due to TCP's order-preserving property, which can cause critical call control timers to expire, leading to setup failures. Also, there is a lack of path-level redundancy support in TCP, which is a limitation compared to the signaling protocol SS7 that is designed with full link-level redundancy. As well, lack of control over the TCP retransmission timers is a shortcoming for applications with rigid timing requirements.

Development of SCTP was directly motivated by the need to find a better transport mechanism for telephony signaling [59]. SCTP has since evolved for more general use to satisfy the needs of applications that require a message-oriented protocol with all the necessary TCP-like mechanisms and additional features not present in either TCP or UDP.

SCTP has been referred to, by its designers, as "Super TCP" and the reason is that SCTP provides sequencing, flow control, reliability and full-duplex data transfer like TCP, however, it provides an enhanced set of capabilities not in TCP that make applications less susceptible to loss. Like UDP, the SCTP data transport service is message oriented and supports the framing of application data. But like TCP, SCTP is session-oriented and communicates by establishing an association between two endpoints. In SCTP, the word association is used instead of a connection between two endpoints because an association has a broader scope than a TCP connection. In an SCTP association, an endpoint can be represented by multiple IP addresses and a port, whereas in TCP, each endpoint is bound to exactly one IP address.

SCTP supports two types of sockets: one-to-one style and one-to-many style sockets. One of the design objectives of SCTP was to allow existing TCP applications to be migrated to SCTP with little effort. The one-to-one SCTP style socket provides this function, and the few changes required to achieve this are as follows.
The sockets used for communication among processes are created using IPPROTO_SCTP rather than IPPROTO_TCP. Socket options like turning on/off Nagle's algorithm are similar to TCP except that socket option SCTP_NODELAY is used instead of TCP_NODELAY. In the one-to-many style a single socket can communicate with multiple SCTP associations similar to a UDP socket that can receive datagrams from different UDP endpoints. An association identifier is used to distinguish an association on the one-to-many socket. This style eliminates the need for maintaining a large number of socket descriptors.

Additionally, SCTP supports multiple logical streams within an association, where each stream has the property of independent, sequenced message delivery. Loss in any of the streams affects only that stream and not others. In an SCTP association, a maximum of 65,536 unidirectional streams can be defined by either end for simultaneous delivery of data.

Multistreaming in SCTP is achieved through an independence between delivery of data and transmission of data. An SCTP packet has one common header followed by one or more "chunks". There are two types of chunks, control and data, and each chunk is fully self-descriptive. Control chunks carry information that is used for maintenance of an SCTP association, while user messages are carried in data chunks. Each data chunk is assigned two sets of sequence numbers: a transport sequence number TSN that regulates the transmission of messages and the detection of message loss, and a stream identifier SNo and stream sequence number SSN pair, which is used to determine the sequence of delivery of received data. The receiver can thus use this mechanism to determine when a gap occurs in the transmission sequence, for example, due to loss, and which stream it affects.
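The interplay of TSN, SNo and SSN described above can be sketched as a small receiver model in Python. This is an illustration under simplifying assumptions, not the behaviour of any real SCTP stack: every chunk is assumed to carry one whole message (no fragmentation), SSNs are assumed to start at 0 on each stream, and the function name `receive` is hypothetical.

```python
from collections import defaultdict

def receive(chunks):
    """Model per-stream ordered delivery.

    chunks: iterable of (tsn, sno, ssn, msg) tuples in arrival order.
    Returns (delivered, missing_tsns): the messages handed to the application
    in delivery order, and the TSNs absent from the observed TSN range.
    """
    next_ssn = defaultdict(int)   # next SSN expected on each stream
    pending = defaultdict(dict)   # sno -> {ssn: msg} held back for ordering
    seen_tsns = set()
    delivered = []
    for tsn, sno, ssn, msg in chunks:
        seen_tsns.add(tsn)
        pending[sno][ssn] = msg
        # Deliver as far as this stream's sequence allows; other streams
        # are not consulted, so a gap elsewhere cannot block this one.
        while next_ssn[sno] in pending[sno]:
            delivered.append(pending[sno].pop(next_ssn[sno]))
            next_ssn[sno] += 1
    missing = []
    if seen_tsns:
        full_range = set(range(min(seen_tsns), max(seen_tsns) + 1))
        missing = sorted(full_range - seen_tsns)
    return delivered, missing

# TSN 2 (stream 1, SSN 0) is lost: stream 0 is delivered in full, while the
# later message on stream 1 is held back until the gap is repaired.
arrivals = [(1, 0, 0, "A0"), (3, 0, 1, "A1"), (4, 1, 1, "B1")]
print(receive(arrivals))
```

In the usage example the TSN gap identifies the loss, and the SNo/SSN pair confines its effect to stream 1, which is the property the paragraph above describes.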
Depending on the size of a user message and the path maximum transmission unit (PMTU), SCTP may fragment the message into one or more data chunks; however, a single data chunk cannot carry more than one user message. In order to handle the case in which the user message is fragmented, the SSN remains the same on all the data chunks containing the fragments, while consecutive TSNs are assigned to the individual fragments in the order of the fragmentation. An SCTP sender uses a round robin scheme to transmit user messages with different stream identifiers to the receiver, so that none of the streams are starved [57].

Figure 2.1 shows an association between two endpoints with three streams identified as 0, 1 and 2. As Figure 2.1 shows, SCTP sequences data chunks and not bytes as in TCP. Together SNo, SSN, and TSN are used to assemble messages and guarantee the ordered delivery of messages within a stream but not between streams. For example, receiver Y in Figure 2.1 can deliver Msg2 before Msg1 should it happen to arrive at Y before Msg1.

[Figure 2.1: Single association with multiple streams. Logical view of multiple streams in an association between Endpoint X and Endpoint Y, each with send and receive sides for Streams 0, 1 and 2. User messages Msg1 to Msg5 are fragmented into data chunks labeled with TSN (Transmission Sequence Number), SSN (Stream Sequence Number) and SNo (Message Stream Number), placed on the control chunk and data chunk queues of the SCTP layer, and bundled into SCTP packets passed to the IP layer.]

In contrast, when a TCP source sends independent messages to the same receiver at the same time, it has to open multiple independent TCP connections. It is possible to have multiple streams by having parallel TCP connections and parallel connections can also improve throughput in congested and uncongested links. There are, however, fairness issues as, in a congested network, parallel connections claim more than their fair share of the bandwidth thereby affecting the cross-traffic [30]. Also, the sender and receiver have to fragment and multiplex data over each of the parallel TCP connections. One approach to making parallel connections TCP-friendly is to couple them all to a single connection [25]; SCTP does precisely this by ensuring that all the streams within a single SCTP association share a common set of congestion control parameters. It obtains the benefits of parallel TCP connections while keeping the protocol TCP-friendly [53]. SCTP's property of being TCP-friendly promotes its possible wide deployment in the future.

Both the SCTP and TCP protocols use a similar congestion control mechanism. An SCTP sender maintains a congestion window cwnd variable for each destination address; however, there is only one receive window rwnd variable for the entire association. In today's networks congestion control is critical to any transport protocol and Atiquzzaman et al. [53] have shown that SCTP's congestion control mechanism has improvements over TCP that allow it to achieve larger average congestion window size and faster recovery of segment loss, which results in higher throughput.
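The split between a per-destination cwnd and a single association-wide rwnd can be pictured with the following Python sketch. Only that split comes from the text above; the class name `Association`, its methods, and the rule used in `sendable` are assumptions made for illustration and deliberately simplify what a real SCTP sender does.

```python
from dataclasses import dataclass, field

@dataclass
class Association:
    rwnd: int                                         # one receive window for the whole association
    cwnd: dict = field(default_factory=dict)          # destination address -> congestion window (bytes)
    outstanding: dict = field(default_factory=dict)   # destination address -> unacknowledged bytes

    def add_destination(self, addr, initial_cwnd):
        """Register a peer address with its own congestion window."""
        self.cwnd[addr] = initial_cwnd
        self.outstanding[addr] = 0

    def sendable(self, addr):
        """Bytes that may be sent to addr now, under this simplified model:
        limited by that path's remaining cwnd and by the shared rwnd."""
        room = self.cwnd[addr] - self.outstanding[addr]
        return max(0, min(room, self.rwnd))

assoc = Association(rwnd=3000)
assoc.add_destination("10.0.0.1", 4380)   # example addresses and sizes, chosen arbitrarily
assoc.add_destination("10.0.1.1", 1500)
assoc.outstanding["10.0.0.1"] = 4000
print(assoc.sendable("10.0.0.1"), assoc.sendable("10.0.1.1"))
```

Keeping congestion state per path while sharing one receive window is what lets all streams of an association behave as a single flow toward the network.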
Another difference between SCTP and TCP is that endpoints in SCTP can be multihomed, that is, bound to multiple IP addresses (i.e., interfaces). If a peer is multihomed, then an SCTP endpoint will select one of the peer's destination addresses as a primary address and all the other addresses of the peer become alternate addresses. During normal operation, all data is sent to the primary address, with the exception of retransmissions, for which one of the active alternate addresses is selected. Recent work on the retransmission policies of protocols that support multihoming, like SCTP, proposes a policy that sends fast retransmissions to the primary address, and timeout retransmissions to an alternate peer IP address [31]. The authors show that sending all retransmissions to an alternate address is useful if the primary address has become unreachable, and can result in performance degradation otherwise.

When a destination address is determined to be down, SCTP's multihoming feature can provide fast path failover for an association. SCTP currently does not support the simultaneous transfer of data across interfaces, but this will likely change in future. Currently, researchers at the University of Delaware [29, 28] are investigating the use of SCTP's multihoming feature to provide simultaneous transfer of data between two endpoints through two or more end-to-end paths, and this functionality may become part of the SCTP protocol.

Another factor that may contribute to rapid and widespread deployment of SCTP is the availability of utilities like withsctp [42] and an SCTP shim translation layer [3] that can transparently translate calls to the TCP API into equivalent SCTP calls. The withsctp utility intercepts calls to the TCP socket at the application level and converts them to SCTP calls.
The SCTP shim translation layer works at the kernel level and allows a hybrid approach, in which it first tries to establish an SCTP association and, failing that, sets up a TCP connection.

2.2 Overview of MPI Middleware

MPI has a rich variety of message passing routines. These include MPI_Send and MPI_Recv along with various combinations such as blocking, non-blocking, synchronous, asynchronous, buffered and unbuffered versions of these calls. Processes exchange messages by using these routines and the middleware is responsible for progressing all the send and receive requests from their initiation to completion. As an example, the format of the basic blocking and non-blocking send and receive calls is as follows:

Blocking MPI_Send and MPI_Recv calls:
MPI_Send(msg, count, type, dest_rank, tag, context)
MPI_Recv(msg, count, type, source_rank, tag, context)

Non-blocking MPI_Isend and MPI_Irecv calls:
MPI_Isend(msg, count, type, dest_rank, tag, context, request_handle)
MPI_Irecv(msg, count, type, source_rank, tag, context, request_handle)

In non-blocking calls, the request_handle is returned, which can be used to check, later in the program execution, if the send/receive operation has completed or not.

An MPI message consists of a message envelope and payload as shown in Figure 2.2. The matching of a send request to its corresponding receive request, for example, an MPI_Send to its MPI_Recv, is based on three values inside the message envelope: (i) context, (ii) source/destination rank and (iii) tag.

[Figure 2.2: MPI and LAM envelope format. Format of MPI message: an envelope (context, rank, tag) followed by the payload. Structure of envelope in LAM: message length, tag, flags, rank, context, sequence number.]

Context identifies a set of processes that can communicate with each other. Within a context, each process is uniquely identified by its rank, an integer from zero to one less than the number of processes in the context.
A user can specify the type of information being carried by a message by assigning a tag to it. MPI also allows the source and/or tag of a message to be a wildcard in a receive request. For example, if MPI_ANY_SOURCE is used then that specifies that the process is ready to receive messages from any sending process. Similarly, if MPI_ANY_TAG is specified, a message arriving with any tag would be received.

In addition to their use in message matching, the tag, rank and context (TRC) define the order of delivery of MPI messages. MPI semantics require that any messages sent with the same message TRC value should be delivered in order, i.e., they may not overtake each other, whereas MPI messages with different TRCs are not required to be ordered. As seen in Figure 2.3, if messages between a pair of processes are delivered strictly in the order in which they are sent, then this is perfectly permissible according to the MPI semantics, however, this ordering is only a subset of the legitimate message orderings allowed by MPI. Out of order messages with the same tags, however, violate MPI semantics.

[Figure 2.3: Examples of valid and invalid message orderings in MPI. In each panel Process X issues MPI_Send(Msg_1, Tag_A), MPI_Send(Msg_2, Tag_B) and MPI_Send(Msg_3, Tag_A), and Process Y issues MPI_Irecv(..ANY_TAG..). Out of order messages with same tags violate MPI semantics.]

2.2.1 Message Progression Layer

Implementations of the MPI standard, typically, provide a request progression module that is responsible for progressing requests from initialization to completion. One of the main functions of this module is to support asynchronous and non-blocking message progression in MPI.
The request progression module must maintain state information for all requests, including how far the request has progressed and what type of data/event is required for it to reach completion.

There are two approaches to implementing the MPI library: single-threaded and multithreaded implementations. Multithreaded implementations [6, 14, 50] use a separate communication thread for making progress asynchronously and provide increased overlap of communication and computation. Multithreaded implementations, however, have additional overheads resulting from synchronization and from cache and context switching, which single-threaded implementations, like LAM and MPICH, avoid. Typically, single-threaded implementations require special design considerations since requests are only progressed by the MPI middleware when the user makes calls to it.

LAM's request progression mechanism is representative of the way requests are handled in other MPI middleware implementations and, due to its modular design, provided a convenient platform for us to work in. LAM provides a TCP Request Progression Interface (RPI) module that is used to perform point-to-point message passing using TCP as its underlying transport protocol. Requests in LAM go through four states: init, start, active and done, and the RPI module is responsible for progressing all requests through their life cycle. We redesigned LAM's request progression interface layer module for TCP to use SCTP. Although LAM's message passing engine is single-threaded, our design is not limited to single-threaded implementations. Of course, in a corresponding multithreaded implementation some details will need to change to handle issues related to synchronization and resource contention. In the next section we describe the different types of messages and the message delivery protocols as implemented in LAM.
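The four-state request life cycle described above can be pictured as a small state machine that is advanced only when the application enters the library. The following is a minimal Python sketch and not LAM's code: the Request class, the progress_all function and the rule of one transition per call are hypothetical simplifications introduced here for illustration. Only the state names init, start, active and done come from the text.

```python
from enum import Enum


class State(Enum):
    INIT = 0
    START = 1
    ACTIVE = 2
    DONE = 3


class Request:
    """Hypothetical stand-in for a send or receive request."""

    def __init__(self, name):
        self.name = name
        self.state = State.INIT

    def advance(self):
        # Move one step along init -> start -> active -> done.
        # A request that is already done stays done.
        if self.state is not State.DONE:
            self.state = State(self.state.value + 1)
        return self.state


def progress_all(requests):
    """Advance every unfinished request once; return how many remain pending.

    In a single-threaded middleware this would run only inside a library
    call made by the application (an assumption of this sketch).
    """
    for req in requests:
        req.advance()
    return sum(1 for req in requests if req.state is not State.DONE)
```

In a real module a transition would depend on a transport event, such as a completed write or an arriving envelope, rather than on a call count. The sketch only shows that the middleware, not the application, owns the state of every outstanding request.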
2.2.2 Message Delivery Protocol

As shown in Figure 2.4, when an incoming message is received at the transport layer, it is first checked against a list of posted receive requests. If a match is found, the message is dispatched to the correct receive request buffer. If a matching receive was not issued at the time of message arrival, the message is placed in an unexpected message queue (implemented in LAM as an internal hash table). Similarly, when a new receive request is posted, it is first checked against the unexpected message queue; if a match is found the request is advanced, otherwise it is placed in the receive request queue.

[Figure 2.4: Message progression in MPI middleware. Application layer: receive request is issued. Transport layer: incoming message is received.]

In LAM, each message body is preceded by an envelope that is used to match a message. We can broadly classify the messages in three categories with respect to the way LAM treats them internally:

1. Short messages, which are, by default, of size 64 Kbytes or less.
2. Long messages, which are of size greater than 64 Kbytes.
3. Synchronous short messages, which are also of size 64 Kbytes or less, but require a different message delivery protocol than short messages, as discussed below.

Figure 2.2 shows the format of an envelope in LAM. The flag field in the envelope indicates what type of message body follows it.

Short messages are passed using eager-send and the message body immediately follows the envelope. If the receiving process has posted/issued a matching receive buffer, the message is received and copied into the buffer and the request is marked as done. If no matching receive has been posted, then this is treated as an unexpected message and buffered. When a matching receive is later posted, the message is copied into the request receive buffer and the corresponding buffered message is discarded.
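The interplay between posted receives and the unexpected message queue for eagerly sent short messages can be sketched as follows. This is an illustrative Python model under stated assumptions, not LAM's implementation: the names Envelope, Receive, Matcher and ANY are hypothetical, plain lists are used where LAM uses an internal hash table, and only the three matching fields of the envelope (context, rank, tag) are modelled.

```python
from dataclasses import dataclass

ANY = -1  # hypothetical wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


@dataclass
class Envelope:
    context: int
    rank: int  # rank of the sending process
    tag: int


@dataclass
class Receive:
    context: int
    rank: int  # wanted source rank, or ANY
    tag: int   # wanted tag, or ANY
    data: object = None
    done: bool = False


def matches(recv, env):
    """A receive matches on context, source rank and tag; rank and tag may be wildcards."""
    return (recv.context == env.context
            and recv.rank in (ANY, env.rank)
            and recv.tag in (ANY, env.tag))


class Matcher:
    def __init__(self):
        self.posted = []      # receives still waiting for a message
        self.unexpected = []  # (envelope, body) pairs that arrived early

    def on_message(self, env, body):
        """A short (eager) message has arrived from the transport layer."""
        for i, recv in enumerate(self.posted):  # oldest receive first
            if matches(recv, env):
                del self.posted[i]
                recv.data, recv.done = body, True
                return recv
        self.unexpected.append((env, body))     # buffer until a receive is posted
        return None

    def post_receive(self, recv):
        """The application has posted a receive request."""
        for i, (env, body) in enumerate(self.unexpected):  # oldest message first
            if matches(recv, env):
                del self.unexpected[i]           # buffered copy is discarded
                recv.data, recv.done = body, True
                return recv
        self.posted.append(recv)
        return recv
```

Scanning both lists from the oldest entry means that, in this sketch, two buffered messages with the same tag, rank and context are handed to receives in arrival order, which is the non-overtaking behaviour that MPI requires.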
Long messages are handled differently than short messages: they are not sent eagerly but are instead sent using the following rendezvous scheme. Initially only the envelope of the long message is sent to the receiver. If the receiver has posted a matching receive request, it sends back an acknowledgment (ACK) to the sender to indicate that it is ready to receive the message body. In response to the ACK, the sender sends an envelope followed by the long message body. If no matching receive was posted at the time the initial long envelope was received, it is treated as an unexpected message. Later, when a matching receive request is posted, the receiver sends back an ACK and the rendezvous proceeds as above. The use of eager send for long messages is a topic of ongoing research; an eager protocol for long messages has been found to outperform a rendezvous protocol only if a significant number of receives have been pre-posted [4]. In [65] the authors report that protocols like eager send can lead to resource exhaustion in large cluster environments.

Synchronous short messages are also communicated using eager-send; however, the send is not complete until the sender receives an ACK from the receiver. The discussion above about the handling of unexpected messages also applies here.

In addition to point-to-point communication functions, MPI has a number of routines for collective operations. In LAM, these routines are implemented on top of point-to-point communication.

2.2.3 Overview of NAS Parallel Benchmarks

In Chapters 3 and 5, we describe several experiments conducted using the Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [9]. These benchmarks approximate the performance that can be expected from a portable parallel application. A brief overview of these benchmarks is included in this section.
The NAS parallel benchmarks were developed by the NASA Ames Research Centre and are a set of eight programs derived from computational fluid dynamics code. NPB was used in our experiments because it approximates real-world parallel scientific applications better than most other available benchmarks. The eight programs are: LU (Lower-upper symmetric Gauss-Seidel), IS (Integer Sort), MG (Multi-grid method for Poisson equation), EP (Embarrassingly Parallel), CG (Conjugate Gradient), SP (Scalar Pentadiagonal Systems), BT (Block Tridiagonal Systems) and FT (Fast Fourier Transform for Laplace equation). Each program has several problem sizes (S, W, A, B, C and D) ranging from small to very large.

Communication characteristics of these benchmarks are described in [20], which shows that all of the benchmarks use collective communication. With the exception of the EP benchmark, all the benchmarks use point-to-point communication as well. EP uses only MPI_Allreduce and MPI_Barrier and has very few communication calls. The number of communication calls in the remaining benchmarks ranges from 1000 to 50,000 for dataset B. The communication volume ranges from less than 12 Kbytes for EP to several gigabytes of data for the other benchmarks.

Chapter 3

Evaluating TCP for MPI in WANs

In this chapter we evaluate the use of TCP for MPI in wide-area networks and Internet-like environments where bandwidth and latency can vary. Our objective is to gain a better understanding of the impact of IP protocols on MPI performance. This understanding will improve our ability to take advantage of wide-area networks for MPI and may suggest techniques to make MPI programs less sensitive to latency, so that they can execute in Internet-like environments. Not all programs will be suitable for these environments; however, for those programs that are, it is important to understand the limitations and to continue to take advantage of IP transport protocols.
We performed our experiments in an emulated wide-area network using the NAS parallel benchmarks to evaluate TCP in this environment. Our experiments concentrated on latency and the effect of packet delay and loss on performance, and we measured their effect using two widely available variants of TCP, namely New-Reno and SACK. We also evaluated these two versions with respect to the TCP_NODELAY option, which disables Nagle's algorithm, and the effect of delayed acknowledgments. In this chapter we discuss a number of challenges for using TCP and MPI in a wide-area network. Our focus is on performance issues rather than issues pertaining to management and access.

3.1 TCP Performance Issues

For improving TCP's performance in high bandwidth-delay environments, techniques like tuning of TCP buffers and using parallel streams are well known in the networking community. Work done in [15] implements a TCP tuning daemon that uses network metrics to transparently tune TCP parameters for individual flows over designated paths. The daemon decides on appropriate buffer sizes to use on a per-flow basis and, in case of packet loss, attempts to speed up recovery by modifying TCP congestion control parameters, disabling delayed acknowledgments or using parallel streams. Their results show that their tuning techniques achieve improvement for bulk transfers over standard TCP. Other work on TCP shows that the common practice of setting the socket buffer size based on the definition of bandwidth-delay product (BDP) as a function of transmission capacity and round-trip time can lead to suboptimal throughput [30]. The authors show that the socket buffer size should be set to the BDP only in paths that do not carry cross traffic and do not introduce queuing delays. They also present an application level mechanism that automatically adjusts the socket buffer size to near its optimal value.
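To make the bandwidth-delay product concrete, the sketch below computes it and asks the operating system for socket buffers of that size using Python's standard socket module. The function names bdp_bytes and request_buffer_sizes are hypothetical, and the example figures in the note that follows (a 100 Mbit/s path with a 100 ms round-trip time) are assumed values chosen for illustration, not measurements from this work.

```python
import socket


def bdp_bytes(capacity_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: capacity times round-trip time.

    Integer arithmetic: bits/s * ms / 1000 gives bits, and / 8 gives bytes.
    """
    return capacity_bits_per_s * rtt_ms // 8000


def request_buffer_sizes(sock, nbytes):
    """Ask the OS for send and receive buffers of nbytes and report the result.

    The kernel may clamp or adjust the request, so the sizes are read back
    with getsockopt rather than assumed to equal nbytes.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```

For the assumed path, bdp_bytes(100_000_000, 100) gives 1,250,000 bytes. As noted above, [30] argues that this value is an appropriate buffer size only on paths without cross traffic or queuing delays, so it is best read as a starting point rather than a recommended setting.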
Work in [11, 21, 64] focuses on TCP's role in high performance computing and identifies both implementation-related and protocol-related problems in TCP. Some of the issues that these papers look at are:

1. Operating system bottlenecks due to increasing differences between networking speeds and processor and bus speeds. The work suggests the use of OS-bypass or interrupt coalescing to help TCP deal with the OS's interrupt processing overhead, and the use of a larger effective MTU size, e.g., through the use of jumbograms.
2. Comparison of rate-based versions like TCP Vegas with TCP Reno on high performance links. Their results show that, with proper selection of Vegas parameters, TCP Vegas performs better than TCP Reno for real traffic distributions.

There are also approaches that use other transport protocols like UDP and are aggressive about acquiring network capacity. However, these techniques raise concerns about fairness when competing with other traffic in the network. In [13] a new communication level protocol is proposed that uses a single UDP stream for high-performance data transfers. The protocol implements a communication window that spans the entire data buffer at the user level and a selective acknowledgment window that spans the entire data transfer. In addition, it uses a user-defined acknowledgment frequency. The protocol achieves increased throughput over both short and long haul high-performance networks; however, the paper does not discuss any fairness issues or the effect of the protocol on cross traffic.

Our work does not investigate buffer tuning, because our focus is on the effect of packet delay and loss on performance. Buffer tuning is certainly important for large messages and it raises interesting questions for LAM, which sets up a fully connected environment. During MPI_INIT connections are set up between each pair of nodes and the socket buffer size is set to the same value for all connections.
As shown with MPICH-G2 [37], it is possible to tune TCP parameters for individual connections and this is one area for future exploration.

Given our focus on latency, we investigated the effect of Nagle's algorithm and delayed acknowledgments on the benchmarks. In TCP, Nagle's algorithm holds back data in the TCP buffer until a sufficiently large segment can be sent. Delayed acknowledgments do not ACK each segment received but wait for (a) data on the reverse stream, (b) two segments, or (c) a 100 to 200 millisecond timer to go off, before an ACK is sent. Both Nagle's algorithm and delayed acknowledgments optimize bandwidth utilization and normally both are active. In LAM/MPI, Nagle's algorithm is disabled (i.e., TCP_NODELAY is set) to reduce message latency. Ideally, to achieve minimum latency, Nagle's algorithm and delayed acknowledgments should both be disabled. We investigated using different combinations of these two parameters. These parameters are not independent, as reported in [47], which proposes a modification to Nagle's algorithm that can improve latency and also protect against transmission of an excessive number of short packets.

Another factor that can impact performance is the difference between various TCP stacks. It has been shown that Apache web servers, which have the TCP_NODELAY option set on all their sockets, perform worse on Linux than on FreeBSD. FreeBSD derived stacks have byte-oriented ACKs, whereas Linux uses a packet-oriented ACK. Linux asks for an acknowledgment after the first packet, while in FreeBSD the total size and not the number of packets matters, and it asks for an acknowledgment after several packets [60]. This enhancement is also part of SCTP's congestion control mechanism. In SCTP, the congestion window is increased based on the number of bytes acknowledged, not the number of acknowledgments received [1].
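Disabling Nagle's algorithm, as discussed above, corresponds to setting the TCP_NODELAY socket option on each connection. The sketch below shows the option being set and read back with Python's standard socket module. The helper names make_low_latency_socket and nagle_disabled are hypothetical and are not taken from LAM. Delayed acknowledgments are controlled by operating-system specific means and are not shown.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY set)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock):
    """Return True if TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

A freshly created TCP socket normally has the option cleared, so a middleware that wants small messages sent immediately has to set it explicitly on every connection it opens.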
3.2 Investigating the Use of TCP for MPI in WANs

For evaluating the use of TCP for MPI, we used the NAS benchmarks NPB-MPI version 3.2 and LAM version 7.0.6 as the MPI middleware.

As a proof of concept and the first step in understanding the problems of using TCP for MPI in environments with high latency and loss, we explored different environments for our experiments. Our first attempt was to use PlanetLab [51], which provides a world-wide shared distributed compute environment. We attempted to run MPI programs on a set of machines in North America and Asia. Although they did run, as expected we found a huge variation in times due to network and processing delays. Moreover, the timing information from PlanetLab was not reliable and the results obtained were not consistent. We also investigated the use of EmuLab [16] as an experimental testbed for this work. However, the time required to run the NAS benchmarks was often several days; therefore, we decided in favor of using a dedicated collection of machines and created an emulated wide-area network environment that uses Dummynet to vary packet loss and packet latency.

3.3 MPI in an Emulated WAN Environment

Our experimental setup consisted of four FreeBSD-5.3 nodes connected together via a layer-two switch using 100 Mbit/s Ethernet connections. Two of the nodes used were Pentium-4 2.8GHz, one Pentium III 558MHz and one Pentium II 400MHz. These nodes represented a mix of fast and slow machines that introduced some measure of heterogeneity in our network, which is something that would be expected while executing MPI programs over WANs. We experimented with six of the eight NAS benchmarks; we did not use the FT benchmark because it required Fortran 90 support and we did not use BT because a single run took several days to complete on our setup. We investigated both small (S, W) and large problem sizes (B).
Dummynet was configured on each of the nodes to allow us to vary the amount of loss and latency on the links between the nodes. TCP is used as the underlying transport mechanism in LAM and our goal was to determine how it would perform under varying loss rates and latencies similar to those encountered in WANs. We experimented with latencies between 0 and 50 milliseconds and packet loss rates between 0% and 10%.

The two variants we investigated were New-Reno and SACK. As reported by Floyd, New-Reno is widely deployed and is used in 77% of the web servers that could be classified [45]. Since we were investigating loss, we also chose TCP SACK. It is well known that TCP SACK performs better than Reno under loss conditions, where TCP SACK's selective acknowledgment can avoid some retransmissions. Finally, we experimented with the socket settings for TCP_NODELAY and delayed acknowledgments.

3.3.1 Communication Characteristics of the NAS Benchmarks

The communication characteristics of the NAS parallel benchmarks were discussed in Section 2.2.3 and it was noted that, with the exception of the EP benchmark, the benchmarks use both collective and point-to-point communication. EP uses very few communication calls; in the remaining benchmarks, however, the communication calls range from 1000 to 50,000 for dataset B. Tables in [20] show that the communication volume ranges from less than 12 Kbytes for EP to several gigabytes of data for the other benchmarks.

As discussed in Section 2.2.2, MPI middleware keeps track of two types of messages: expected messages and unexpected messages. There is slightly more overhead for processing unexpected messages; however, a large number of expected messages indicates that the application may be stalled waiting for a message to arrive.
We instrumented the LAM code to determine the number of expected and unexpected short and long messages generated during the execution of the benchmarks. A breakdown of different message types for classes S, W, A and B is shown in Table 3.1 for the LU benchmark and in Table 3.2 for the SP benchmark.

    Message Type       Expected            Unexpected
    Dataset          Long     Short       Long     Short
    S                   0     1,097          0         0
    W                   0    18,978          0         0
    A                 300    31,145        208         0
    B                 254    50,221        251         0

Table 3.1: Breakdown of message types for LU classes S, W, A and B using the default LAM settings for TCP

    Message Type       Expected            Unexpected
    Dataset          Long     Short       Long     Short
    S                   0     1,216          0         0
    W                   0     4,422          0         0
    A               2,807         9      2,011         0
    B               1,314         9      3,504         0

Table 3.2: Breakdown of message types for SP classes S, W, A and B using the default LAM settings for TCP

LU and SP benchmarks are two of the most communication intensive benchmarks in the set and they were chosen because of the different characteristics of their communication calls and the difference in the type of messages exchanged during their execution. LU uses only blocking send calls (MPI_Send) apart from collective communications, whereas SP uses non-blocking MPI send calls (MPI_Isend). For larger datasets like A and B, LU mostly communicates using short messages, whereas short messages in SP are almost non-existent compared to long messages.

As mentioned, we used a heterogeneous environment with two faster machines and two slower machines. The numbers in these tables are for the two slower machines. For the faster machines, a greater proportion of long messages were expected messages. These numbers demonstrate some of the problems in understanding the performance of these benchmarks, especially in a heterogeneous environment. The fact that the vast majority of messages are expected indicates that the application is very likely spending time waiting for messages to arrive. This may indicate the benchmark is sensitive to message latency.
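The proportions discussed above can be checked directly from the table rows. The short Python sketch below does that arithmetic for the class B rows. The function name shares and the tuple layout are introduced here for illustration, and the numbers are copied from Tables 3.1 and 3.2.

```python
def shares(expected_long, expected_short, unexpected_long, unexpected_short):
    """Return (long_share, expected_share) as fractions of all messages in a row."""
    total = expected_long + expected_short + unexpected_long + unexpected_short
    long_share = (expected_long + unexpected_long) / total
    expected_share = (expected_long + expected_short) / total
    return long_share, expected_share


# Class B rows copied from Table 3.1 (LU) and Table 3.2 (SP):
# (expected long, expected short, unexpected long, unexpected short)
LU_B = (254, 50221, 251, 0)
SP_B = (1314, 9, 3504, 0)
```

With these rows, roughly 1% of LU's class B messages are long, against more than 99% for SP, which is the contrast between the two benchmarks drawn above. For LU class B, more than 99% of the messages are expected.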
3.4 Experiments

In the following sections we describe the results of our experiments with varying loss and delay for six of the eight NAS benchmarks for different problem sizes.

3.4.1 NAS Benchmarks, Loss

Experiments were performed with the NAS benchmarks where Dummynet was used to introduce random packet loss of 1%, 2% and 10% between each of the four nodes. For the baseline case, the experiments were also run when there was no loss on the links. The benchmarks report performance by three measurements: the total amount of time taken, Mops/s total value and Mops/s/process. The value Mops/s total value is linearly related to total time.

Figure 3.1 shows the performance of TCP SACK for dataset B under different loss rates. Performance of TCP New-Reno showed a similar trend, therefore, we did not include its graph.

[Figure 3.1: NAS benchmarks performance for dataset B under different loss rates using TCP SACK. Panel title: Variation over Different Loss Rates for TCP SACK (Nagle's Algorithm Disabled). Series: Loss=0%, Loss=1%, Loss=2%, Loss=10%.]

As expected the performance falls with increasing loss. The degradation in performance is more marked for the LU benchmark than others. The LU benchmark contains a large number of neighbor to neighbor MPI_SEND calls which are for the most part short and expected messages.
As a result, LU is sensitive to the large delays that occur when packets are lost and need to be retransmitted. The only benchmark not affected by loss is EP, which has only a few collective communication calls. The reduction in bandwidth due to TCP's congestion control mechanism does not seem to be affecting performance.

Our results for 0%, 1% and 2% on all benchmarks showed that TCP SACK always performed better than or equal to TCP New-Reno for all six NAS measurements. Our results are consistent with the results in the literature that show that TCP SACK under loss performs better than TCP New-Reno. As discussed in Fall and Floyd's paper [19], New-Reno does not perform as well as TCP SACK when multiple packets are dropped from a window of data. During Fast Recovery, New-Reno is limited to retransmitting at most one dropped packet per round trip time, which affects its performance.

3.4.2 NAS Benchmarks, Latency

In order to study network latencies and their effect on MPI applications we ran the NAS benchmarks with 10 ms and 50 ms latencies. The increased latencies were introduced equally on all network links.

As shown in Figure 3.2, the performance is similar to that obtained in Figure 3.1 for loss.
This is further evidence that the major effect of loss on performance is increased latency. It is surprising that there is no significant difference in behavior between increasing the latency for all messages versus loss, which increases delay for the small number of lost messages. This effect may be due to the loosely synchronous nature of the computation where delaying one message delays them all.

[Figure 3.2: NAS benchmarks performance for dataset B with different latencies using TCP SACK. Panel title: Variation over Different Latencies for TCP SACK (Nagle's Algorithm Disabled). Series: Delay=0ms, Delay=10ms. Benchmarks: LU, CG, EP, IS, MG, SP.]

3.4.3 NAS Benchmarks, Nagle's Algorithm and Delayed Acknowledgments

The last characteristic of TCP that we investigated was the enabling of Nagle's algorithm and delayed acknowledgments. Typically, Nagle's algorithm and delayed acknowledgments are both enabled on a connection. Commonly, implementations of MPI, including LAM, disable Nagle's algorithm by default, and retain the system default of delayed acknowledgments. The results for different settings of these two parameters for the LU benchmark are shown in Figure 3.3.
[Figure 3.3: Performance of LU using different combinations of Nagle's algorithm and delayed acknowledgments. Panel title: LU: Combinations of Nagle and Delayed ACKs. Series: Nagle, Delayed ACK; Nagle, No Delayed ACK; No Nagle, Delayed ACK; No Nagle, No Delayed ACK. Horizontal axis: Dataset Size.]

For LU, Figure 3.3 shows the significant improvement obtained by disabling Nagle's algorithm for small problem sizes. This advantage is less evident for larger problem sizes. Similar observations were made for the SP benchmark. Although the difference is not large, we consistently found that LAM's default setting of disabling Nagle's algorithm and leaving delayed acknowledgments enabled gave the best performance. In particular, it outperformed the case where both are disabled, which should have achieved the smallest latency. Surprisingly, even though the performance of the other combinations is close, the tcpdump trace files show very different traffic patterns. The traces for Nagle disabled and no delayed acknowledgments had a lot of variation in the RTT (round trip time) and faster recovery during congestion avoidance, since acknowledgments return more quickly.
We expected the setting with Nagle disabled and no delayed acknowledgments to give the least amount of delay since segments and ACKs are sent out immediately. However, as Figure 3.3 shows, this does not result in improved performance. Once again the synchronous nature of the computation may imply that it is only the maximum delays that matter.

Figure 3.4 shows the performance of LU and SP for problem sizes W, S and B using the best set of parameters (SACK, Nagle disabled, and delayed acknowledgments turned on).

[Figure 3.4: Comparison of LU and SP benchmarks for different datasets. Panel title: Comparison of LU and SP for Different Datasets (Loss=0%, SACK ON, Nagle OFF, Delayed ACKs ON).]

As Figure 3.4 shows, there is a significant difference between the behavior of LU and SP across the problem set sizes. SP shows a significant drop in performance for dataset B compared to LU. We investigated this further to find that the major difference between S, W and B was the number of long messages used in these benchmarks. For problem size B, the number of long messages ranged from almost none in LU to almost all of the approximately 5000 messages exchanged in SP. This shows the dramatic impact that the MPI middleware and the use of message rendezvous for long messages can have on performance.
SP uses non-blocking MPI_ISEND but is not able to begin transferring data until the rendezvous point is reached.

3.5 Summary of Results

Understanding the dynamics of TCP is difficult because TCP contains a range of flow control and congestion avoidance mechanisms that result in varied performance in different environments. There are also a number of different TCP variants. This chapter highlights the performance problems with using TCP for MPI in WANs, especially taking into account the TCP variants and settings typically used for MPI in local and wide-area networks.

We explored the performance of TCP in an emulated WAN environment to better quantify the differences. TCP SACK consistently outperformed TCP New-Reno, which points to the importance of better congestion control for such environments. For the NAS benchmarks, as expected, there was a considerable drop in performance as the packet loss increased from 0% to 10%. In the NAS benchmarks, for the latencies investigated, the majority of the messages were expected and thus potentially very sensitive to variation in latencies. We found that, for several of the benchmark programs, performance is determined by the maximum RTT across all of the connections. Even a modest loss in one link creates spikes in the RTT that in the end slow down all the processes. For an application that is sensitive to latency, the overall performance is only as fast as the slowest process.

In terms of Nagle's algorithm, the default configuration in LAM-MPI achieved the best results. There was not much of a difference between the best results and those obtained for other combinations except for TCP's default setting: Nagle enabled with delayed acknowledgments. We found that the rendezvous mechanism used by LAM can have a much greater impact on performance than adjusting the Nagle and delayed acknowledgments settings.
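The observation above that performance tracks the maximum RTT across all connections can be illustrated with a very small model. The following Python sketch is an illustration under stated assumptions and not an analysis performed in this work: it assumes a loosely synchronous step that completes only when its slowest message arrives, and it assumes that losses on different messages are independent. The function names and the example numbers in the note that follows are hypothetical.

```python
def step_time(message_delays_ms):
    """A loosely synchronous step ends only when its slowest message has arrived."""
    return max(message_delays_ms)


def prob_step_hit(loss_rate, messages_per_step):
    """Probability that at least one message in a step is lost.

    Assumes each message is lost independently with probability loss_rate
    (an assumption of this sketch).
    """
    return 1.0 - (1.0 - loss_rate) ** messages_per_step
```

As a hypothetical example, four fully connected nodes have six connections. If each connection carries one message per step and each message is lost with probability 1%, then prob_step_hit(0.01, 6) is about 0.059, so nearly 6% of steps contain at least one retransmission. In those steps, step_time shows that a single delayed message sets the pace for every process.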
Our experiments with TCP emphasize the importance of latency tolerant MPI programs for wide-area networks and the ability to recover quickly from loss. In Chapter 4, we discuss SCTP-based middleware for MPI and many of the features of SCTP, like multistreaming, multihoming, better congestion control and a built-in SACK mechanism, that make it more resilient and better suited to wide-area networks. Due to SCTP's similarity with TCP, it will also be able to take advantage of research on tuning of individual connections and other performance enhancements.

Chapter 4

SCTP-based Middleware: Design and Implementation

In this chapter we discuss the role that SCTP can play as the underlying transport layer for MPI. The advantage of using MPI in distributed environments is that its libraries exist for a wide variety of machines and environments ranging from dedicated high performance architectures, local area networks and compute grids to execution across wide-area networks and the Internet. As a standard it has been successful in creating a body of code that is portable to any of these environments. However, the strict definition of portability is insufficient since scalability and performance must also be considered. It is challenging to obtain good performance in environments such as wide-area networks where bandwidth and latency vary. Our experiments in Chapter 3 with TCP underscore the importance of latency tolerant MPI programs for wide-area environments. Programs that overlap communication with computation can hide much of the effect of the latency that is inevitable in such environments.

There are several reasons why SCTP may be more suitable than TCP for MPI in general, and may also prove to be more robust in wide-area environments.

1. As we discuss in detail in this chapter, SCTP has a lot in common with MPI, which allows us to offload a lot of the functionality of MPI onto the transport layer.
2. SCTP's multistreaming ability allows us to provide additional concurrency and avoid the head-of-line blocking that is possible in TCP.
3. Congestion control and SACK in SCTP are more extensive than in TCP and allow more efficient communication under loss conditions.
4. SCTP's support for multihoming makes it more fault tolerant, an important property in wide-area networks. It can take advantage of multiple alternate paths between a pair of nodes and can recover from lost segments faster than TCP.

In this thesis we investigate all of the above advantages of SCTP, except multihoming.

We first discuss the details of redesigning the LAM TCP RPI module to use SCTP as its underlying transport protocol. We implemented three versions of an SCTP-based RPI module to investigate the opportunities offered by SCTP in terms of the variety of new features that it offers and to explore ways in which MPI can benefit from a new transport protocol. The design and implementation of our SCTP module was a three-phased process where we iteratively added functionality to each successive module. Each module was refined as we learned from our experiences until we achieved the final version, which assimilates the benefits of the previous modules.

One of the first challenges we faced in redesigning the TCP RPI module was the lack of documentation about the module and its interaction with other parts of LAM. The TCP RPI module maintains state information for all requests, such as how far the request has progressed and what type of data/event is required for it to reach completion. It has a complex state machine for supporting asynchronous and non-blocking message progression in MPI. The only way to learn how it worked was by thorough code examination. In the process, we also created a detailed technical document explaining the module's working that other people can benefit from [33]. This document is publicly available and is linked from the official LAM/MPI website.
We also extensively instrumented the LAM code to trace request progression and other key events using lam_debug statements. The lam_debug statements allowed us to tune the level of verbosity in the output and were used to create our own custom diagnostic system. During our initial learning process and also during implementation and testing we made extensive use of the diagnostic output to trace problems. Tracing problems in the execution of parallel programs was quite challenging at times, especially when the output was not always reproducible and when slight changes in the timing of events produced different results. In many instances, painstaking examination of the diagnostic traces was needed to identify the source of errors.

Another challenge was the use of a new transport protocol that is still in its early stages of optimization and tuning. We found that the performance of SCTP was highly dependent on the implementation used, with the Linux lksctp implementation performing much worse than the FreeBSD KAME implementation. Moreover, while using different features of the SCTP API, there were several coding nuances that we had to discover for ourselves as not much documentation was available. During our testing, the SCTP module also provided a testing ground for the FreeBSD KAME SCTP implementation, where we discovered several problems with the protocol that led to improvements in the stack. We collaborated closely with Randall Stewart, one of the primary designers of SCTP, in the identification of those problems.

In the subsequent sections we describe the design and implementation details of the final version of the SCTP module, since it incorporates the features of previous versions, and also compare it to the LAM-TCP design. Our iterative design process is discussed in more detail in [34]. We first discuss why SCTP is a good match for MPI as its underlying transport protocol.
In Sections 4.2 and 4.3 we discuss how multiple streams are used in our module and the enhanced concurrency obtained as a result. In Section 4.4 we discuss resource management issues in the middleware and provide a comparison with LAM-TCP. In Section 4.5 we describe the race conditions in our module and the solution adopted to fix them. In Section 4.6 we discuss enhanced reliability and fault tolerance in our module due to the features provided by SCTP. In the last section we discuss some limitations of using SCTP.

4.1 Overview of using SCTP for MPI

SCTP promises to be particularly well-suited for MPI due to its message-oriented nature and provision of multiple streams in an association. As shown in Figure 4.1, there are some striking similarities between SCTP and MPI.

Figure 4.1: Similarities between the message protocol of MPI and SCTP
  SCTP                    MPI
  One-to-many socket      Context
  Association             Rank / Source
  Streams                 Message tags

Contexts in an MPI program identify a set of processes that communicate with each other, and this grouping of processes can be represented as a one-to-many socket in SCTP that establishes associations with that set of processes. SCTP can map each association to the unique rank of a process within a context and thus use an association number to determine the source of a message arriving on its socket. Each association can have multiple streams which are independently ordered, and this property directly corresponds with message delivery order semantics in MPI. In MPI, messages sent with different tag, rank and context (TRC) to the same receiver are allowed to overtake each other. This permits direct mapping of streams to message tags.

This similarity between SCTP and MPI is workable at a conceptual level; however, there are some implementation issues, to be discussed, that make the socket-context mapping less practical.
Therefore, we preserved the mappings from associations to ranks and from streams to message tags, but not the one from socket to context, in our implementation. Context creation within an MPI program can be a dynamic operation, and if we map sockets to contexts, this would require creating a dynamic number of sockets during program execution. Not only would this add complexity and additional bookkeeping to the middleware, there can also be an overhead to managing a large number of sockets. Creating a lot of sockets in the middleware counteracts the benefits we can get from using a single one-to-many SCTP socket that can send and receive messages on multiple associations. For these reasons, instead of mapping sockets to contexts, the context and tag pair was used to map messages to streams.

In discussions with Randall Stewart we discovered another way of dealing with contexts, which is to map them to the payload protocol identifier (PPID) carried in the header of each SCTP DATA chunk. The PPID is a 32-bit field that can be used at the application level to label the contents of a message and is ignored by the SCTP layer. Using the PPID field gives us the flexibility of dynamic context creation in our application without the need for maintaining a corresponding number of sockets. Also, the PPID mapping can be easily incorporated in our module with minor modifications.

4.2 Message Demultiplexing

As discussed in Section 2.1, SCTP provides a one-to-many UDP-like style of communication that allows a single socket descriptor to receive messages from multiple associations, eliminating the need for maintaining a large number of socket descriptors. In the case of LAM's TCP module there is a one-to-one mapping between processes and socket descriptors, where every process creates individual sockets for each of the other processes in the LAM environment. In SCTP's one-to-many style, the mapping between processes and sockets no longer exists.
With one-to-many sockets there is no way of knowing when a particular association is ready to be read from or written to. At any time, a process can receive data sent on any of the associations through its sole socket descriptor. SCTP's one-to-many socket API does not permit reception of messages from a particular association; therefore, messages are received by the application in the order they arrive, and only afterwards is the receive information examined to determine the association a message arrived on. It is likely that the ability to peek at the association information before receiving data may become part of later versions of the SCTP API [57].

In our SCTP module, each message goes through two levels of demultiplexing: first on the association number the message arrived on, and second on the stream number within that association. These two parameters allow us to invoke the correct state function for that request, which directs the incoming message to the proper request receive buffer. If no matching receive was posted, the message is treated as an unexpected message and buffered.

4.3 Concurrency and SCTP streams

MPI send/receive calls without wildcards define an ordered stream of messages between two processes. It is also possible, through the use of wildcards or appropriate tags, to relax the delivery order of messages. Relaxing the ordering of messages can be used to make the program more message driven and independent of network delay and loss. However, when MPI is implemented on top of TCP, the stream-oriented semantics of TCP with one connection per process precludes having unordered message streams from a single process. This restriction is removed in our implementation by mapping tags onto SCTP streams, which allows different tags from the same source to be independently delivered.

4.3.1 Assigning Messages to Streams

As discussed in Chapter 2, message matching in MPI is based on the tag, rank and context of the message.
In SCTP, the number of streams is a static parameter (short integer value) that is set when an association is initially established. For each association, we use a fixed-size pool of stream numbers, ten by default, that is used for sending and receiving messages between the endpoints of that association. Messages with different TRC are mapped to different stream numbers within an association to permit independent delivery. Of course, since the number of streams is fixed, the degree of concurrency achieved depends on the number of streams.

4.3.2 Head-of-Line Blocking

Head-of-line blocking can occur in the LAM TCP module when messages have the same rank and context but different tags. As shown in Figure 4.2, we use non-blocking communication between two processes that communicate by sending and receiving several messages with different tags. The communication/computation overlap depends on the amount of data that can be sent without blocking while waiting for it to be received, so that the process can continue with useful computation. As Figure 4.2 shows, loss of a single TCP segment can block delivery of all subsequent data in a TCP connection until the lost segment is delivered. Our assignment of TRC to streams in the SCTP module alleviates this problem by allowing the unordered delivery of these messages.

It is often the case that MPI applications are loosely synchronized and as a result alternate between computation and bursts of communication. We believe this characteristic of MPI programs makes them more likely to be affected by head-of-line blocking in high loss scenarios, which underpins our premise that SCTP is better suited than TCP as a transport mechanism for MPI on WANs.
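The stream assignment of Section 4.3.1 can be sketched in a few lines of Python. This is an illustration only: the text states that the context and tag pair selects a stream from a fixed pool (ten by default) but does not give the mapping function, so the modulo scheme and the names `NUM_STREAMS` and `stream_for` are assumptions made for this sketch.

```python
NUM_STREAMS = 10  # pool size per association; ten is the default stated in the text


def stream_for(context: int, tag: int, num_streams: int = NUM_STREAMS) -> int:
    """Map a (context, tag) pair to a stream number in [0, num_streams).

    Assumed scheme: a plain modulo over the sum of context and tag. The rank
    is not part of the key because the association already identifies it.
    """
    return (context + tag) % num_streams


if __name__ == "__main__":
    context = 0
    for tag in (1, 2, 11):
        print(f"context={context} tag={tag} -> stream {stream_for(context, tag)}")
```

With a pool of ten streams, tags 1 and 2 land on different streams and can be delivered independently, while tags 1 and 11 share a stream and remain ordered with respect to each other. This is the sense in which the achievable concurrency is bounded by the size of the stream pool.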
Figure 4.2: Example of head-of-line blocking in LAM TCP module
  (Two panels, "LAM TCP Module" and "SCTP Module". In each, Process X issues MPI_Isend for Msg_A with Tag_A and Msg_B with Tag_B, and Process Y posts two MPI_Irecv calls. The TCP panel is marked "Blocked", the SCTP panel "Delivered".)

4.3.3 Example of Head-of-Line Blocking

As an illustration of the benefits of our SCTP module, consider the two MPI processes P0 and P1 shown in Figure 4.3. Process P1 sends two messages, Msg-A and Msg-B, to P0 in order, using different tags. P0 does not care what order it receives the messages in and, therefore, posts two non-blocking receives. P0 waits for any of the receive requests to complete and then carries out some computation. Assume that part of Msg-A is lost during transmission and has to be retransmitted while Msg-B arrives safely at the receiver. In the case of TCP, its connection semantics require that Msg-B stay in TCP's receive buffer until the two sides recover from the loss and the lost parts are retransmitted.
Even though the programmer has specified that the messages can be received in any order, P0 is forced to receive Msg-A first and incur increased latency as a result of TCP's semantics. In the case of the SCTP module, since different message tags are mapped to different streams and each stream can deliver messages independently, Msg-B can be delivered to P0 and the process can continue executing until Msg-A is required. SCTP matches the MPI semantics and makes it possible to take advantage of the concurrency that was specified in the program. The TCP module offers concurrency at process level, while our SCTP module adds to this with enhanced concurrency at the TRC level.

Figure 4.3: Example of multiple streams between two processes using non-blocking communication
  P1: MPI_Send(Msg-A, P0, tag-A); MPI_Send(Msg-B, P0, tag-B)
  P0: MPI_Irecv(P1, tag-A); MPI_Irecv(P1, tag-B); MPI_Waitany(); Compute(); MPI_Waitall()
  (Two execution timelines for P0, one for SCTP and one for TCP, each marked where Msg-B and Msg-A arrive.)

Figure 4.4: Example of multiple streams between two processes using blocking communication
  P0: MPI_Recv(P1, tag-A); MPI_Recv(P1, tag-B)
  P1: MPI_Send(Msg-A, P0, tag-A); MPI_Send(Msg-B, P0, tag-B)

Even when blocking communication takes place between a pair of processes and there is loss, there are still advantages to using SCTP. Consider Figure 4.4, where blocking sends and receives are being used with two different tags. In SCTP, if Msg-A is delayed or lost and Msg-B arrives at the receiver, Msg-B is treated as an unexpected message and buffered. SCTP thus allows the receive buffer to be emptied and hence does not slow down the sender through the flow control mechanism. In TCP, however, Msg-B will occupy the socket receive buffer until Msg-A arrives.
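The Msg-A and Msg-B example can be mimicked with a small Python model of delivery order. It is a sketch under simplifying assumptions, not a model of either protocol stack: each message is treated as a single unit, the arrival times are invented, and the name `delivery_times` is hypothetical. The only thing it captures is the ordering domain, which is the whole connection for TCP and a single stream for SCTP.

```python
def delivery_times(messages, per_stream_ordering):
    """Return the time at which each message can be handed to the application.

    messages maps a name to (send_index, stream, arrival_time). A message is
    deliverable once every earlier message in the same ordering domain has
    arrived: the whole connection (TCP-like) or one stream (SCTP-like).
    """
    result = {}
    for name, (send_index, stream, _arrival) in messages.items():
        blockers = [
            arrival
            for (index, other_stream, arrival) in messages.values()
            if index <= send_index
            and (not per_stream_ordering or other_stream == stream)
        ]
        result[name] = max(blockers)
    return result


# Msg-A is sent first on stream 0 but is lost and retransmitted (arrives at t=30).
# Msg-B is sent second on stream 1 and arrives at t=10. Times are arbitrary units.
MESSAGES = {"Msg-A": (0, 0, 30), "Msg-B": (1, 1, 10)}

tcp_like = delivery_times(MESSAGES, per_stream_ordering=False)
sctp_like = delivery_times(MESSAGES, per_stream_ordering=True)
print("TCP-like: ", tcp_like)
print("SCTP-like:", sctp_like)
```

In the TCP-like case Msg-B cannot be delivered before the retransmitted Msg-A, whereas with per-stream ordering Msg-B is available as soon as it arrives, so a process blocked in MPI_Waitany could start computing earlier.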
4.3.4 Maintaining State

The LAM TCP module maintains state for each process (mapped to a unique socket) that it reads from or writes to. Since TCP delivers bytes in strict sequential order and the TCP module transmits and receives one message body at a time per process, there is no need to maintain state information for any other messages that may arrive from the same process, since they cannot overtake the message body currently being read. In our SCTP module, this assumption no longer holds, since subflows on different stream numbers are only partially ordered with respect to the entire association. Therefore, we need to maintain state for each stream number that a message can arrive on from a particular process. We only need to maintain a finite amount of state information per association since we limit the possible number of streams to our pool size of ten streams. For each stream number, the state holds information about how much of the message has been read, what stage of progression the request is in and what needs to be done next.

At the time when an attempt is made to read an incoming message, we cannot tell in advance where the message will arrive from or how large it will be; therefore, we specify a length equal to the size of the socket receive buffer. SCTP's receive function sctp_recvmsg takes care of message framing by returning the next message rather than the number of bytes specified by the size field in the function call. This frees us from having to look through the receive buffer to locate the message boundaries.

4.4 Resource Management

LAM's TCP RPI module uses TCP as its communication mechanism and employs a socket-based interface in a fully connected environment. It uses a one-to-one connection-oriented scheme and maintains N sockets, one for each of the processes in its environment. The select system call is used to poll these sockets to determine their readiness for reading any incoming messages or writing outgoing messages.
Polling of these sockets is necessary because operating systems like Linux and UNIX do not support asynchronous communication primitives for sockets. It has been shown that the cost of the select system call grows linearly with the number of sockets [44]. Socket design was originally based on the assumption that a process would initiate a small number of connections, and it is known that performance is affected if a large number of sockets are used. The implications of this are significant in large-scale commodity clusters with thousands of nodes, which are becoming more frequently used. The use of collective communications in MPI applications also strengthens the argument, since nearly all connections become active at once when a collective communication starts, and handling a large number of active connections may result in performance degradation.

SCTP's one-to-many communication style eliminates the need for maintaining a large number of socket descriptors. In our implementation each process creates a single one-to-many SCTP socket for communicating with the other processes in the environment. An association is established with each of the other processes using that socket, and since each association has a unique identification, it maps to a unique process in the environment. We do not use the select system call to detect events on the socket; instead, an attempt is made to retrieve messages at the socket using sctp_recvmsg as long as there are any pending receive requests. In a similar way, if there are any pending send requests, they are written out to the socket using sctp_sendmsg. If the socket returns EAGAIN, signaling that it is not ready to perform the current read or write operation, the system attempts to advance other outstanding requests.
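The receive side of this progression scheme, combined with the two-level demultiplexing of Section 4.2, can be sketched as follows. This is a minimal sketch and not the module's code: `FakeOneToManySocket` is a hypothetical stand-in for a non-blocking one-to-many SCTP socket (no real SCTP calls are made), and `advance`, `posted` and `unexpected` are names invented for the illustration.

```python
import errno
from collections import deque


class FakeOneToManySocket:
    """Hypothetical stand-in for a non-blocking one-to-many SCTP socket."""

    def __init__(self, incoming):
        # Each entry is (association id, stream number, payload).
        self._incoming = deque(incoming)

    def recvmsg(self):
        if not self._incoming:
            # Mirrors a non-blocking receive failing with EAGAIN.
            raise BlockingIOError(errno.EAGAIN, "no message ready")
        return self._incoming.popleft()


def advance(sock, posted, unexpected):
    """Read whole messages until the socket would block.

    posted maps (assoc_id, stream) to the receive buffer (a list) of a pending
    request; messages with no matching request are buffered as unexpected.
    """
    while True:
        try:
            assoc_id, stream, payload = sock.recvmsg()
        except BlockingIOError:
            return  # not ready: the caller goes on to advance other requests
        key = (assoc_id, stream)  # level 1: association, level 2: stream
        if key in posted:
            posted[key].append(payload)
        else:
            unexpected.setdefault(key, []).append(payload)


sock = FakeOneToManySocket([(1, 0, b"a"), (2, 3, b"b"), (1, 0, b"c")])
posted = {(1, 0): []}
unexpected = {}
advance(sock, posted, unexpected)
print(posted, unexpected)
```

A single descriptor serves every peer here, so the cost of checking for input does not depend on the number of processes in the way a select call over N sockets does.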
The use of one-to-many sockets in SCTP results in a more scalable and portable implementation since it does not impose a strong requirement on the system to manage a large number of socket descriptors and the resources for the associated buffers. As discussed in [5], managing these resources can present problems. We investigated a process's limits for the number of socket descriptors it can manage and the number of associations it can handle on a single one-to-many socket. In our setup, a process reached its limit on socket descriptors much earlier than the limit on the maximum number of associations it can handle. The number of file descriptors usually has a per-user limit of 1024, which requires root privileges to change. Moreover, clusters can potentially execute parallel jobs from several users, and need to be capable of providing a number of sockets to each user.

4.5 Race Conditions

In this section we describe the race conditions in our module and the solutions adopted to fix them.

4.5.1 Race Condition in Long Message Protocol

In the SCTP module it was necessary to change the long message protocol. This was necessary because, even though SCTP is message-based, the size of a message that can be sent in a single call to the sctp_sendmsg function is limited by the send buffer size. As a result, large messages had to be broken up into multiple smaller messages, each smaller than the send buffer. These messages then had to be reassembled at the receiving side.

All pieces of the large message are sent out on the same stream number to ensure in-order delivery. As a refinement to this scheme, we considered interleaving messages sent on different streams with portions of a message larger than the send buffer size, at the time it is passed to the SCTP transport layer.
Since reassembly of the large message is being done at the RPI level, and not at the SCTP level, this may result in reduced latency for shorter messages on other streams, especially when processes use non-blocking communication.

While testing our SCTP module we encountered race conditions that occurred due to the rendezvous mechanism of the long message protocol. Consider the case when two processes P0 and P1 are exchanging long messages using the same stream numbers. Both of them would send rendezvous initiating envelopes for their long messages to the other and wait for an acknowledgment before starting the transmission of the long message body. As shown in Figure 4.5, after P1 receives the ACK for its long message Msg1, it begins transmitting it.

Figure 4.5: Example of long message race condition
  (Message sequence between P0 and P1, in order: Envelope-Msg1, ACK-Msg1, Envelope-Msg2, Partial body-Msg1, ACK-Msg2 (incorrect), Remaining body-Msg1.)

It is possible that P1 is successful in writing only part of Msg1 to the socket, and it then saves its current state, which would include the number of bytes sent to P0. Process P1 would now continue and try to advance its other active requests, and would return to writing the rest of Msg1 later. During that time P1 sends out an ACK for long message Msg2 to P0 using the same stream number that Msg1 was being sent on. From the perspective of process P0, it was in the middle of receiving Msg1 and the next message it receives is from the same process and on the same stream number. Since P0 decides what function to call after reading a message from its socket based on the association and stream number it arrived on, it calls the function that was directing the bytes received to the buffer designated for Msg1. As a result, P0 erroneously reads the ACK for Msg2 as part of the body of Msg1.

Option A

In order to fix the above race condition we considered several options and the trade-offs associated with them.
The first option was to stay in a loop while sending Msg1 until all of it has been written out to the socket. One advantage of this is that the rendezvous mechanism introduces latency overhead, and once the transmission of the long message body starts, we do not want to delay it any longer. The disadvantage is that it greatly reduces the amount of concurrency that would otherwise have been possible, since we do not receive from or send to other streams or associations. Also, if the receiver of the long message had issued a non-blocking receive call, then it might continue to do other computations and, as a result, sending the data in a tight loop may simply cause the sender to stall waiting for the receiver to remove the data from the connection. Another disadvantage arises when the receiver is slow: we would be overloading it by sending a lot of data in a loop instead of advancing other requests on other streams or associations.

Option B

The second option was to disallow P1 from writing a different message on the same stream number to the same process if a previous message writing operation was still in progress. The advantage of this was simplicity in design, but we reduce the amount of overlap that would have been possible in the above case had we allowed both processes to exchange long messages at the same time. However, it is still possible to send and receive messages on other streams or associations. This option is the one we implemented in the SCTP module.

Option C

A third option was to treat acknowledgments used within LAM to synchronize message exchange, such as those used in the long message rendezvous mechanism, as control messages. These control messages would be treated differently from other messages containing actual data. Whenever a control message carrying an ACK arrives, the middleware would know it is not part of any unfinished message, e.g., the long message in our example, and would invoke the correct state function to handle it.
This option introduces more complexity and bookkeeping to the code, but may turn out to be the one that offers the most concurrency.

4.5.2 Race Condition in MPI_Init

There was one other race condition that occurred in implementing the SCTP module. Because the one-to-many SCTP style does not require any of the customary accept and connect function calls before receiving messages, we had to be careful in MPI_Init to ensure that each process establishes associations with all the other processes before exchanging messages. In order to ensure that all associations are established before messages are exchanged, we implemented a barrier at the end of our association setup stage and before the message reception and transmission stage. This is especially important in a heterogeneous environment with varying network and machine speeds. It is easier with TCP because the MPI_Init connection setup procedure automatically takes care of this issue through its use of connect and accept function calls.

4.6 Reliability

In general, MPI programs are not fault tolerant and communication failure typically causes the programs to fail. There are, however, projects like FT-MPI [18], which introduce different modes of failure that can be controlled by an application using a modified MPI API. This work is discussed in Chapter 6. One source of failure is packet loss in networks, which can result in added latency and reduced bandwidth that severely impact the overall performance of programs. There are several mechanisms in SCTP that help reduce the number of failures and improve the overall reliability of executing an MPI program in a WAN environment.

4.6.1 Multihoming Feature

MPI applications have a strong requirement for the ability to rapidly switch to an alternate path without excessive delays in the event of failure. MPI processes in a given application's environment tend to be loosely synchronized even though they may not all be communicating directly with each other.
Delays occurring in one process due to failure of the network can have a domino effect on other processes, which can potentially be delayed as well. We saw some evidence of this phenomenon in our experiments in Chapter 3. It would be highly advantageous for MPI processes if the underlying transport protocol supported fast path failover in the event of network failure.

SCTP's multihoming feature provides an automatic failover mechanism where a communication failure between two endpoints on one path will cause switchover to an alternate path with little interruption to the data transfer service. Of course, this is useful only when there are independent paths, but having multiple interfaces on different networks is not uncommon and SCTP makes it possible to exploit the possibility when it exists. TCP has no built-in support for multihomed IP hosts and cannot take advantage of path-level redundancy. This could be accomplished by using multiple sockets and managing them in the middleware, but doing so introduces added complexity to the middleware.

SCTP uses a heartbeat control chunk to periodically probe the reachability of the idle destination addresses of an association and to update the path round trip times of the addresses. The heartbeat period is a configurable option, modifiable through a socket option, with a default value usually set to 30 seconds [59]. The heartbeat mechanism detects failure by keeping count of consecutive missed responses to the data and/or heartbeat chunks sent to that address. If this count exceeds a configurable threshold, spp_pathmaxrxt, then that destination address is marked as unreachable. By default spp_pathmaxrxt is set to a value of 5. The value of this parameter represents a trade-off between speed of detection and reliability of detection. If the value is set too low, an overloaded host may be declared unreachable.
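The failure detection rule just described amounts to a per-destination counter, which the following Python sketch makes concrete. It is an illustration of the rule as stated in the text, not of any particular SCTP stack: the class `PathMonitor` and its method names are hypothetical, and the default of 5 is the spp_pathmaxrxt default quoted above.

```python
class PathMonitor:
    """Track reachability of one destination address (illustrative only)."""

    def __init__(self, path_max_retrans=5):
        self.path_max_retrans = path_max_retrans  # plays the role of spp_pathmaxrxt
        self.error_count = 0
        self.reachable = True

    def on_missed_response(self):
        """A data or heartbeat chunk sent to this address went unanswered."""
        self.error_count += 1
        if self.error_count > self.path_max_retrans:
            self.reachable = False  # candidate for failover to another path

    def on_response(self):
        """Any acknowledgment clears the run of consecutive misses."""
        self.error_count = 0
        self.reachable = True


default_path = PathMonitor()
for _ in range(5):
    default_path.on_missed_response()
print("after 5 misses:", default_path.reachable)
default_path.on_missed_response()
print("after 6 misses:", default_path.reachable)

# A low threshold reacts faster but also gives up on a merely slow peer.
impatient_path = PathMonitor(path_max_retrans=1)
impatient_path.on_missed_response()
impatient_path.on_missed_response()
print("threshold 1, after 2 misses:", impatient_path.reachable)
```

The second monitor shows the trade-off in the text: with a threshold of 1, two late responses from an overloaded host are enough to declare the address unreachable.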
In addition to the path retransmission threshold, there is a range of user-adjustable controls that can be used to adjust the maximum and minimum retransmission timeouts (RTO) and hence tune the amount of time it takes to determine network failures [55]. There are trade-offs associated with selecting the values of these control parameters. Setting the maximum RTO too low, for example, can result in unnecessary retransmissions in the event of delay in the network. Ideally, these control parameters need to be tuned to suit the nature and requirements of the application and the network, to ensure fast failover while avoiding unnecessary switchovers due to network delay. In contrast, TCP offers little control over its retransmission timers and has no built-in mechanism for failover.

As a side note on multihoming, an SCTP association can span both IPv4 and IPv6 addresses and can aid in the progression of the Internet from IPv4 to IPv6 [58].

4.6.2 Added Protection

SCTP has several built-in features that provide an added measure of protection against flooding and masquerade attacks. They are discussed below.

As SCTP is connection oriented, it exchanges setup messages at the initiation of the communication, for which it uses a robust four-way handshake as shown in Figure 4.6. The receiver of an INIT message does not reserve any resources until the sender proves that its IP address is the one claimed in the association setup message.

Figure 4.6: SCTP's four-way handshake

The handshake uses a signed state cookie to prevent the use of IP spoofing for a SYN flooding denial-of-service attack. The cookie is authenticated by making sure that it has a valid signature, and then a check is made to verify that the cookie is not stale. This check guards against replay attacks [59].
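The two checks applied to a returned cookie, a valid signature followed by a freshness test, can be illustrated with a keyed hash. The Python sketch below is an assumption-laden illustration of the idea and not SCTP's cookie format: the layout (MAC, then timestamp, then state), the use of HMAC-SHA256, the 60 second lifetime and the function names are all choices made for this example.

```python
import hashlib
import hmac
import struct

COOKIE_LIFETIME = 60.0  # seconds; assumed value for this sketch
MAC_LEN = hashlib.sha256().digest_size


def make_cookie(secret: bytes, state: bytes, now: float) -> bytes:
    """Sign the association state together with its creation time."""
    body = struct.pack("!d", now) + state
    mac = hmac.new(secret, body, hashlib.sha256).digest()
    return mac + body


def check_cookie(secret: bytes, cookie: bytes, now: float):
    """Return the embedded state if the cookie is authentic and fresh, else None."""
    mac, body = cookie[:MAC_LEN], cookie[MAC_LEN:]
    expected = hmac.new(secret, body, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None  # forged or corrupted: signature does not verify
    (issued,) = struct.unpack("!d", body[:8])
    if now - issued > COOKIE_LIFETIME:
        return None  # stale: rejecting it guards against replay
    return body[8:]


secret = b"server-private-key"
cookie = make_cookie(secret, b"peer=10.0.0.1", now=100.0)
print(check_cookie(secret, cookie, now=110.0))
print(check_cookie(secret, cookie, now=200.0))
```

Because all of the state travels inside the cookie, the receiver of the initial request needs to keep nothing in memory until a valid cookie comes back, which is what makes spoofed setup requests cheap to ignore.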
Some implementations of TCP also use a cookie mechanism, but those typically do not use signed cookies, and the cookie mechanism is supplied as an option rather than being an integral part of the protocol as is the case with SCTP. In TCP, a user has to validate that both sides of the connection have the required implementation. SCTP's four-way handshake may be seen as an overhead; however, if a one-to-many style socket is used, then user data can be piggy-backed on the third and fourth legs of the handshake.

Every SCTP packet has a common header that contains, among other things, a 32-bit verification tag. This verification tag protects against two things: (1) it prevents an SCTP packet from a previous inactive association from being mistaken as part of a current association between the same endpoints; (2) it protects against blind injection of data into an active association [59].

A study [63]¹ done on TCP Reset attacks has shown that TCP is vulnerable to a denial-of-service attack in which the attacker tries to prematurely terminate an active TCP connection. The study found that this type of attack can utilize the TCP window size to reduce the number of sequence numbers that must be guessed for a spoofed packet to be considered valid and accepted by the receiver. This kind of attack is especially effective on long-standing TCP flows such as BGP peering connections which, if successfully attacked, would disrupt routing in the Internet. SCTP is not susceptible to reset attacks due to its verification tag [57]. SCTP is able to use large socket buffer sizes by default because it is not subject to the denial-of-service attacks that TCP becomes susceptible to with large buffers [57].

SCTP also provides an autoclose option that protects against accidental denial-of-service attacks where a process opens an association but does not send any data. Using autoclose, we can specify the maximum number of seconds that an association may remain idle, i.e., with no user traffic in either direction. After this time the association is automatically closed [55]. There is another difference between SCTP's and TCP's close mechanisms: TCP allows a "half-closed" state, which is not allowed in SCTP. In the half-closed state, one side closes a connection and cannot send data, but may continue to receive data until the peer closes its connection. SCTP avoids the possibility of a peer failing to close its connection by disallowing a half-closed state.

In this section we have highlighted the additional security features that are present in SCTP. These features are especially important in open network environments, such as the Internet, where security is an issue, and therefore the use of SCTP adds to the reliability of the environment for MPI.

¹ Network transport protocols rarely hit the news headlines. The work by Paul Watson was widely reported in the media. The Vancouver Sun featured a front page article titled "The man who saved the Internet" on April 22, 2004. This problem, however, had been identified earlier by the network community [57].

4.6.3 Use of a Single Underlying Protocol

One issue not directly related to SCTP was the mixed use of UDP and TCP in LAM. LAM uses user level daemons that by default employ UDP as their transport protocol. These daemons serve several purposes:

1. Daemons act as external agents that have previously fulfilled the authentication requirements for connecting to other nodes and can enable fast execution of the MPI programs.
2. They enable external monitoring of running jobs and carry out cleanup when a user aborts an MPI process.
3. LAM also provides remote I/O via the LAM daemon processes.
We modified the transport protocol used by daemons in LAM to SCTP so that we have a single underlying protocol while executing MPI programs, instead of a combination of protocols that do not offer the set of features we want to take advantage of, such as the capability of all the components of the LAM environment to multihome.

4.7 Limitations

As discussed in Section 4.5, SCTP has a limit on the size of message it can send in a single call to sctp_sendmsg. This prevents us from taking full advantage of the message-framing property of SCTP, since our long message protocol divides a long message into smaller messages and then carries out message framing at the middleware level.

Both TCP and SCTP use a flow control mechanism where a receiver advertises its available receive buffer space to the sender. Normally when a message is received it is kept in the kernel's protocol stack buffer until a process issues a read system call. When large sized messages are exchanged in MPI applications and messages are not read out quickly, the receiver advertises less buffer space to the sender, which slows down the sender due to flow control. An event driven mechanism that gets invoked as soon as a message arrives is currently not supported in our module.

Some factors that can affect the performance of SCTP are as follows:

• SCTP uses a comprehensive 32-bit CRC32c checksum which is expensive in terms of CPU time, while TCP typically offloads checksum calculations to the network interface card (NIC). However, CRC32c provides much stronger data verification at the cost of additional CPU time. It has been proven to be mathematically stronger than the 16-bit checksums used by TCP or UDP [58]. It is likely that hardware support for CRC32c may become available in a few years.

• TCP can always pack a full MTU, but SCTP is limited by the fact that it bundles different messages together, which may not always fit to pack a full MTU. However, in our experiments this was not observed to be a factor impacting the performance.

• TCP performance has been fine-tuned over the past decades; optimization of the SCTP stack, however, is still in its early stages and will improve over time.

In this chapter we discussed the design and implementation issues of our SCTP module in detail. We have focused on the need for latency tolerant MPI programs for wide-area environments and described SCTP's many features, which make it more resilient and suitable for such environments. In Chapter 5 we carry out a thorough experimental evaluation of the SCTP module, and compare its performance to the LAM-TCP module.

Chapter 5

Experimental Evaluation

In this chapter we describe the experiments carried out to compare the performance of our SCTP module with the LAM-TCP module for different MPI applications. Our experimental setup consists of a dedicated cluster of eight identical Pentium-4 3.2GHz, FreeBSD-5.3 nodes connected via a layer-two switch using 1 Gbit/s Ethernet connections. Kernels on all nodes are augmented with the kame.net SCTP stack [36]. The experiments were performed in a controlled environment and Dummynet was configured on each of the nodes to allow us to vary the amount of loss on the links between the nodes. We compare the performance under loss rates of 0%, 1% and 2% between each of the eight nodes.

The cluster used for our experiments in this chapter is larger and faster than the one used for evaluating TCP for MPI, described in Chapter 3. The new cluster had become available to us at the time we were ready to test our SCTP module, and provided the opportunity to carry out more extensive experiments with better machines and a larger number of nodes. There were NAS benchmarks, like BT, which could not be run on the older cluster, as they were taking an inordinate amount of time to execute. Since the new cluster reduced the overall time required for each experiment run, we were able to move quickly.
All our experiments were executed several times to verify the consistency of our results.

In Section 5.1, we evaluate the performance of the SCTP module using two standard benchmark programs. In Section 5.2, we look at the effect of using a latency tolerant program that overlaps communication with computation. We compare the performance using a real-world program that uses a master-slave communication pattern commonly found in MPI programs. We also discuss the effects of head-of-line blocking.

In order to make the comparison between SCTP and TCP as fair as possible, the following settings were used in all the experiments discussed in subsequent sections:

1. By default, SCTP uses much larger SO_SNDBUF/SO_RCVBUF buffer sizes than TCP. In order to prevent any possible effects on performance due to this difference, the send and receive buffers were set to a value of 220 Kbytes in both the TCP and SCTP modules.
2. Nagle's algorithm is disabled by default in LAM-TCP and this setting was used in the SCTP module as well.
3. An SCTP receiver uses Selective Acknowledgment (SACK) to report any missing data to the sender; therefore, the SACK option for TCP was enabled on all the nodes used in the experiment.
4. In our experimental setup, the ability to multihome between any two endpoints was available, since each node is equipped with three gigabit interface cards and three independent paths between any two nodes were available. In our experiments, however, this multihoming feature was not used, so as to keep the network settings as close as possible to those used by the LAM-TCP module.
5. TCP is able to offload checksum calculations on to the NICs on our nodes and thus has zero CPU cost associated with its calculation. SCTP has an expensive CRC32c checksum which can prove to be a considerable overhead in terms of CPU cost. We modified the kernel to turn off the CRC32c checksum in SCTP so that this factor does not affect the performance results.
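Settings 1 and 2 in the list above correspond to ordinary socket options. The sketch below shows them on a TCP socket using Python's standard `socket` module; it is only an illustration of the kind of configuration described, not code from the LAM modules, and the helper name `make_tuned_tcp_socket` and the reading of 220 Kbytes as 220 × 1024 bytes are assumptions. The SCTP sockets API defines a corresponding SCTP_NODELAY option for the second setting.

```python
import socket

# 220 Kbytes as used in the experiments; 1 Kbyte = 1024 bytes is an assumption here.
BUF_BYTES = 220 * 1024


def make_tuned_tcp_socket():
    """Create a TCP socket with equal send/receive buffers and Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request identical send and receive buffer sizes (setting 1).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)
    # Disable Nagle's algorithm so small messages are sent immediately (setting 2).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Note that the operating system may round or cap the requested buffer sizes, so reading the values back with `getsockopt` is the only way to know what was actually granted.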
5.1 Evaluation Using Benchmark Programs

We evaluated our implementation using two benchmark programs: the MPBench ping-pong test and the NAS parallel benchmarks. The NAS benchmarks approximate the performance of real applications. These benchmarks provided a point of reference for our measurements and the results are discussed below.

5.1.1 MPBench Ping-Pong Test

We first report the output obtained by running the MPBench [48] ping-pong test with no message loss. This is a standard benchmark program that reports the throughput obtained when two processes repeatedly exchange messages of a specified size. All messages are assigned the same tag. Figure 5.1 shows the throughput obtained for different message sizes in the ping-pong test under no loss. The throughput values for the LAM-SCTP module are normalized with respect to LAM-TCP.

[Figure 5.1: The performance of SCTP using a standard ping-pong test normalized to TCP under no loss. Series: LAM_SCTP, LAM_TCP; x-axis: Message Size (bytes).]

The results show that SCTP is more efficient for larger message sizes; however, TCP does better for small message sizes. The crossover point is approximately at a message size of 22 Kbytes, where SCTP throughput equals that of TCP. The SCTP stack is fairly new compared to TCP and these results may change as the SCTP stack is further optimized.

We also used the ping-pong test to compare SCTP to TCP under 1% and 2% loss rates. Since the LAM middleware treats short and long messages differently, we experimented with loss for both types of messages. In the experiments short messages were 30 Kbytes and long messages were 300 Kbytes. For both short and long messages under 1% and 2% loss, SCTP performed better than TCP and the results are summarized in Table 5.1.
Table 5.1: The performance of SCTP and TCP using a standard ping-pong test under loss

                        Throughput (bytes/second)
MPI Message Size        Loss: 1%              Loss: 2%
(bytes)                 SCTP      TCP         SCTP      TCP
30K                     54,779    1,924       44,614    1,030
300K                    5,870     1,818       2,825     885

These results show that SCTP is more resilient under loss and can perform rapid recovery of lost segments. Work done in [1] and [53] compares the congestion control mechanisms of SCTP and TCP and shows that SCTP has a better congestion control mechanism that allows it to achieve higher throughput in error prone networks. Some features of SCTP's congestion control mechanism are as follows:

• Use of SACK is an integral part of the SCTP protocol, whereas for TCP it is an option available in some implementations. In those implementations SACK information is carried in TCP options and is, therefore, limited to reporting at most four TCP segments. In SCTP, the number of gap ACK blocks allowed is much larger as it is dictated by the PMTU [59].

• Increase in the congestion window in SCTP is based on the number of bytes acknowledged and not on the number of acknowledgments received [1]. This allows SCTP to recover faster after fast retransmit. Also, SCTP initiates slow start when the congestion window is equal to the slow start threshold. This helps in achieving a faster increase in the congestion window.

• The congestion window variable cwnd can achieve full utilization because when a sender has 1 byte of space in the cwnd and space available in the receive window, it can send a full PMTU of data.

• The FreeBSD KAME SCTP stack also includes a variant called New-Reno SCTP that is more robust to multiple packet losses in a single window [32].

• When the receiver is multihomed, an SCTP sender maintains a separate congestion window for each transport address of the receiver because the congestion status of the network paths may differ from each other. SCTP's retransmission policy helps in increasing the throughput, since retransmissions are sent on one of the active alternate transport addresses of the receiver. The effect of multihoming was not a factor in our tests.

In order to determine the performance overhead of SCTP's CRC32c checksum, we re-ran the ping-pong test for short and long message sizes under no loss with the CRC32c checksum turned on. The throughput, on average, degraded by 12% compared to the case when the checksum was turned off. In wide-area networks, the effect of CRC32c checksums on overall performance is likely to be small since communication time is dominated by network latency. The effect of the CRC32c checksum was more noticeable in the ping-pong tests because the experiments were performed on a low latency local area network. In general, although the time for calculating the checksum does add a few microseconds to message latency, it is likely to be small in comparison to network latency.

5.1.2 Using NAS Benchmarks

In our second experiment, we used the NAS parallel benchmarks (NPB version 3.2) to test the performance of MPI programs with SCTP and TCP as the underlying transport protocol. The characteristics of the NAS benchmarks have already been discussed in Chapter 3. We tested the following seven of those benchmarks: LU, IS, MG, EP, CG, BT and SP. Dataset sizes S, W, A and B were used in the experiments with the number of processes equal to eight. The dataset sizes increase in the order S, W, A, and B, with S being the smallest. We analyzed the type of messages being exchanged in these benchmarks and found that in datasets 'S' and 'W', short messages (defined in the LAM middleware as messages smaller than or equal to 64 Kbytes) are predominantly being sent and received. In datasets 'A' and 'B' we see a greater number of long messages (i.e., messages larger than 64 Kbytes) being exchanged.
[Figure 5.2: TCP versus SCTP for the NAS benchmarks. Series: LAM_SCTP, LAM_TCP; x-axis: benchmarks LU, SP, EP, CG, BT, MG, IS.]

Figure 5.2 shows the results for dataset size 'B' under no loss. The results for the other datasets are not shown in the figure, but as expected from the ping-pong test results, TCP does better for the smaller datasets. These benchmarks use single tags for communication between any pair of nodes and the benefits of using multiple streams in the SCTP module are not being utilized. The SCTP module, in this case, reduces to a single stream per association and, as the results show, the performance on average is comparable to TCP. TCP shows an advantage for the MG and BT benchmarks and we believe the reason for this is that these benchmarks use a greater proportion of short messages in dataset 'B' than the other benchmarks.

5.2 Evaluation Using a Latency Tolerant Program

In this section we compare the performance of the SCTP module with LAM-TCP using a latency tolerant program. As our discussion in the previous chapters showed, in order to take advantage of highly available, shared environments that have large delays and loss, it is important to develop programs that are latency tolerant and capable of overlapping computation with communication to a high degree. Moreover, SCTP's use of multiple streams can provide more concurrency. Here, we investigate the performance of such programs with respect to TCP and also discuss the effect of head-of-line blocking in these programs.

5.2.1 Comparison Using a Real-World Program

In this section we investigate the performance of a realistic program that makes use of multiple tags. Our objectives were two-fold: first, we wanted to evaluate a real-world parallel application that is able to overlap communication with computation, and secondly, we wanted to examine the effect of introducing multiple tags which, in the case of the SCTP module, can map to different streams.
We describe the experiments performed using an application which we call the Bulk Processor Farm program. The communication characteristics of this program are typical of real-world master-slave programs.

The Bulk Processor Farm is a request driven program with one master and several slaves. The slaves ask the master for tasks to do, and the master is responsible for creating tasks, distributing them to slaves and then collecting the results from all the slaves. The master services all task requests from slaves in the order of their arrival. Each task is assigned a different tag, which represents the type of that task. There is a maximum number of different tags (MaxWorkTags) that can be distributed at any time. The slaves can make multiple requests at the same time and when they finish doing some task they send a request for a new one, so at any time, each of the slaves has a fixed number of outstanding job requests. This number was chosen to be ten in our experiments. The slaves use non-blocking MPI calls to issue send and receive requests and all messages received by the slaves are expected messages. The slaves use MPI_ANY_TAG to show that they are willing to perform a task of any type. The master has a total number of tasks (NumTasks) that need to be performed before the program can terminate. In our experiments we set that number to 10,000 tasks.

We experimented with tasks of two different sizes: short tasks equal to 30 Kbytes and long tasks equal to 300 Kbytes. The size of the task represents the message sizes that are exchanged between the master and the slave. In the case of the SCTP module, the tags are mapped to streams, so if a message sent to a slave is lost due to congestion, it is still possible for messages on the other streams to be delivered, and the slave program can continue working without blocking on the lost message.
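The effect of mapping tags to streams can be made concrete with a toy model. The sketch below is not the SCTP module and does not model retransmission timers or congestion control; it only assumes that each message has an arrival time at the receiver and that delivery to the application must be in order within a stream. The function name `delivery_times` and the arrival times in the usage example are hypothetical.

```python
def delivery_times(arrivals, streams):
    """Return the earliest time each message can be handed to the application.

    arrivals[i] is the time message i (in send order) reaches the receiver;
    streams[i] is the stream it was sent on. Delivery is in order per stream,
    so a message cannot be delivered before any earlier message on its stream.
    """
    latest_on_stream = {}
    delivered = []
    for arrival, stream in zip(arrivals, streams):
        # A message waits for itself and for everything sent before it on the same stream.
        ready = max(latest_on_stream.get(stream, 0.0), arrival)
        latest_on_stream[stream] = ready
        delivered.append(ready)
    return delivered


# Four tasks with different tags; the second one is lost and arrives late after retransmission.
arrivals = [1.0, 200.0, 3.0, 4.0]

# One ordered byte stream between master and slave: everything behind the loss waits.
single_stream = delivery_times(arrivals, [0, 0, 0, 0])

# One stream per tag: only the lost message itself is delayed.
per_tag_streams = delivery_times(arrivals, [0, 1, 2, 3])
```

In this example the single-stream case yields delivery times of 1, 200, 200 and 200, while the per-tag case yields 1, 200, 3 and 4, so a slave with several outstanding requests can keep computing on the third and fourth tasks while the second is being recovered.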
In LAM-TCP, however, since all messages between the master and a slave have to be delivered in order, there would be less overlap of communication with computation.

The experiments were run at loss rates of 0%, 1% and 2%. The farm program was run six times for each of the different combinations of loss rates and message sizes and the average value of total run-times are reported. The average and the median values of the multiple runs were very close to each other. We also calculated the standard deviation of the average value, and found the variation across different runs to be very small. Figure 5.3 shows the comparison between LAM-SCTP and LAM-TCP for short and long message sizes under different loss rates.

[Figure 5.3: TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application. Panels: (a) LAM_SCTP versus LAM_TCP for Farm Program, Message Size: Short; (b) LAM_SCTP versus LAM_TCP for Farm Program, Message Size: Long. Series: LAM_SCTP, LAM_TCP; x-axis: Loss Rate (0%, 1%, 2%).]

As seen in Figure 5.3, SCTP outperforms TCP under loss. For short messages, LAM-TCP run-times are 10 to 11 times higher than those of LAM-SCTP at loss rates of 1 to 2%. For long messages, LAM-TCP is slower than LAM-SCTP by 2.58 times at 1% and 2.7 times at 2% loss. Although SCTP performs substantially better than TCP for both short and long messages, the fact that the difference is more pronounced for short messages is very positive. MPI implementations typically try to optimize short messages for latency and long messages for bandwidth [4]. In our latency tolerant application, the slaves, at any time, have a number of pre-posted outstanding receive requests. Since short messages are sent eagerly by the master, they are copied to the correct receive buffer as soon as they are received.
Since messages are being transmitted on different streams in LAM-SCTP, there is less variation in the amount of time the slave might have to wait for a job to arrive, especially under loss conditions, which results in more overlap of communication with computation. The rendezvous mechanism in the long messages introduces synchrony to the transmission of the messages, and the payload can only be sent after the receiver has acknowledged its readiness to accept it. This reduces the amount of overlap possible, compared to the case when messages are sent eagerly. The long messages are also more prone to be affected by loss by virtue of their size. Long messages would typically be used for bulk transfers, and the cost of the rendezvous is amortized over the time required to transfer the data.

We also introduced a tunable parameter to the program described above, which we call Fanout. Fanout represents the number of tasks a master will send to a slave in response to a single task request from that slave. In Figure 5.3 the Fanout is 1. We experimented with a Fanout value of 10 as shown in Figure 5.4. We anticipated that the settings with Fanout=10 would create more possibilities for head-of-line blocking in the LAM-TCP case, since in this case ten tasks are sent to a slave in response to one task request, and are more prone to be affected by loss. Figure 5.4 shows that with larger Fanout, the average run-time for TCP increases substantially for long messages, while there are only slight changes for short messages. One possible explanation for this behavior is TCP's flow-control mechanism. We are sending ten long tasks in response to a job request, and in case of loss, TCP blocks delivery of all subsequent messages until the lost segment is recovered. If the receiver does not empty the receive buffer quickly enough, it can cause the sender to slow down.
Moreover, as discussed in Section 5.1.1, SCTP's better congestion control also aids in faster recovery of lost segments, and this is also reflected in all our results. Also, we expect the performance difference between TCP and SCTP to widen when multihoming is present and retransmissions are sent on an alternate path.

[Figure 5.4: TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application using Fanout of 10. Panels: (a) LAM_SCTP versus LAM_TCP for Farm Program, Message Size: Short, Fanout: 10; (b) LAM_SCTP versus LAM_TCP for Farm Program, Message Size: Long, Fanout: 10. Series: LAM_SCTP, LAM_TCP; x-axis: Loss Rate (0%, 1%, 2%).]

5.2.2 Investigating Head-of-line Blocking

In the experiments presented so far, we have shown that SCTP performs better than TCP under loss, and there are two main factors affecting the results: improvements in SCTP's congestion control mechanism, and the ability to use multiple streams to reduce head-of-line blocking. In this section we examine the performance improvement obtained as a result of using multiple streams in our SCTP module. In order to isolate the effects of head-of-line blocking in the farm program, we created another version of the SCTP module, one that uses only a single stream to send and/or receive messages irrespective of the message tag, rank and context. All other things were identical to our multiple-stream SCTP module.

The farm program was run at different loss rates for both short and long message sizes. The results are shown in Figure 5.5.

[Figure 5.5: Effect of head-of-line blocking in the Bulk Processor Farm program for (a) short and (b) long messages. Panels: (a) LAM_SCTP 10-Streams versus LAM_SCTP 1-Stream for Farm Program, Message Size: Short, Fanout: 10; (b) LAM_SCTP 10-Streams versus LAM_SCTP 1-Stream for Farm Program, Message Size: Long, Fanout: 10. Series: 10 Streams, 1 Stream; x-axis: Loss Rate (0%, 1%, 2%).]

Since multihoming was not present in the experiment, the results obtained show the effect of head-of-line blocking and the advantage due to the use of multiple tags/streams. The reduction in average run-times when using multiple streams compared to a single stream is about 25% under loss for long messages. In the case of short messages, the benefit of using multiple streams becomes apparent at 2% loss, where using a single stream shows an increase in run-time of about 35%. This shows that head-of-line blocking has a substantial effect in a latency tolerant program like the Bulk Processor Farm. The performance difference between the two cases can increase in a network where it takes a long time to recover from packet loss [57].

There is a possibility for even further improvement in these results as the FreeBSD KAME SCTP stack is optimized. Currently, the receive path for an SCTP receiver needs to be completely re-written for optimal performance [57]. As mentioned in Section 2.1, an SCTP sender uses a round-robin scheme to send messages on different streams, and according to the KAME SCTP developers, the receive function, similarly, needs to be carefully thought out and a proper scheme devised for when multiple messages, on different streams, are ready to be received.

Chapter 6

Related Work

The effectiveness of SCTP has been explored for several protocols in high latency, high-loss environments. Researchers have investigated the use of SCTP in FTP [41], HTTP [52], and also over wireless [10] and satellite networks [1]. Using SCTP for MPI has not been investigated.

There are a wide variety of projects that use TCP in an MPI environment.
MPICH-G2 [37] is a multi-protocol implementation of MPI for the Globus [27] environment that was primarily designed to link together clusters over wide-area networks. Globus provides a basic infrastructure for Grid computing and addresses many of the issues with respect to the allocation, access and coordination of resources in a Grid environment. The communication middleware in MPICH-G2 is able to switch between the use of vendor-specific MPI libraries and TCP. In this regard, it is well suited for environments where there is a connection between clusters or other Grid elements. MPICH-G2 maintains a notion of locality and uses this to optimize communication between local nodes in a cluster or local network and nodes which are further away that use TCP. The developers report that MPICH-G2 optimizes TCP but there is no description of these optimizations and how they might work in general when all the nodes are non-local. Another related work is PACX-MPI [38], which is a grid-enabled MPI implementation. PACX-MPI enables execution of MPI applications over a cluster of high-performance computers like Massively Parallel Processors (MPPs), connected through high-speed networks or the Internet. Communication within MPPs is done with vendor MPI and between meta-computers is done through TCP connections. LAM/MPI, as well, provides limited support for the Globus grid environment. LAM can be booted across a Globus grid, but only restricted types of execution are possible [40, 54].

Another project on wide-area communication for grid computing is NetIbis [12], which focuses on connectivity, performance and security problems of TCP in wide-area networks. NetIbis enhances performance by use of data compression over parallel TCP connections. Moreover, it uses TCP splicing for connection establishment through firewalls and can use encryption as well. MagPIe [39] is a library of collective communication optimized for wide-area networks that is built on top of MPICH.
MagPIe constructs communication graphs that are wide-area optimal, taking the hierarchical structure of network topology into account. Their results show that their algorithms outperform MPICH in real wide-area environments.

Researchers have also focused on providing system fault tolerance in large scale distributed computing environments. HARNESS FT-MPI [18] is one such project that handles fault tolerance at the MPI communicator level. It aims at providing application programmers with different methods of dealing with failures in MPI rather than having to rely only on checkpointing and restart. In FT-MPI, the semantics and modes of failure can be controlled by the application using a modified MPI API. Our work provides the opportunity to add SCTP for transport in a Grid environment where we can take advantage of the improved performance in the case of loss.

Several projects have used UDP rather than TCP. As mentioned, UDP is message-based and one can avoid all the "heavy-weight" mechanisms present in TCP to obtain better performance. However, when one adds reliability on top of UDP the advantages begin to diminish. For example, LA-MPI [2], a high-performance, reliable MPI library, uses UDP and supports sender-side retransmission of messages based on acknowledgment timeout. Their retransmission approach has many similarities to TCP/IP but they bypass the complexities of TCP's flow and congestion control mechanisms. They justify their use of a simple retransmission technique by stating that their scheme is targeted at low-error rate cluster environments and is thus appropriate. LA-MPI also supports fault tolerance through message striping across multiple network interfaces and can achieve greater bandwidth through simultaneous transfer of data. Automatic network failover in case of network failures is currently under development in their implementation.
LA-MPI reports performance of their implementation over UDP/IP to be similar to TCP/IP performance of other MPI implementations over Ethernet.

WAMP [61] is an example of UDP for MPI over wide-area networks. This work presents a user-level communication protocol where reliability was built on top of UDP. Interestingly, WAMP only wins over TCP in heavily congested networks where TCP's congestion avoidance mechanisms limit bandwidth. One of the reasons for WAMP's improvement over TCP is the fact that their protocol does not attempt to modify its behavior based on network congestion and maintains a constant flow regardless of the network conditions. Their protocol, in fact, captures bandwidth at the expense of other flows in the network. This technique may not be appropriate for Internet-like environments where fairness and network congestion must be considered. Limitation on achievable bandwidth due to TCP's congestion control is a problem but hopefully research in this area will lead to better solutions for TCP that can also be incorporated into SCTP.

Another possible way to use streams is to send eager short messages on one stream and long messages on a second one. LA-MPI followed this approach with their UDP/IP implementation of MPI [2]. This potentially allows for the fast delivery of short messages and a more optimized delivery for long messages that require more bandwidth. It also may reduce head-of-line blocking for the case of short messages waiting for long messages to complete. However, one has to maintain MPI message ordering semantics and thus, as is the case in LA-MPI, sequence numbers were introduced to ensure that the messages are received strictly in the order they were posted by the sender. Thus, in the end, there is no added opportunity for concurrent progression of messages as there is in the case of our implementation.
There have been a number of MPI implementations that have explored im-proving communication with emphasis on some particular aspect of the overall per-formance. Work discussed in [44] integrates MPI's receive queue management into the TCP/IP protocol handler. The handler is interrupt driven and copies received data from the TCP receive buffer to the user space. The focus of this work is on avoiding polling of sockets and reducing system call overhead in large scale clusters. In general, these implementations require special drivers or operating systems and will not have a great degree of support, which limits their use. Although the same can be said about SCTP, the fact that SCTP has been standardized and implementations have begun to emerge, are all good indications of more wide-spread support. Recently Open MPI has been announced which is a new public domain ver-sion of MPI-2 that builds on the experience gained from the design and implemen-tation of LAM/MPI , LA-MPI and FT-MPI [17] and PACX-MPI. Open MPI takes a component-based approach to allow the flexibility to mix and match components for collectives and for transport and link management. It is designed to be scalable and fault-tolerant and provides a platform for third-party research and enables ad-dition of new components. We hope that our work can be incorporated into Open MPI as a transport module. TEG [65] is a fault-tolerant point-point communica-tion module in Open MPI that supports multihoming and the ability to stripe data across interfaces. It provides concurrent support for multiple network types, such as Myrinet, InfiniBand, GigE etc., and can carry out message fragmentation and 69 delivery utilizing different multiple network interfaces (NICs). The ability to sched-ule data over different interfaces has been proposed for SCTP and may provide an alternative way to provide the functionality of TEG in environments like the Inter-net. 
A group at the University of Delaware is researching Concurrent Multipath Transfer (CMT) [29, 28], which uses SCTP's multihoming feature to provide simultaneous transfer of data between two endpoints via two or more end-to-end paths. The objective of using CMT between multihomed hosts is to increase application throughput. CMT operates at the transport layer and is thus more efficient than multipath transfer at the application level, since it has access to finer details about the end-to-end paths. CMT is currently being integrated into the FreeBSD KAME SCTP stack and will be available as a sysctl option by the end of 2005.

Another recent project, still under development, is the Open Run-Time Environment (OpenRTE) [8], for supporting distributed high-performance computing in heterogeneous environments. It is a spin-off from the Open MPI project and has drawn from existing approaches to large scale distributed computing like LA-MPI, HARNESS FT-MPI and the Globus project. It focuses on providing:
1. Ease-of-use, with focus on transparency and the ability to execute an application on a variety of computing resources.
2. Resilience, with user-definable/selectable error management strategies.
3. Scalability and extensibility, with the ability to support addition of new features.

It is based on the modular component architecture developed for the Open MPI project, and the behavior of any of the OpenRTE subsystems can be changed by defining a new component and selecting it for use. Their design allows users to customize the run-time behavior for their applications. A beta version of OpenRTE is currently being evaluated by the developers. We believe this project and the Open MPI project can benefit from using SCTP in their transport module. SCTP is particularly valuable for applications that require monitoring and path/session failure detection, since its heartbeat mechanism is designed to provide active monitoring of associations.
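The heartbeat-based monitoring mentioned above can be sketched as a small per-path state holder. This is a simplified illustration, not code from any SCTP stack: the class `PathMonitor` and its method names are hypothetical, and the threshold default of 5 is assumed here to follow the path retransmission limit suggested in the SCTP specification [56]. A real stack also manages timers, retransmission timeouts and association-level limits, which are omitted.

```python
class PathMonitor:
    """Toy model of SCTP-style path failure detection (illustrative only)."""

    def __init__(self, max_retrans: int = 5):
        # Assumed threshold: consecutive unanswered heartbeats tolerated
        # before the path is considered inactive.
        self.max_retrans = max_retrans
        self.error_count = 0
        self.active = True

    def on_heartbeat_ack(self) -> None:
        # Any acknowledged heartbeat clears the error count and revives the path.
        self.error_count = 0
        self.active = True

    def on_heartbeat_timeout(self) -> None:
        # An unanswered heartbeat counts as one error; the path is marked
        # inactive once the count exceeds the threshold.
        self.error_count += 1
        if self.error_count > self.max_retrans:
            self.active = False


monitor = PathMonitor(max_retrans=2)
for _ in range(3):
    monitor.on_heartbeat_timeout()
print(monitor.active)
```

In a multihomed association, a middleware layer could consult such per-path state to decide when traffic should move to an alternate path.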
One potential module in Open MPI may be a wide-area network message progression module.

Chapter 7 Conclusions and Future Work

In this thesis we discussed the design and evaluation of using SCTP for MPI. SCTP is better suited as a transport layer for MPI because of several distinct features not present in TCP. We have designed the SCTP module to address the head-of-line blocking problem present in the LAM-TCP middleware. We have shown that SCTP matches MPI semantics more closely than TCP, and we have taken advantage of the multistreaming feature of SCTP to provide a direct mapping from streams to MPI message tags. This has resulted in increased concurrency at the TRC level in the SCTP module, compared to concurrency at the process level in LAM-TCP. Our SCTP module's state machine uses one-to-many style sockets and avoids the use of expensive select system calls, which leads to increased scalability in large scale clusters. In addition, SCTP's multihoming feature makes our module fault tolerant and resilient to network path failures.

We have evaluated our module and reported the results of several experiments using standard benchmark programs as well as a real-world application, and compared the performance with LAM-TCP. Simple ping-pong tests, under no loss, have shown that LAM-TCP outperforms the SCTP module for small message sizes but SCTP does better for large messages. Additionally, SCTP's performance is comparable to TCP's for standard benchmarks such as the NAS benchmarks when larger dataset sizes are used. The strengths of SCTP over TCP become apparent under loss conditions, as seen in the results for ping-pong tests and our latency tolerant Bulk Processor Farm program. When different tags are used in a program, the advantages due to multistreaming in our module can lead to further benefits in performance. SCTP is a robust transport level protocol that extends the range of MPI programs that can effectively execute in a wide-area network.
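The mapping from MPI message tags to SCTP streams summarized above can be illustrated with a short sketch. The mapping function below is an assumption chosen for simplicity and is not necessarily the one used in our module: `stream_for`, `group_by_stream`, and the modulo rule are hypothetical names and choices. The point it shows is that messages with the same tag and context always land on the same stream, so their relative order is preserved, while messages with different tags can be placed on different streams and progress independently.

```python
def stream_for(tag: int, context: int, num_streams: int) -> int:
    """Pick a stream number for a message (hypothetical modulo mapping)."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    return (tag + context) % num_streams


def group_by_stream(messages, num_streams):
    """Group (tag, context, payload) tuples by assigned stream, keeping send order."""
    lanes = {}
    for tag, context, payload in messages:
        stream = stream_for(tag, context, num_streams)
        lanes.setdefault(stream, []).append(payload)
    return lanes


sent = [(5, 0, "a"), (6, 0, "b"), (5, 0, "c")]
print(group_by_stream(sent, num_streams=10))
```

With a fixed number of streams, distinct tags can collide on one stream (tags 5 and 15 in this sketch). A collision keeps ordering correct and only reduces the available concurrency for those tags.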
There are many advantages of using MPI in open environments like the Internet, the first being portability and the ability to execute a large body of MPI applications unchanged in diverse environments. Secondly, not everybody has access to dedicated high performance clusters, and this provides the opportunity to take advantage of computing resources in other environments as well as to access specialized resources not locally available. There are, of course, many performance issues of operating in an open environment, and the need for dynamically optimizing connections remains an active area of research in the networking community. Several variants of TCP have been proposed and we believe that SCTP, due to its similarity with TCP, will also be able to take advantage of improvements in that area.

Our work has the potential to spawn a new body of research where other features of SCTP, like multihoming and load balancing, can be investigated to better support parallel applications using MPI. The ongoing integration of CMT in the SCTP stack poses interesting questions about potential benefits for MPI applications. Work has also been done on prioritization of streams in SCTP [26], and it would be interesting to investigate the benefits to MPI if a selection of different latencies is made available to messages by assigning them to different priority streams.

In general, the performance requirements of different MPI programs will vary. There will be those programs that can only achieve satisfactory performance on dedicated machines, with low-latency and high-bandwidth links. On the other hand, there will be those latency tolerant programs that will be able to run just as well in highly available, shared environments that have large delays and loss. Our contribution is to the latter type of programs, for extending their performance in open environments such as the Internet.
Ideally, we want to increase the portability of MPI and in doing so, encourage programmers to develop programs more towards this end of the spectrum.

Bibliography

[1] Rumana Alamgir, Mohammed Atiquzzaman, and William Ivancic. Effect of congestion control on the performance of TCP and SCTP over satellite networks. In NASA Earth Science Technology Conference, Pasadena, CA, June 2002.
[2] Rob T. Aulwes, David J. Daniel, Nehal N. Desai, Richard L. Graham, L. Dean Risinger, Mark A. Taylor, Timothy S. Woodall, and Mitchel W. Sukalski. Architecture of LA-MPI, a network-fault-tolerant MPI. In 18th International Parallel and Distributed Processing Symposium (IPDPS'04), Santa Fe, New Mexico, April 2004.
[3] Ryan W. Bickhart. Transparent SCTP shim, personal communication. University of Delaware, http://www.cis.udel.edu/~bickhart/research/shim.html.
[4] Ron Brightwell and Keith Underwood. Evaluation of an eager protocol optimization for MPI. In PVM/MPI, pages 327-334, 2003.
[5] Greg Burns and Raja Daoud. Robust message delivery with guaranteed resources. In Proceedings of Message Passing Interface Developer's and User's Conference (MPIDC), May 1995.
[6] Sadik G. Caglar, Gregory D. Benson, Qing Huang, and Cho-Wai Chu. A multi-threaded implementation of MPI for Linux clusters. In 15th IASTED International Conference on Parallel and Distributed Computing and Systems, Los Angeles, November 2003.
[7] Neal Cardwell, Stefan Savage, and Thomas Anderson. Modeling TCP latency. In INFOCOM, pages 1742-1751, 2000.
[8] R. H. Castain, T. S. Woodall, D. J. Daniel, J. M. Squyres, B. Barrett, and G. E. Fagg. The open run-time environment (OpenRTE): A transparent multi-cluster environment for high-performance computing, September 2005. To appear in Euro PVM/MPI 2005, 12th European Parallel Virtual Machine and Message Passing Interface Conference.
[9] NASA Ames Research Center. Numerical aerodynamic simulation (NAS) parallel benchmark (NPB) benchmarks. http://www.nas.nasa.gov/Software/NPB/.
[10] Yoonsuk Choi, Kyungshik Lim, Hyun-Kook Kahng, and I. Chong. An experimental performance evaluation of the stream control transmission protocol for transaction processing in wireless networks. In ICOIN, pages 595-603, 2003.
[11] Wu-chun Feng. The future of high-performance networking, 2001. Workshop on New Visions for Large-Scale Networks: Research and Applications.
[12] A. Denis, O. Aumage, R. Hofman, K. Verstoep, T. Kielmann, and H. E. Bal. Wide-area communication for grids: an integrated solution to connectivity, performance and security problems. In High Performance Distributed Computing, 2004. Proceedings. 13th IEEE International Symposium on, pages 97-106, June 2004.
[13] P. Dickens, W. Gropp, and P. Woodward. High performance wide area data transfers over high performance networks. In International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems, 2002.
[14] Rossen Dimitrov and Anthony Skjellum. Software architecture and performance comparison of MPI/Pro and MPICH. In Lecture Notes in Computer Science, volume 2659, pages 307-315, January 2003.
[15] Tom Dunigan, Matt Mathis, and Brian Tierney. A TCP tuning daemon. In Supercomputing '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing, pages 1-16, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
[16] EmuLab. Network Emulation Testbed. http://www.emulab.net/.
[17] Graham E. Fagg and Jack J. Dongarra. Building and using a fault-tolerant MPI implementation. International Journal of High Performance Computing Applications, 18(3):353-361, 2004.
[18] Graham E. Fagg and Jack J. Dongarra. HARNESS fault tolerant MPI design, usage and performance issues. Future Gener. Comput. Syst., 18(8):1127-1142, 2002.
[19] Kevin Fall and Sally Floyd. Simulation-based comparisons of Tahoe, Reno and SACK TCP. SIGCOMM Comput. Commun. Rev., 26(3):5-21, 1996.
[20] Ahmad Faraj and Xin Yuan. Communication characteristics in the NAS parallel benchmarks. In PDCS, pages 724-729, 2002.
[21] W. Feng and P. Tinnakornsrisuphap. The failure of TCP in high-performance computational grids. In Supercomputing '00: Proceedings of the 2000 ACM/IEEE conference on Supercomputing, page 37, Washington, DC, USA, 2000. IEEE Computer Society.
[22] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In Supercomputing Symposium '94, Toronto, Canada, June 1994.
[23] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.
[24] Al Geist, William Gropp, Steve Huss-Lederman, Andrew Lumsdaine, Ewing L. Lusk, William Saphir, Tony Skjellum, and Marc Snir. MPI-2: Extending the message-passing interface. In Euro-Par, Vol. I, pages 128-135, 1996.
[25] Thomas J. Hacker, Brian D. Noble, and Brian D. Athey. Improving throughput and maintaining fairness using parallel TCP. In IEEE INFOCOM, 2004.
[26] Gerard J. Heinz and Paul D. Amer. Priorities in SCTP multistreaming. In SCI '04, Orlando, FL, July 2004.
[27] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl Journal of Supercomputer Applications, 11(2):115-128, 1997.
[28] Janardhan R. Iyengar, Paul D. Amer, and Randall Stewart. Retransmission policies for concurrent multipath transfer using SCTP multihoming. In ICON 2004, Singapore, November 2004.
[29] Janardhan R. Iyengar, Keyur C. Shah, Paul D. Amer, and Randall Stewart. Concurrent multipath transfer using SCTP multihoming. In SPECTS 2004, San Jose, July 2004.
[30] M. Jain, R. Prasad, and C. Dovrolis. The TCP bandwidth-delay product revisited: Network buffering, cross traffic, and socket buffer auto-sizing. Technical Report GIT-CERCS-03-02, Georgia Tech, February 2003.
[31] Armando L. Caro Jr., Paul D. Amer, and Randall R. Stewart. Retransmission policies for multihomed transport protocols. Technical Report TR2005-15, CIS Dept, University of Delaware, March 2005.
[32] Armando L. Caro Jr., Keyur Shah, Janardhan R. Iyengar, Paul D. Amer, and Randall R. Stewart. SCTP and TCP variants: Congestion control under multiple losses. Technical Report TR2003-04, CIS Dept, U of Delaware, February 2003.
[33] Humaira Kamal. Description of LAM TCP RPI module. Technical report, University of British Columbia, Computer Science Department. Available at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/LAM_TCPJlPLMODULE.pdf, 2004.
[34] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP-based middleware for MPI in wide-area networks. In 3rd Annual Conf. on Communication Networks and Services Research (CNSR2005), pages 157-162, Halifax, May 2005. IEEE Computer Society.
[35] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP versus TCP for MPI. In Supercomputing '05: Proceedings of 2005 ACM/IEEE conference on Supercomputing (to appear), Seattle, WA, November 2005.
[36] KAME. SCTP stack implementation for FreeBSD. http://www.kame.net.
[37] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2: A grid-enabled implementation of the message passing interface. CoRR, cs.DC/0206040, 2002.
[38] Rainer Keller, Edgar Gabriel, Bettina Krammer, Matthias S. Mueller, and Michael M. Resch. Towards efficient execution of MPI applications on the grid: Porting and optimization issues. Journal of Grid Computing, 1:133-149, June 2003.
[39] Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, and Raoul A. F. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. SIGPLAN Not., 34(8):131-140, 1999.
[40] The LAM/MPI Team, Open Systems Lab. LAM/MPI user's guide, September 2004. Pervasive Technology Labs, Indiana University.
[41] Sourabh Ladha and Paul Amer. Improving multiple file transfers using SCTP multistreaming. In Proceedings IPCCC, April 2004.
[42] Linux Kernel Stream Control Transmission Protocol (lksctp) project, withsctp lksctp-tools utility, lksctp-tools package. http://lksctp.sourceforge.net/.
[43] Matt Mathis, John Heffner, and Raghu Reddy. Web100: Extended TCP instrumentation for research, education and diagnosis. SIGCOMM Comput. Commun. Rev., 33(3):69-79, 2003.
[44] M. Matsuda, T. Kudoh, H. Tazuka, and Y. Ishikawa. The design and implementation of an asynchronous communication mechanism for the MPI communication model. In IEEE Intl. Conf. on Cluster Computing, pages 13-22, Dana Point, Ca., Sept 2004.
[45] Alberto Medina, Mark Allman, and Sally Floyd. Measuring the evolution of transport protocols in the Internet, April 2005. To appear in ACM CCR.
[46] Message Passing Interface Forum. MPI: A Message Passing Interface. In Supercomputing '93: Proceedings of 1993 ACM/IEEE conference on Supercomputing, pages 878-883. IEEE Computer Society Press, 1993.
[47] G. Minshall, Y. Saito, J. Mogul, and B. Verghese. Application performance pitfalls and TCP's Nagle algorithm. In Workshop on Internet Server Performance, Atlanta, GA, USA, May 1999.
[48] Philip J. Mucci. MPBench: Benchmark for MPI functions. Available at: http://icl.cs.utk.edu/projects/llcbench/mpbench.html.
[49] Peter S. Pacheco. Parallel programming with MPI. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996.
[50] Scott Pakin, Mario Lauria, and Andrew Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing, San Diego, CA, December 1995.
[51] PlanetLab. An open platform for developing, deploying, and accessing planetary-scale services. http://www.planet-lab.org/.
[52] R. Rajamani, S. Kumar, and N. Gupta. SCTP versus TCP: Comparing the performance of transport protocols for web traffic. Technical report, University of Wisconsin-Madison, May 2002.
[53] S. Fu, M. Atiquzzaman, and W. Ivancic. SCTP over satellite networks. In IEEE Computer Communications Workshop (CCW 2003), pages 112-116, Dana Point, Ca., October 2003.
[54] Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Boot systems services interface (SSI) modules for LAM/MPI. Technical Report TR576, Indiana University, Computer Science Department, 2003.
[55] W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff. UNIX Network Programming, Vol. 1, Third Edition. Pearson Education, 2003.
[56] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, M. Kalla, I. Rytina, L. Zhang, and V. Paxson. The stream control transmission protocol (SCTP). Available from http://www.ietf.org/rfc/rfc2960.txt, October 2000.
[57] Randall Stewart. Primary designer of SCTP and developer of FreeBSD KAME SCTP stack, private communication, 2005.
[58] Randall Stewart and Paul D. Amer. Why is SCTP needed given TCP and UDP are widely available? Internet Society, ISOC Member Briefing #17. Available from http://www.isoc.org/briefings/017/briefing17.pdf, June 2004.
[59] Randall R. Stewart and Qiaobing Xie. Stream control transmission protocol (SCTP): a reference guide. Addison-Wesley Longman Publishing Co., Inc., 2002.
[60] Alexander Tormasov and Alexey Kuznetsov. TCP/IP options for high-performance data transmission, 2002. http://builder.com.com/5100-6372_14-1050878.html.
[61] Rajkumar Vinkat, Philip M. Dickens, and William Gropp. Efficient communication across the Internet in wide-area MPI. In Conference on Parallel and Distributed Programming Techniques and Applications, Las Vegas, Nevada, USA, 2001.
[62] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. High-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, September 1996.
[63] Paul A. Watson. Slipping in the window: TCP reset attacks, October 2003. CanSecWest Security Conference.
[64] Eric Weigle and Wu-chun Feng. A case for TCP Vegas in high-performance computational grids. In HPDC '01: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC-10'01), page 158, Washington, DC, USA, 2001. IEEE Computer Society.
[65] T. S. Woodall, R. L. Graham, R. H. Castain, D. J. Daniel, M. W. Sukalski, G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, and A. Lumsdaine. TEG: A high-performance, scalable, multi-network point-to-point communications methodology. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.
[66] J. Yoakum and L. Ong. An introduction to the stream control transmission protocol (SCTP). Available from http://www.ietf.org/rfc/rfc3286.txt, May 2002.

