UBC Theses and Dissertations


Hybrid design of MPI over SCTP Tsai, Mike Yao Chen 2007

Full Text


Hybrid Design of MPI over SCTP

by

Mike Yao Chen Tsai

B.Sc., The University of British Columbia, 2005

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

The Faculty of Graduate Studies (Computer Science)

The University Of British Columbia

September, 2007

© Mike Yao Chen Tsai 2007

Abstract

The Message Passing Interface (MPI) is a popular interface for writing parallel applications. It has been designed to run over many different types of network interconnects, ranging from commodity Ethernet to more specialized hardware, including shared memory and Remote Direct Memory Access (RDMA) devices such as InfiniBand and the recently standardized Internet Wide Area RDMA Protocol (iWARP). The API itself provides both point-to-point and remote memory access (RMA) operations to the application. However, it is often implemented on top of one kind of underlying network device, namely entirely RDMA or entirely point-to-point. As a result, it is often not possible to provide a direct mapping from the software semantics to the underlying hardware. In this work, we propose a hybrid approach to designing MPI in which the network device used can depend on the functional requirement. This allows the MPI API to exploit the potential performance benefits of the underlying hardware more directly. Another highlight of this work is the design of the MPI middleware to be IP based in order to provide support for both cluster and wide area network environments; this can be achieved via the use of a commodity transport layer protocol, namely the Stream Control Transmission Protocol (SCTP). We will demonstrate how SCTP can be used to support MPI with different kinds of network devices and to provide multirailing support from the transport layer.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Thesis outline
2 Background
  2.1 Message Passing Interface
  2.2 Stream Control Transmission Protocol
  2.3 Internet Wide Area RDMA Protocol
3 Design and Implementation
  3.1 Overview
  3.2 MPICH2-SCTP
    3.2.1 Dynamic Connection and Initialization
    3.2.2 Message de-multiplexing and SCTP events
    3.2.3 Progress Engine
    3.2.4 Multihoming
    3.2.5 Process Spawning
    3.2.6 Limitations
  3.3 iWARP Implementation
    3.3.1 Direct Data Placement (DDP) over SCTP
    3.3.2 DDP header
    3.3.3 DDP to SCTP semantic mapping
    3.3.4 Asynchronous support
    3.3.5 User level API, Verbs
    3.3.6 Limitations
  3.4 MPICH2-Hybrid
    3.4.1 Memory Management
    3.4.2 Connection Management
    3.4.3 Progress Engine Extension
    3.4.4 MPI-2 RMA semantic mapping
    3.4.5 Limitations
4 Evaluation
  4.1 Experiment setup
  4.2 MPI_Put/Get performance
  4.3 CMT Related Experiments
  4.4 General Benchmarks
    4.4.1 Microbenchmarks
    4.4.2 Real applications
  4.5 Summary
5 Related Work
  5.1 Hybrid Design Related
  5.2 CMT related
  5.3 RDMA related
6 Future work and Conclusion
Bibliography

List of Tables

3.1 OSC-iWARP Verbs (Function calls extended from OSC-iWARP are underlined)
3.2 MPI-2 RMA function prototypes

List of Figures

1.1 Hybrid MPI design over SCTP
2.1 MPI middleware structured view
2.2 iWARP stack layout
3.1 MPICH2 Implementation Structure
3.2 Object/Component view of ch3:sctp
3.3 MPI and SCTP semantic mapping
3.4 Progress Engine design
3.5 MPI_ANY_TAG issue
3.6 iWARP semantics and modules
3.7 Component view of ch3:hybrid
3.8 Static connect all routine
3.9 Dynamic connection setup for QPs
3.10 MPI_Put via RDMA_Write during fence call
3.11 MPI_Get via RDMA_Write during fence call
3.12 MPI_Get via RDMA_Read during fence call
4.1 Cluster Configuration
4.2 MPI_Put/MPI_Get latency test
4.3 MPI-1/2 Synthetic test
4.4 Iperf SCTP test
4.5 OSU Bandwidth test
4.6 OSU Latency test (Less is better)
4.7 OSU Bi-directional test (More is better)
4.8 NAS Parallel MPI Benchmarks (Less is better)
5.1 Layer view of SDP
5.2 TCP Latency test with NetPIPE (Less is better)

Acknowledgements

I would like to express sincere gratitude to my supervisor, Dr. Alan Wagner. His encouragement, insight, and faith in my abilities are immensely appreciated. Without his guidance and ideas, I would not have completed this program. I would like to dedicate this thesis to my family and Jiemin for their unquestioning love and support. I am also grateful to Dr. Norm Hutchinson for being the second reader and providing his insight in making this thesis more complete. Many thanks to Randall Stewart, one of the core designers of SCTP, for technical support of the stack. Last but not least, I would like to thank Dennis Cheong for his support and the gang in the Distributed Systems Group for all the fun times: Brad Penoff, Kan Cai, Camilo Rostoker, Brendan Cully, Gang Peng, Le Cuong and other buddies.

MIKE YAO CHEN TSAI
The University of British Columbia
September 2007

Chapter 1

Introduction

1.1 Motivation

Network components are critical to any distributed system in both performance and reliability. Traditional network fabrics such as Ethernet are widely available in every computing environment; however, these devices have been criticized for the lack of performance scaling as network speed increases.
As a result, various technologies have been proposed, such as TCP offload engines (TOE) and RDMA (Remote Direct Memory Access), to enable transport protocol offloading and zero-copy respectively. Although RDMA based devices such as InfiniBand can provide low latency and high bandwidth, they do not support the traditional IP framework and, in comparison to Ethernet, they are still relatively expensive. There are many different network interconnects available, each with its own pros and cons. I believe that the application should be able to decide on the appropriate network devices to support with respect to its functional requirements and hence increase flexibility.

Currently, there are many network technologies that can provide low latency and high bandwidth. These include InfiniBand [8], Myrinet [1], and Quadrics [3]. The per-port cost of these devices is often fairly high, and whether such high-end devices are necessary depends on the application. The argument against RDMA capable devices has been the application semantic mapping [21]. RDMA requires memory to be pinned in order for the receive operation to take place, and pinning and unpinning are relatively expensive operations because they require the host operating system to intervene. As a result, applications that seek gains in performance need to be modified and redesigned in order to exploit the benefits of RDMA devices.

Ethernet, on the other hand, is relatively cheap and common in both wide area networks and major cluster environments. Although its performance is not as competitive as that of the others, there is a large base of legacy applications that rely on the traditional socket interface and on the existing Ethernet infrastructure that has been deployed for many years. However, Ethernet has been highly criticized for its high latency, and it is not suitable for applications that require low latency support in a cluster environment.
Some of the latency cost can be minimized by designing the software in a task farm structure that allows communication to overlap with computation in order to hide some of the latency; this has been shown to be particularly useful in a wide area network in which network conditions can fluctuate [28]. There are also technologies that can help improve the performance of traditional Ethernet devices, such as SolarFlare's Ethernet Accelerator [5]. This is an interesting approach as it supports the Berkeley socket interface and provides the performance improvement transparently over the IP network without any modification to legacy applications. Hence, we believe that Ethernet still has its place in the future, as there have been proposals to build 40/100 GbE (Gigabit Ethernet) devices [7]. With its large deployment base relative to other network technologies (see footnote 1), it may remain the dominant network interconnect for years to come.

Footnote 1: In the latest rankings of the Top 500 supercomputers, 211 of the systems use GbE as the internal network [50].

MPI [18] is a popular message passing interface for writing parallel applications and it supports a variety of network interconnects ranging from shared memory to Ethernet to RDMA capable devices. In addition to the hardware support, the MPI specification [18] has two notions of communication: point-to-point (MPI-1) and remote memory access (MPI-2) operations. The MPI-2 specification was later defined to clarify some of the MPI-1 issues and to propose extensions of the MPI programming model substantially beyond the strict message-passing model represented by MPI-1. We focus on the communication aspect of MPI: MPI-2 introduced remote memory access (one-sided) operations, and these specifications are meant to be hardware independent.
In most middleware implementations of MPI-1/2, the MPI-1/2 communication functions are usually implemented entirely on top of point-to-point or RDMA hardware regardless of the specification. In this work, we propose a hybrid design of MPI based on its functional requirements. We believe that providing a more direct mapping from the MPI primitives to the underlying network device can increase the flexibility of the middleware and allow the application to better exploit the hardware. In the end, all of the necessary changes are transparent to the end user.

[Figure 1.1: Hybrid MPI design over SCTP. Layers shown, top to bottom: MPI-1 API and MPI-2 RMA API; MPI middleware (Pt2Pt and RMA); socket and software iWARP; SCTP/IP with multirail support; one or more Ethernet based devices.]

Our prototype implementation of MPI middleware supports both socket based calls for point-to-point and RDMA based semantics via iWARP (Internet Wide Area RDMA Protocol), which is a set of protocol specifications for RDMA over IP. Both semantics use the Stream Control Transmission Protocol (SCTP) as the underlying transport layer protocol to provide reliable message transfer, and both use commodity Ethernet devices for local and wide area networks, as shown in Figure 1.1.

SCTP was considered because it is message-based: it simplifies the implementation of point-to-point and provides a solid foundation for iWARP. Namely, it avoids the message framing problem that occurs when iWARP is layered over TCP. Furthermore, our previous work has shown that SCTP is able to adapt to fluctuating network conditions (varying latency and loss) better than TCP [28], and we will demonstrate how various configurations of SCTP can be supported in our implementation to exploit performance improvement via multirailing in a cluster environment.
Due to lack of hardware support, the iWARP stack is implemented in software [16, 38] and the hardware device itself is simulated. As a result, performance improvement of the simulated RDMA device is not the main focus of this work. Moreover, we will describe how a software based iWARP stack can be used with certain Ethernet accelerators to provide low latency and can also be used to communicate with native iWARP based hardware.

1.2 Contribution

The main contribution of this work is the exploration of issues that arise in a hybrid design to support MPI-1 and MPI-2, where a functional decomposition is implemented such that the point-to-point and one-sided primitives use the best available device. We take advantage of SCTP as it provides a common layer for both point-to-point and iWARP. The RDMA device is based on the Ohio Supercomputing Center's (OSC) software implementation of iWARP. The OSC iWARP software stack [16] only supports TCP/IP, and we extend this implementation to use SCTP, as there is currently no such implementation that we are aware of. Previous work by our group had constructed an MPICH2 (see footnote 2) implementation of SCTP and this was the basis for the hybrid design extension.

1.3 Thesis outline

The remainder of the thesis is organized as follows. In Chapter 2, we provide some background information about the higher level components that are being addressed by this work, namely MPI, SCTP and iWARP. In Chapter 3, we discuss the design of each of the main components in our hybrid design and their interactions. In Chapter 4, we evaluate the design and compare the hybrid extension to its parent design, the SCTP channel. In Chapter 5, we present some of the related work to this project. Lastly, we discuss possible future extensions to our design and conclude in Chapter 6.
Footnote 2: MPICH2 is a popular open source implementation of MPI middleware, and SCTP is implemented within MPICH2 as a communication module. It is also part of the official release of MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich/) and has been thoroughly tested.

Chapter 2

Background

In this chapter, we provide some background information about the main technologies that were used in the design and implementation of our system.

2.1 Message Passing Interface

Message Passing Interface (MPI) is a popular API for writing parallel applications for distributed systems. Its popularity is based on its transparent support for a variety of network devices and on its language bindings for C/C++ and Fortran. It is primarily used in writing scientific computing code that can run on a variety of different clusters, ranging from small clusters to large supercomputers. MPI typically consists of a middleware library, various helper commands to compile and run MPI-specific programs, and a runtime daemon to create and tear down the MPI universe.

From a component point of view, MPI consists of a process scheduler, memory manager, job scheduler and communication modules (Figure 2.1). Often MPI middleware is modularized to give more flexibility in the choice of hardware and software.

The MPI daemon acts as an administrator that runs in the background and is responsible for managing the runtime environment. Each endpoint has an MPI daemon attached to it, and the daemon communicates with the other daemons to coordinate its activities, such as resource discovery. In addition, the daemon also serves as a database to store the information about each MPI process, and this information is kept globally consistent across the MPI universe.

[Figure 2.1: MPI middleware structured view. Components shown: MPI application; MPI library; MPI middleware (MPI daemon, process manager, job scheduler, communication module with socket and RDMA API); operating system; hardware.]
The MPI library comprises both the MPI-1 and MPI-2 specifications. The MPI-1 library contained over 200 API calls (see footnote 1) and a variety of communication primitives, including the basic send/receive operations in both synchronous and asynchronous forms. MPI allows asynchronous communication via nonblocking primitives to overlap computation with communication, hence helping to improve the latency tolerance of the application. Aside from point-to-point communication, MPI also provides mechanisms for collective communication among a group of MPI processes, such as broadcast, reduce, gather, and all-to-all. The MPI-2 specification was later introduced primarily to clarify issues in MPI-1 and to provide three extensions in the areas of remote memory access (RMA) operations, dynamic process spawning, and parallel I/O.

Footnote 1: MPI-1 specification: http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html

Under the MPI specification, each MPI process is assigned a globally unique rank (from 0 to N-1, where N is the size of the MPI universe) and the rank is used to address an MPI peer with respect to the participants of the parallel application. A typical MPI message has an envelope attached that contains the metadata of the message. The metadata includes the type of message, the source and target rank, and a tag value. In addition, for point-to-point operations, each message is also designated by an opaque value that serves as an MPI-level context ID. An MPI-level context ID is used to separate collections of MPI messages. In particular, the context ID is used to differentiate MPI primitives from each other. For instance, MPI collective messages should not be mixed with simple point-to-point communication. By including a context value, MPI middleware can ensure that MPI messages from different primitives are matched correctly.
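The role of the context ID can be illustrated with a minimal Python sketch. It is not taken from any MPI middleware: the names `Envelope`, `PostedReceive`, `find_match`, and the two context constants are hypothetical, and the envelope is reduced to the fields needed for matching. The sketch only shows that a message carrying the same source rank and tag as a posted receive is still rejected when its context differs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical context IDs: one for point-to-point traffic, one for collectives.
PT2PT_CONTEXT = 0
COLLECTIVE_CONTEXT = 1


@dataclass(frozen=True)
class Envelope:
    """Metadata attached to an incoming message (illustrative fields only)."""
    source: int      # rank of the sender
    tag: int         # user-chosen tag value
    context_id: int  # opaque value separating collections of messages


@dataclass(frozen=True)
class PostedReceive:
    """A receive the application has already posted."""
    source: int
    tag: int
    context_id: int


def find_match(posted: List[PostedReceive], env: Envelope) -> Optional[int]:
    """Return the index of the first posted receive whose source rank, tag
    and context all equal those of the envelope, or None if there is none."""
    for i, recv in enumerate(posted):
        if (recv.source, recv.tag, recv.context_id) == (
                env.source, env.tag, env.context_id):
            return i
    return None
```

With a single receive posted for rank 2, tag 7 in the point-to-point context, a collective message from rank 2 with tag 7 finds no match, while the equivalent point-to-point message matches the posted receive.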
Message matching is performed at the TRC level (Tag, Rank, and Context), and not every operation requires the use of a tag (e.g., collectives and one-sided operations). TRC information is used in both sending and receiving. In addition, MPI allows the use of wildcards in posting a receive call to match from any source (MPI_ANY_SOURCE) and any tag (MPI_ANY_TAG). MPI also provides the abstractions of process groups and communicators, which allow user applications to parallelize their problem domain in a more descriptive way. Overall, the MPI specification exports a comprehensive set of abstractions to the users, and the use of a common specification improves the portability of MPI programs across different MPI middleware implementations.

Internally, MPI defines the notions of short/long and expected/unexpected messages. Short messages are transmitted immediately, while long messages (larger than a threshold value) require the use of a rendezvous protocol to prevent the sending side from overflowing the receiver's buffer. An expected message is one for which the receiver is waiting ("posted"), and an unexpected message is an MPI message that could not be matched and thus will be buffered.

Recall that MPI provides asynchronous transfer of messages; in order to ensure that a nonblocking send or receive request is completed, user applications need to query the MPI middleware. Moreover, with the introduction of MPI-2's dynamic process support, MPI processes can be created and can join the MPI universe dynamically. As a result, within the MPI middleware, messages and endpoints are defined as MPI-level abstractions that need to be progressed and maintained by the Message Progression Layer, or progress engine. In general, the progress engine is responsible for managing the life-cycle of a message and the connection states of MPI peers.
The life-cycle of a message includes: performing message matching for incoming messages, handling unexpected messages, and transmitting pending send requests (short and long messages).

The MPI specification does not mandate any design decisions for the middleware other than the implementation requirements for developers. As a result, there are many popular open source MPI middleware implementations available: LAM/MPI [19], MPICH2 [52], and Open MPI [20]. Most MPI implementations support TCP/IP since Ethernet is widely used and available in every cluster. Aside from TCP/IP, there is also support for shared memory, Myrinet, InfiniBand and iWARP. In general, every large parallel machine supports the MPI library.

With the strong growth and availability of multi-core processors, we believe that parallel computing will become more critical than ever and that MPI will continue to evolve through research and provide the tools that can ease the development of parallel applications.

2.2 Stream Control Transmission Protocol

Stream Control Transmission Protocol (SCTP) is a new reliable transport layer protocol that has recently been standardized by the IETF [48]. It was originally designed for the telephony community. Unlike TCP, it is message oriented; in this respect it is similar to UDP, but without message truncation. Aside from TCP-like features such as congestion control, it provides additional features: multistreaming, multihoming, security enhancements and improved extensibility.

SCTP maps a set of IP and port pairs to an association. Within an association, multiple streams can be supported. A stream is a unidirectional flow of data, and data ordering is only preserved within a given stream. This is analogous to opening multiple TCP connections, except that streams share common flow and congestion control.
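The per-stream ordering rule can be made concrete with a small Python model. This is a toy receiver written for illustration, not the SCTP stack: the class name `StreamReceiver`, the per-stream sequence numbers starting at 0, and the string payloads are all assumptions of the sketch. It delivers a message only when every earlier message on the same stream has been delivered, and it never consults the state of other streams.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class StreamReceiver:
    """Toy receiver that preserves ordering within each stream only."""

    def __init__(self) -> None:
        # Next sequence number expected on each stream (starts at 0).
        self._next_seq: Dict[int, int] = defaultdict(int)
        # Out-of-order messages held back, keyed by stream then sequence.
        self._pending: Dict[int, Dict[int, str]] = defaultdict(dict)
        # Messages handed to the application, in delivery order.
        self.delivered: List[Tuple[int, str]] = []

    def on_arrival(self, stream: int, seq: int, payload: str) -> None:
        """Buffer the message, then deliver everything on this stream
        that is now in order. Other streams are left untouched."""
        self._pending[stream][seq] = payload
        while self._next_seq[stream] in self._pending[stream]:
            data = self._pending[stream].pop(self._next_seq[stream])
            self.delivered.append((stream, data))
            self._next_seq[stream] += 1
```

If message 1 of stream 0 arrives before message 0 of stream 0, it is held back, but a message arriving on stream 1 in the meantime is delivered immediately; once the missing message on stream 0 arrives, both stream 0 messages are delivered in order.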
Moreover, assigning data packets to different streams can overcome the potential head-of-line blocking problem found in TCP [47].

Multihoming is a major feature of SCTP that is not in TCP, and it provides additional fault tolerance for user applications. Multihoming allows an application to bind to multiple network interfaces and switch from one to another upon network failures. For maximum fault tolerance, independent links are preferred so that paths do not share a common bottleneck. There is a recent project from the University of Delaware to extend the multihoming feature to support Concurrent Multipath Transfer (CMT), which inherits the fault tolerance aspect of multihoming and provides additional bandwidth by striping data across all available links at the transport layer [24].

SCTP was introduced with backward compatibility in mind. As a result, both UDP and TCP style sockets are supported through the use of the Berkeley socket API. With respect to UDP-style sockets, SCTP is able to provide a reliable message oriented service. In the underlying design, UDP and TCP style sockets are compatible with each other as everything is mapped to association IDs. In addition to the Berkeley socket API, SCTP introduced another set of APIs to support its new features. For instance, there is a new sctp_bindx call that allows a user to selectively bind to specific network interfaces. Furthermore, there are multiple extended send and receive calls that allow the user to specify or retrieve more SCTP related options or information.

TCP was found to have various security loopholes, for example during the connection establishment and handshake process, where SYN-flood attacks can be made. As a result, SCTP incorporates a cookie mechanism during initialization, with a four-way handshake and a 32-bit sequence number, to prevent such denial of service attacks.
Moreover, SCTP uses a stronger checksum, CRC32c, to ensure end-to-end reliability and data integrity. The stronger checksum requires more computing cycles and offload support for it is often not available in present-day hardware.

SCTP is also more resilient to packet loss as it uses Selective Acknowledgment (SACK). This is available in TCP as well, but with a smaller number of SACK blocks. This feature is useful where loss is likely, such as in wide area and satellite networks. With regard to extensibility, SCTP defines a notion of chunks within each SCTP datagram, and it is possible for users to define their own chunks. For instance, there are proposals for SCTP to support partial reliability of messages.

Overall, SCTP is gradually gaining popularity as many application layer protocols, such as FTP [30] and HTTP [44], have been ported to support the new transport layer protocol. In this work, we focus on supporting SCTP within MPI and iWARP with proper mappings to exploit the benefits of the protocol.

2.3 Internet Wide Area RDMA Protocol

Internet Wide Area RDMA Protocol (iWARP) is a set of protocol specifications to provide RDMA support for wide area networks, and it was designed to operate on top of the existing IP infrastructure to overcome the performance deficiency of traditional Ethernet based network devices. It is an update from the IETF of the RDMA Consortium's RDMA over TCP standard [4]. In order to improve performance, both RDMA and OS-bypass are necessary to move data without the CPU or the operating system being involved. Most RDMA-capable network adapters also handle protocol offloading (TOE), and this allows the NIC to do the network processing locally. Hence, the host CPU can continue application processing concurrently while the network adapters handle the majority of the network traffic, thus improving the overall efficiency.
[Figure 2.2: iWARP stack layout. Layers from top to bottom: Verbs or API; RDMAP; DDP; MPA over TCP, or SCTP; IP.]

iWARP is a combination of the following layers (Figure 2.2):

• Verbs API: The verbs or API layer exposes the RDMA operations to the user application [26]. Furthermore, it also provides direct access to the RDMA capable Network Interface Card (RNIC) hardware. The verbs layer communicates directly with the RDMAP layer. iWARP does not provide any verbs specification. As a result, this layer is flexible in terms of API design. However, without any standardization, every vendor can potentially implement their own API, resulting in application incompatibility.

• Remote Direct Memory Access Protocol (RDMAP): The RDMAP layer provides both traditional send/receive and read/write services to the user applications [10, 42]. RDMAP itself is a fairly thin layer that depends on the lower layer protocol to provide the corresponding service. In this case, the lower layer is the Direct Data Placement layer.

• Direct Data Placement (DDP): The DDP layer is responsible for placing the data in the proper place after it is received [23]. It supports two models of message transfer: tagged and untagged. Tagged message transfers require the use of a steering tag (within the DDP header) to allow the receiver to place the incoming data directly into a user buffer that is registered with the specified steering tag. Untagged message transfers provide the traditional copy-in/out model by specifying a message number and a buffer queue number. Both models require a buffer to be pre-posted in order to ensure correctness. Moreover, each DDP segment is message based and each fragment must have a corresponding header attached. As a result, each DDP segment preserves the self-describing property of the protocol and can be placed at the appropriate location without any information from other DDP segments. In addition, DDP requires reliable delivery from the lower layer protocol.
• Marker PDU Aligned (MPA)-TCP/SCTP: Since DDP is an end-to-end protocol, the intermediate nodes do not necessarily support DDP. As a result, it is possible that intermediate switches/routers can splice the packets; this is known as the "Middle Box Fragmentation" problem. In the worst case, fragmentation could cause the self-describing property to fail. To overcome this challenge, Marker PDU Aligned (MPA) framing was introduced to provide message framing capability for TCP/IP, serving as an extension to TCP/IP in order to maintain backward compatibility [40]. Its function is to insert markers at specific places so that the receiver has a deterministic way to find the markers and locate the right header for the segment. Each MPA segment is known as a Framing Protocol Data Unit (FPDU). In addition, the MPA layer adds a more robust checksum, CRC32, to each FPDU to preserve data integrity. During initialization, active/passive iWARP endpoints are required to exchange information about the specific location of MPA markers. It has been shown that the MPA layer is fairly complex, as the placement of the markers is fairly tricky, and this problem becomes worse when supporting iWARP with multiple network devices, in which out-of-order communication can occur [39]. Moreover, along with the added checksum calculation, the MPA layer can add considerable overhead to the overall performance.

Alternatively, SCTP is a candidate to replace MPA-TCP, as the SCTP stack natively provides reliable message transfer, a stronger checksum and message framing support. With SCTP, it is possible to schedule messages across multiple streams, and out of order if necessary. Recall that SCTP uses SACK to report data gaps during transmission; this is particularly useful in the case of multirailing, in which out-of-order arrival of messages is inevitable.
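The self-describing property of tagged DDP segments, and the reason out-of-order arrival is harmless for them, can be sketched in a few lines of Python. This is a simplified model and not the DDP wire format: the names `TaggedSegment`, `segment_message`, and `place` are hypothetical, and the header is reduced to a steering tag and a buffer offset. Each segment carries enough information to be placed on its own, so the receiver reconstructs the message regardless of the arrival order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaggedSegment:
    """Simplified stand-in for a tagged DDP segment (not the wire format)."""
    stag: int       # steering tag naming a registered user buffer
    offset: int     # position in that buffer where the payload belongs
    payload: bytes


def segment_message(stag: int, data: bytes, max_payload: int) -> List[TaggedSegment]:
    """Split data into segments that each carry their own header."""
    return [
        TaggedSegment(stag, off, data[off:off + max_payload])
        for off in range(0, len(data), max_payload)
    ]


def place(registered: Dict[int, bytearray], seg: TaggedSegment) -> None:
    """Copy one segment into the buffer registered under its steering tag,
    using only the information carried by that segment."""
    buf = registered[seg.stag]
    end = seg.offset + len(seg.payload)
    if end > len(buf):
        raise ValueError("segment does not fit in the registered buffer")
    buf[seg.offset:end] = seg.payload
```

Placing the segments of a message in reverse order still yields the original bytes in the registered buffer, which is the behaviour multirailing relies on when segments arrive out of order.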
In this work, we use SCTP as the lower layer protocol for DDP to exploit some of the possible benefits of this newly standardized transport layer protocol.

Since iWARP is based on a layered design, there are various options for providing iWARP support. For instance, one can implement all or part of the iWARP stack in software that runs in user space or in the host operating system. Although this approach may not provide any performance improvement, it serves as a good rollout strategy as there are only a few iWARP capable devices available on the market. In addition, the software approach allows iWARP to be hardware independent, as it will run on any Ethernet capable device, and it can be easily patched should the specification change. One other interesting approach is to run the software iWARP stack on top of an Ethernet accelerator; this will enable low latency and high bandwidth support. Ultimately, from the performance perspective, iWARP was designed to run on RDMA capable devices (RNIC) (see footnote 2) with protocol offloading in order to provide true zero-copy and OS-bypass. Nonetheless, all these implementation options allow more flexibility in future deployment and support.

Footnote 2: NetEffect 10Gbps iWARP Ethernet Channel Adapter: http://www.neteffect.com/products.html

Chapter 3

Design and Implementation

3.1 Overview

In this chapter, we introduce the design and implementation of the various components in our prototype and describe how they interact with each other.

3.2 MPICH2-SCTP

We have implemented SCTP as a Channel-3 (CH3) device in MPICH2, which is a direct extension of our previous work with LAM/MPI [29]. MPICH2 was developed at Argonne National Laboratory and is one of the most popular MPI implementations. MPICH2 is aimed at supporting both MPI-1 and MPI-2 features and at implementing a new modular design to provide more performance, portability and flexibility.
Conceptually, a CH3 layer implements only a subset of the functions of a higher level abstract module called the Abstract Device Interface 3 (ADI3, the third generation design) (Figure 3.1). Consequently, a channel implements the CH3 interface, and there are existing channels for most communication architectures, including TCP and shared memory (SHMEM). Since the CH3 interface only exposes a fairly small set of primitives, it is easier to implement a CH3 channel than a complete ADI3 device. Internally, a channel device should provide a progress engine to monitor the many states between MPI endpoints, as well as asynchronous support for message transport.

[Figure 3.1: MPICH2 Implementation Structure]

Our CH3 implementation is based on the point-to-point paradigm and uses the SCTP one-to-many (SCTP-UDP) style socket. In the following subsections, we will discuss various implementation highlights and design tradeoffs during our development cycle. The CH3 device for SCTP and the Hybrid device are denoted as ch3:sctp and ch3:hybrid respectively.

3.2.1 Dynamic Connection and Initialization

In this section, we discuss the aspects related to setting up the connections for MPI to use SCTP. One goal of our one-to-many implementation is to minimize the number of file descriptors that need to be opened and managed for each MPI process, and this can be accomplished by using a UDP (one-to-many) style socket.
In a one-to-many design like UDP, it is possible to open a single socket for each MPI process. In addition, it is possible to allow for "lazy connect", where connections are not set up until they are needed. Thus, with dynamic connection, we can dramatically reduce initialization time and allow each MPI process to start computing as soon as possible.

During initialization, each MPI process opens a single SCTP-UDP style socket for both connection and message transfer purposes. This socket information is saved as an MPICH2 specific business card object¹ and committed to the MPICH2 daemon. As a result, MPI processes that wish to set up a connection can query the MPICH2 daemon to look up the specific information, such as the port of the designated process' SCTP-UDP style socket. We use the UDP model to piggyback application level messages on the connection setup process.

Figure 3.2: Object/Component view of ch3:sctp

Within MPICH2, each endpoint is encapsulated as a Virtual Connection (VC) object; these objects are pre-allocated but are not connected during initialization. Each VC stores the target endpoint's state and the designated MPI Send or Receive message queues on a per SCTP stream basis (Figure 3.2). Each endpoint can have multiple SCTP streams for Send/Receive and each SCTP stream represents a separate connection. The number of streams is adjustable at compile time and we reserve stream 0 for control information transmission when necessary.
When one of the streams is set to connected, the VC state is also set to connected. Our current implementation adjusts the endpoint's specific stream state upon a receive.

¹ A business card stores the identity of each MPI process for resource discovery.

Within MPICH2, there is a special type of message called the connection packet. The connection packet contains the source endpoint's hostname, process group ID and rank. In order to initiate a connection, the sender is required to send a connection packet on the specific stream before transmitting any other MPI messages. Upon receiving the connection packet, the endpoint can use the included data to locate the proper target endpoint's stream states and update them correctly.

The disadvantage of our current implementation is that the connection packet is not necessarily small (it ranges from 40 to 70 bytes). Moreover, we send one on every stream, because SCTP provides full ordering within a stream but not across streams, and we need to guarantee that the connection packet is the first message the target endpoint receives on a particular stream. There are certainly optimizations that avoid sending a connection packet for every stream. For example, one could set a threshold value that limits the number of initialized streams. This would allow the sender to stop sending connection packets for uninitialized streams once it hits the threshold.

We believe that with dynamic connections, hardware and software resources can be managed more efficiently. With the use of SCTP-UDP style sockets, the connection process is also simplified because the middleware manages less state and does not need to call connect and accept explicitly.
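
The per-stream connection bookkeeping described in this section can be sketched in C as follows. This is only an illustration under stated assumptions, not the MPICH2 code: the names vc_t, vc_init, vc_needs_conn_pkt and vc_on_conn_pkt, the stream count of 10, and the sender-side conn_pkt_sent flags are all hypothetical and were introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_STREAMS 10  /* hypothetical value; fixed at compile time */

typedef enum { VC_UNCONNECTED = 0, VC_CONNECTED } vc_state_t;

/* Hypothetical, simplified stand-in for an MPICH2 Virtual Connection. */
typedef struct {
    vc_state_t state;                          /* connected once any stream is */
    bool       stream_connected[NUM_STREAMS];  /* updated upon a receive */
    bool       conn_pkt_sent[NUM_STREAMS];     /* assumed sender-side flag */
} vc_t;

/* VCs are pre-allocated but start out unconnected. */
void vc_init(vc_t *vc)
{
    memset(vc, 0, sizeof *vc);
}

/* Sender side: returns true exactly once per stream, when a connection
 * packet must be sent ahead of the first MPI message on that stream. */
bool vc_needs_conn_pkt(vc_t *vc, int stream)
{
    assert(stream >= 0 && stream < NUM_STREAMS);
    if (vc->conn_pkt_sent[stream])
        return false;
    vc->conn_pkt_sent[stream] = true;
    return true;
}

/* Receiver side: a connection packet arrived on `stream`. Mark the stream
 * connected; the VC counts as connected as soon as one stream is. */
void vc_on_conn_pkt(vc_t *vc, int stream)
{
    assert(stream >= 0 && stream < NUM_STREAMS);
    vc->stream_connected[stream] = true;
    vc->state = VC_CONNECTED;
}
```

In this sketch a sender would call vc_needs_conn_pkt before each send and, when it returns true, transmit the connection packet on that stream first; ordering within the stream then ensures the receiver processes it before any MPI message on the same stream.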

3.2.2 Message de-multiplexing and SCTP events

Message de-multiplexing needed to be redesigned, since each MPI process only has one global SCTP-UDP style socket for all general purpose communication. Each endpoint is now mapped to an SCTP association that is managed by the transport layer. Since an endpoint is represented as a Virtual Connection object in MPICH2, a hash table is used to map SCTP associations to their corresponding VC. In order to utilize multiple SCTP streams within an association, we assign streams with respect to the context and tag of an MPI message (Figure 3.3). This is a straightforward mapping that obeys the MPI specification, since message ordering is only guaranteed for messages with the same tag and context value. At the same time, we export a function hook to allow a user defined stream assignment procedure.

Figure 3.3: MPI and SCTP semantic mapping (an SCTP association corresponds to an MPI rank; SCTP streams 0...N correspond to the message context and tag)

Each incoming message is de-multiplexed using the pair {association ID, SCTP stream number} and is enqueued into the queue for that stream belonging to a particular VC. In order to retrieve the SCTP stack level information for message de-multiplexing, it is necessary to register for the events of interest with the SCTP stack by setting socket options. The only metadata that we are currently interested in is the association and stream value of each incoming message. The SCTP stack populates this metadata structure upon performing a receive.

Our current implementation statically allocates a message queue per stream, resulting in higher memory consumption than needed since not all of the streams are used at runtime. Currently, the message queue itself is implemented as a linked list, as only a pointer to the head element is required. If the scalability of stream management is a concern, a more dynamic stream queue management scheme can be adopted.

3.2.3 Progress Engine

Each CH3 device requires a progress engine to transition the middleware's abstraction states. The states managed in our implementation consist of VC and MPI request states. VC states represent the connection state of each endpoint in MPI_COMM_WORLD. An MPI request state represents the completeness of an application level message and is divided into two general categories: read and write requests. MPI request management includes message matching done at an upper layer, while write requests are passed down to the CH3 device for transmission.

In order to map from socket level events to the application level, we introduce the notion of an application level event queue. Each successful socket event, either read or write, is converted to a corresponding application level event that results in a state transition for a particular object, VC or MPI request.

The engine is also responsible for reading incoming messages from the socket buffer and transmitting pending send requests that have been queued by the CH3's base primitives. These routines are represented by the Read and Write logic blocks respectively, as shown in Figure 3.4. The read logic performs sctp_recvmsg into a global buffer and obtains the corresponding message metadata, such as the association ID and stream number, that is necessary for message de-multiplexing. The write logic retrieves queued MPI requests from a global send queue and tries to send them on the specified stream for a particular endpoint. Each global send queue element is only removed from the queue when it has been sent completely. Notice that the read logic is placed before the write logic to ensure that the SCTP-UDP style socket will not cause a deadlock within the SCTP stack.

Figure 3.4: Progress Engine design
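
The read-before-write structure of the progress engine, together with the {association ID, stream} de-multiplexing it relies on, can be sketched as below. This is a simplified model under stated assumptions rather than the ch3:sctp code: no SCTP call is made (the caller supplies the metadata that sctp_recvmsg would report), per-stream queues are reduced to counters, the send queue does not wrap around, and every name (assoc_insert, assoc_lookup, read_step, write_step, progress_once) and constant is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VC      8
#define NUM_STREAMS 4
#define TABLE_SIZE  16u         /* power of two, larger than MAX_VC */
#define SENDQ_CAP   16

/* --- association ID -> VC index (open addressing, linear probing) --- */
typedef struct { int in_use; uint32_t assoc_id; int vc; } slot_t;
static slot_t assoc_table[TABLE_SIZE];

static uint32_t slot_of(uint32_t assoc_id)
{
    return (assoc_id * 2654435761u) & (TABLE_SIZE - 1u);
}

int assoc_insert(uint32_t assoc_id, int vc)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        slot_t *s = &assoc_table[(slot_of(assoc_id) + i) & (TABLE_SIZE - 1u)];
        if (!s->in_use || s->assoc_id == assoc_id) {
            s->in_use = 1; s->assoc_id = assoc_id; s->vc = vc;
            return 0;
        }
    }
    return -1;                  /* table full */
}

int assoc_lookup(uint32_t assoc_id)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        const slot_t *s = &assoc_table[(slot_of(assoc_id) + i) & (TABLE_SIZE - 1u)];
        if (!s->in_use) return -1;            /* empty slot ends the probe */
        if (s->assoc_id == assoc_id) return s->vc;
    }
    return -1;
}

/* --- read logic: de-multiplex on {association ID, stream} --- */
typedef struct { uint32_t assoc_id; int stream; } msg_meta_t;
static int recv_queued[MAX_VC][NUM_STREAMS];  /* stand-in for per-stream queues */

int read_step(const msg_meta_t *m)
{
    if (m == NULL) return -1;                 /* nothing to read this round */
    int vc = assoc_lookup(m->assoc_id);
    if (vc < 0 || vc >= MAX_VC || m->stream < 0 || m->stream >= NUM_STREAMS)
        return -1;
    recv_queued[vc][m->stream]++;
    return vc;
}

/* --- write logic: send from the head of a global send queue --- */
typedef struct { size_t left[SENDQ_CAP]; int head, tail; } send_queue_t;

void write_step(send_queue_t *sq, size_t budget)
{
    if (sq->head == sq->tail) return;         /* queue empty */
    size_t *left = &sq->left[sq->head];
    size_t n = (*left < budget) ? *left : budget;
    *left -= n;
    if (*left == 0) sq->head++;               /* dequeue only when fully sent */
}

/* One progress iteration: read logic runs before write logic. */
int progress_once(const msg_meta_t *incoming, send_queue_t *sq, size_t budget)
{
    int vc = read_step(incoming);
    write_step(sq, budget);
    return vc;
}
```

Here `budget` models how many bytes the socket accepts in one iteration, so a 100 byte request with a budget of 64 stays at the head of the queue after the first call and is removed on the second.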

This pipelined design allows us to add more stages to the progress engine if necessary, and this is where the iWARP integration for the progress engine occurs. These changes are described in Section 3.4.3.

3.2.4 Multihoming

In this section, we discuss how the multihoming feature of SCTP is supported within the MPI middleware to provide additional fault tolerance and performance improvement. Our implementation allows a user to provide a configuration file to specify which network interfaces the SCTP-UDP style listener and iWARP should bind to. This selective binding capability allows the user to separate MPI traffic from other control traffic.

Ideally, we would like to have all the network traffic over SCTP. However, the network connection code for the MPI daemon was not modified in our implementation because there is no portable Python binding to native SCTP sockets and most of the MPICH2 utility functions are written in Python. Nonetheless, with the use of a configuration file, we can separate the network traffic if necessary.

The configuration file path is defined as an environment variable on each node and the same file can be shared between all MPI processes. During initialization, the file is parsed and the SCTP-UDP style socket is bound specifically to the IPs provided within the configuration file via SCTP's extended sctp_bindx call. Currently, each line of input is a hostname followed by one or more IPs, delimited with spaces. This is sufficient for CMT to operate, as it does not require any static network characteristics, like latency and bandwidth, to be provided in advance. If the configuration file is missing, we bind to the same IP address as the daemon.

Future work for CMT is to allow the application to provide feedback to the SCTP stack to schedule messages with respect to its requirements.
For instance, we can inform the SCTP stack to send certain application messages on the lowest latency link. CMT currently only schedules messages to maximize bandwidth [13].

3.2.5 Process Spawning

MPI-2 introduces new API calls to perform process spawning (MPI_Comm_spawn). This is particularly useful if an application requires more processing power, as it allows for the dynamic allocation of more MPI processes that transparently join their MPI peers. In the case of client-server systems and task-farming jobs, more workers can be created or destroyed on demand; this can potentially utilize the systems more efficiently and increase the scalability of user applications. During the process spawning operation, an intercommunicator is created between the new MPI process and its parent process. After the new MPI process is spawned, it is free to join any other MPI communicators and this is called the "connect" phase. The procedure is greatly simplified by our dynamic connection support using an SCTP-UDP style socket, because our dynamic connection is uni-directional and does not require the target endpoint to acknowledge the message before a virtual connection can be formed. As a result, MPI level connections are formed when a send request is triggered, and the SCTP stack guarantees that an association is created between the endpoints.

3.2.6 Limitations

During initial development we decided to focus on correctness over performance. In this section, we discuss some of the limitations of our current implementation that uses an SCTP-UDP style socket [41] and propose some possible solutions.

• Polling versus Simple Receive
When each MPI process only has one SCTP-UDP style socket, it makes sense to read from the socket directly during each progress wait iteration rather than performing a poll/select to check for incoming events. However, not all reads will return data, and the empty reads consume CPU cycles.
It may be possible to be more intelligent by blocking and unblocking the socket dynamically.

• Memory copying
In our current implementation, we read everything into a global buffer and later copy it to its proper place. MPI messages have an envelope² attached (except when the data is part of a long message). The envelope is used to find the corresponding read request, if there is one. Instead of sending the envelope and message data separately, a short message is transmitted together with its envelope in a single send call to minimize the network latency, and the receiver can do a full read at once. Clearly, there is a tradeoff between the number of system calls made and the number of memory copies. This may be insignificant for small messages, but as message size increases, memory bandwidth may become a bottleneck. This is a consequence of the SCTP-UDP style socket, with which one does not know where the next message is coming from.

² Recall that an envelope is MPI level metadata that describes the attached MPI message.

There are ways to minimize the memory copying overhead, namely to use more optimized memory copy routines, to limit the amount that is copied, or to extend the SCTP API so that polling on associations is possible. A more sophisticated design, which requires changes in both the SCTP stack and the MPI middleware, is to allow SCTP to perform a receive and a peek of the next message in a single receive call, or even a batch read³. Recall that an MPI message needs to be matched using the attached envelope; with that, the future peek allows the middleware to set up the receive buffer in advance and thus eliminate the memory copy. However, not all future peeks will succeed, because the next message may not have arrived at the kernel yet. On the other hand, a batch read allows the middleware to minimize the number of receive calls, but increases the frequency of memory copying.
Both design changes to the SCTP stack may not be general enough for applications other than MPI over an SCTP-UDP style socket.

³ This suggested change was made to the SCTP standard and is implemented in FreeBSD's SCTP stack.

• Message Fragmentation
Since SCTP is message oriented, it is not possible to send a message that is larger than the send socket buffer. Therefore, a large MPI message needs to be fragmented manually, and a threshold value smaller than the send buffer size needs to be selected. In the final design, we chose 64 Kbytes, which is also the boundary between long and short messages for MPI. This adjustable value was chosen to minimize the number of message fragments and the time required to perform a memory copy at the receiver. MPI also supports the use of message I/O vectors for user-defined datatypes; this complicates our design when the sum of all I/O vector elements is larger than the threshold. In the case of a large message comprised of multiple I/O vector elements, we take a simple approach and send them one by one with respect to the threshold. This is certainly not optimal because of the added system calls needed to transmit each of the I/O vector elements. An alternative is to modify the I/O vector to have a sum that is equal to the threshold. However, both approaches highlight a shortcoming of SCTP's support for MPI: when dealing with large MPI messages, one would want the transport layer stack to be byte-oriented. An extension to SCTP called Explicit End-Of-Record would allow the user to explicitly mark the end of a message to inform the stack. This would certainly be useful for large message transfers, but was not attempted in this design.

• MPI_ANY_TAG
MPI often requires only partial ordering with respect to the message's TRC (tag, rank, and context), with one exception, which is the use of MPI_ANY_TAG.
MPI_ANY_TAG allows an MPI endpoint to receive a message with a wildcard match on the tag value. Though wildcard use with a receive is useful when writing an MPI application, it complicates our channel design with SCTP. Recall that our implementation utilizes SCTP streams by mapping them to the MPI message tag and context. As a result, messages sent on different streams can overtake each other; this is the intended behavior to avoid potential head-of-line blocking. On the other hand, MPI_ANY_TAG semantics require the endpoint to receive messages in the order of the corresponding sends, and this can be difficult to achieve because SCTP orders messages only within a stream and not across streams (Figure 3.5). One possible solution is to include an application level sequence number or timestamp within the message envelope to perform matching, but this would require adding a separate code path to handle this special case. Another alternative would be to assign one of the SCTP streams as the control channel and send all message envelopes on that stream to serialize the order at the receiver. This gives the envelopes a total ordering on arrival at the endpoint while the data still flows through different streams. The reassembly of the message may be complicated when the data arrives earlier than the envelope itself.

Figure 3.5: MPI_ANY_TAG issue. The sender issues Isend(TAG=1), Isend(TAG=3) and Isend(TAG=5) over uni-directional SCTP streams; the receiver posts TAG1 = Recv(MPI_ANY_TAG), TAG2 = Recv(MPI_ANY_TAG) and TAG3 = Recv(MPI_ANY_TAG). MPI_ANY_TAG requires total ordering of sends, so the correct output is (TAG1, TAG2, TAG3) = (1, 3, 5); possible incorrect outputs with our use of SCTP streams are (TAG1, TAG2, TAG3) = (3, 5, 1) and (TAG1, TAG2, TAG3) = (5, 1, 3).

• Multithread support
The MPI specification allows multiple user application threads to be created.
However, whether the use of threads is supported depends on the middleware; this can be determined by querying the middleware during MPI initialization⁴. Supporting multiple user threads essentially allows concurrent access to the progress engine. Currently, our implementation does not support the use of threads, as most of the internal data structures are not protected, including the global event queue and send queue. In order to provide full support for multithreading (assuming the MPI middleware needs to guard every MPI related data structure), locks need to be inserted in all queue structures and in any potentially shared objects such as an MPI_Request. This would undoubtedly add considerable overhead to the runtime, as the locks must be acquired and released. As a result, we decided not to tackle this problem and focused on implementing as many MPI features as we could for the channel device.

⁴ MPICH2 CH3 Thread Support: http://www-unix.mcs.anl.gov/%7Ebalaji/support.html

3.3 iWARP Implementation

iWARP is an RDMA protocol for wide area networks. Our implementation is based on the Ohio Supercomputing Center's (OSC) software iWARP stack⁵ and is extended with more capabilities [16]. The OSC's iWARP stack is implemented with respect to the RFC specifications and has been shown to be compatible with actual iWARP capable hardware. In this section, we discuss some of the design changes that were made to add more flexibility and functionality to the software stack. The OSC's software iWARP stack is denoted as OSC-iWARP.

⁵ OSC iWARP project site: http://www.osc.edu/research/network_file/projects/iwarp/iwarp_main.shtml

3.3.1 Direct Data Placement (DDP) over SCTP

In this section, we discuss the implementation details of DDP over SCTP.
The main benefit of using SCTP as the underlying reliable transport layer protocol is that the protocol itself already supports many of the criteria that DDP requires. Recall that DDP is an end-to-end protocol and each DDP segment is self-describing and self-contained. SCTP is a natural candidate because it is message oriented, unlike TCP, for which MPA was introduced as an extension to address TCP/IP's shortcomings in message framing and its insufficient checksum. Furthermore, the MPA layer within the software stack is fairly heavyweight; it contains the majority of the code for the iWARP implementation. Developers have been trying to implement MPA as a separate layer on top of TCP/IP, but its performance is suboptimal because of additional copying from layer to layer. In order to solve the resulting performance problems, another strategy is to insert the MPA changes directly within the TCP/IP stack. While this improves the overall performance of MPA-TCP/IP, it also sacrifices the backward compatibility of the TCP stack.

The IETF draft proposed by Caitlin and Randall [12] introduces an adaptation layer to bridge between DDP and SCTP. Part of its design includes defining DDP related chunks within SCTP for connection setup and DDP data transfer. Moreover, it requires the use of SCTP's unordered message delivery. This is certainly a more complete design than our implementation. Nonetheless, our implementation uses the existing data chunk and utilizes the SCTP features as well. We also believe that DDP itself is a fairly thin layer when the underlying transport layer can handle the majority of the complexity. As a result, the DDP layer within the OSC-iWARP is implemented directly on top of SCTP.

Recall that a negotiation exchange is required for both iWARP capable endpoints to exchange key pairs (for security reasons), network hardware information, MPA marker locations (in the case of TCP), and other information. Our implementation currently disables the negotiation stage during connection setup for simplicity of design.

3.3.2 DDP header

Since SCTP provides message framing by default, it gives the application layer protocol more flexibility in choosing the segment size. This also allows us to remove two bytes of MPA related fields from the DDP header for both the tagged and untagged models. These two bytes represent the DDP segment length; with MPA, this information must be included with each DDP header to allow the receiver to function correctly. However, this is not required for SCTP since message boundaries are preserved. Even though the number of bytes saved is modest, SCTP handles messages better than TCP by managing all the complexity of message framing.

3.3.3 DDP to SCTP semantic mapping

Recall that DDP provides two models of message transfer, untagged and tagged. An untagged message is analogous to the traditional send/receive, and each DDP untagged segment is defined by a set of values: queue number, DDP level message number and buffer offset. The tagged model allows a DDP segment to be copied to the designated user buffer directly. Therefore, each DDP tagged segment contains a steering tag and an offset to match the receiver's buffer.

In order to utilize multiple SCTP streams, we use a simple yet intuitive mapping: the steering tag determines the stream for the tagged model, and the message number determines it for the untagged model. The software iWARP's DDP layer assumes that DDP segments arrive in order and that the last segment to arrive is the end of the message. However, this is not required, as one can change the DDP layer to support unordered delivery of each DDP segment; this can be supported with SCTP as well by using multiple streams and the MSG_UNORDERED message flag.
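
A minimal sketch of the stream selection described above is given below. The text only states which field selects the stream for each model; reducing that field modulo the number of streams is an assumption made for this example, and the names ddp_segment_t and ddp_pick_stream are hypothetical rather than taken from OSC-iWARP.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DDP_UNTAGGED, DDP_TAGGED } ddp_model_t;

/* Hypothetical view of the fields relevant to stream selection. */
typedef struct {
    ddp_model_t model;
    uint32_t    stag;  /* steering tag, used by the tagged model */
    uint32_t    msn;   /* message number, used by the untagged model */
} ddp_segment_t;

/* Pick the SCTP stream for a DDP segment: steering tag for tagged
 * segments, message number for untagged ones. The modulo reduction
 * is an assumption of this sketch. */
uint16_t ddp_pick_stream(const ddp_segment_t *seg, uint16_t num_streams)
{
    assert(num_streams > 0);
    uint32_t key = (seg->model == DDP_TAGGED) ? seg->stag : seg->msn;
    return (uint16_t)(key % num_streams);
}
```

With this choice, all segments that target the same steering tag (or that belong to the same untagged message) travel on one stream and therefore keep their relative order, while segments for different buffers or messages may use different streams.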

Currently, our DDP over SCTP uses the SCTP-TCP style socket for the underlying connections. This decision was made to preserve the one-to-one relationship between queue pairs and to minimize the changes to the software iWARP stack. We believe that the use of the SCTP-UDP style is also feasible, but it would require a more extensive redesign of the send and receive logic. In addition, a more complete implementation would need to replace the use of file descriptors with association IDs to avoid an internal lookup within the SCTP stack.

3.3.4 Asynchronous support

The OSC-iWARP is mostly synchronous by default; as a result, all Verbs calls block except for read requests. This blocking behavior is not appropriate for MPI semantics since it potentially stalls the middleware and prevents it from progressing multiple outstanding requests simultaneously. Due to this limitation, we extended the stack to provide the intended asynchronous behavior. The OSC-iWARP exports two main functions: one for posting send work requests and one for posting receive work requests. There are two alternatives to provide asynchronous support:

• Single threaded
A single threaded design requires each socket related system call to be nonblocking. As a result, when a work request is posted to the RDMA layer, it may not be completed immediately and must be probed later to check for completion. Work request states need to be saved and updated accordingly. This design is analogous to the MPI middleware's progress engine, which needs to process upcoming requests. The disadvantage of this approach is that everything stays queued until the progress function is called, which potentially increases the delay in processing each work request.

• A progress thread
We decided to introduce a progress thread to simulate the iWARP capable network device.
Each work request is enqueued on a specific queue and the progress thread processes the requests in the same order. Shared data structures such as the completion queue require a mutex when inserting or removing a completion entry. Moreover, work requests can now proceed as soon as possible without waiting.

In our current implementation, the progress thread is created at the RDMAP layer and it does not handle read requests. iWARP requires each successful read to have a corresponding buffer. However, if our progress thread handled read requests, it is possible that the socket buffer would have data but the progress thread would not know where to place it, because the read request has not been posted yet. This is equivalent to unexpected messages in MPI; it requires a more complicated buffering mechanism inside the iWARP stack and potentially performs redundant work when integrated with the MPI middleware. As a result, we decided not to have the progress thread progress read requests. Instead, read requests are progressed when iWARP level polling is called.

One of the disadvantages of a single progress thread design is that queue pairs not related to each other may be blocked. However, this can be solved by introducing a thread per queue pair or threads at a lower level such as DDP. Our use of a progress thread is merely to provide asynchronous support for the exposed API and to simulate an iWARP hardware device. As a result, performance can be suboptimal at times. Moreover, by introducing threads to the stack, data structures also need to be locked and unlocked appropriately in order to avoid race conditions.

3.3.5 User level API, Verbs

The OSC-iWARP exports a set of functions and semantics, called Verbs [4, 26], that make use of the RDMAP layer (Figure 3.6). The API is very similar to the InfiniBand API and the Connection Management feature is not included (Table 3.1).
A new version of OSC-iWARP supports the OpenFabrics API [2], which is an industry standard API for RDMA based devices.

Figure 3.6: iWARP semantics and modules

Verbs                            Description
iwarp_rnic_open/close            Open/Close an RNIC handle
iwarp_rnic_advance               Advance call to progress read requests without creating a completion entry
iwarp_qp_create/destroy          Allocate/Destroy a Queue Pair
iwarp_qp_post_sq                 Post a send work request on behalf of the QP (SEND/RDMA_WRITE/TERM)
iwarp_qp_post_rq                 Post a read work request on behalf of the QP (RECV/RDMA_READ)
iwarp_qp_active_connect          Actively connect to a QP (client mode)
iwarp_qp_passive_connect         Passively connect to a QP (server mode)
iwarp_qp_register_fd             Directly register a file descriptor (fd) with the iWARP layer (either server or client mode)
iwarp_qp_disconnect              Disconnect a QP
iwarp_cq_create/destroy          Create/Destroy a completion queue
iwarp_cq_poll/_block             Poll a completion queue
iwarp_pd_allocate/deallocate     Allocate/Deallocate a Protection Domain (PD) ID
iwarp_nsmr_register              Memory registration, returns a steering tag and a memory region
iwarp_deregister_mem             Unregister a memory region
iwarp_deallocate_stag            Deallocate the steering tag, does not free nor de-register memory
iwarp_create_sgl                 Create/initialize the scatter gather list (SGL)
iwarp_register_sge               Register/Insert the scatter gather entry (SGE) into an SGL

Table 3.1: OSC-iWARP Verbs (Function calls extended from OSC-iWARP are underlined)

In the semantics of the Verbs used by OSC-iWARP, endpoints are connected via Queue Pairs (QP), and send/receive or write/read requests are defined as work requests posted via iwarp_qp_post_sq / iwarp_qp_post_rq. Each QP is attached to two Completion Queues (CQ) for completion notifications: send and receive respectively. There are primarily two types of work requests that iWARP supports: Channel based (Send/Receive) and Memory based (Write/Read).
Send/Receive is the traditional copy-in/copy-out model, while Write/Read allows the direct transfer of the message to user specified buffers. There is also the notion of a Protection Domain (PD), which acts as a context for the registered buffers. The iWARP specification does not define the memory registration (pin/unpin) mechanism. The traditional Send/Receive requests map to the underlying untagged model in DDP, and RDMA Write/Read requests map to the tagged model. Each work request is uniquely identified by a request ID and, depending on the type of work request, it is necessary to specify sufficient information, including buffer information in the form of a Scatter Gather Entry (SGE) and steering tags (for RDMA Write/Read). For each work request, one can also optionally request a completion signal. Polling is performed on the completion queue (CQ) via iwarp_cq_poll and, for scalability, a given CQ can be shared among many QPs.

3.3.6 Limitations

• Polling from completion
The poll function provided by the OSC-iWARP Verbs does not allow batch polling. As a result, each execution of iwarp_cq_poll returns only one iWARP level completion entry. However, this can easily be extended by changing the API or by exposing the CQ structure to the application.

• iWARP level Connection Management
OSC-iWARP Verbs provide only synchronous versions of connect and accept, and this is not sufficient for MPI or other applications that may require a nonblocking connection setup. As a result, we extended the passive (or active) connect to allow the iWARP layer to directly register a file descriptor regardless of the transport layer protocol (iwarp_qp_register_fd); this emphasizes the flexibility of a software iWARP stack. By directly registering a file descriptor with the iWARP layer, users can potentially create their own connection framework.
One future extension for this function is to allow users to specify the mode of operation (TCP or SCTP) so that the DDP layer can adapt dynamically to permit the coexistence of both transport layer protocols.

3.4 MPICH2-Hybrid

In this section, we discuss our experience in integrating the software iWARP module within our ch3:sctp device and some of the design choices. The final hybrid CH3 device allows a mixed use of network devices for message transfer based on the higher level functional requirements (see Figure 3.7). Moreover, the introduction of iWARP should not interfere with the operation of the original ch3:sctp channel. Recently, MPICH2 with iWARP integration has been described in [46]. Our implementation differs in that there is no connection management framework, a different Verbs API is used, and we are using SCTP. As a result, the challenges are different.

3.4.1 Memory Management

The correct use of an RDMA capable device requires proper management of memory. In order to receive any message, memory needs to be pre-posted and de-allocated dynamically. In our current implementation, we allocate a 32 Kbyte buffer for each VC during initialization and this buffer is registered with the iWARP layer. The primary use of this buffer is for untagged transfer of general purpose messages such as MPI envelopes or acknowledgments. The memory information is also exchanged during connection setup. As a result, this buffer can also be used for the tagged model when necessary.

Figure 3.7: Component view of ch3:hybrid

3.4.2 Connection Management

Each MPI process holds an RNIC handle to the simulated hardware exposed by the software iWARP stack. All the Verbs components (Figure 3.6) are allocated with reference to the RNIC handle.
In order for iWARP capable endpoints to communicate with each other, at least one QP must be allocated for each pair of endpoints. OSC-iWARP does not provide nonblocking connection management routines. As a result, we have decided to implement the connection framework within MPICH2. Furthermore, our connection framework uses the extended iWARP connection API to directly register a file descriptor regardless of the transport layer protocol; this emphasizes the flexibility of a software iWARP stack. This allows us to exploit possible optimizations for MPI based connection management. The alternative would be to use the RDMA-CM interface defined by the Open Fabrics Alliance [2]. CMA (Communication Manager Abstraction) is a higher-level CM that operates based on IP addresses. This is a more general solution because it is transport independent and it provides an interface that is closer to the use of sockets, but it is also based on the use of specific verbs that differ from those in our implementation.

In order to prevent the iWARP features from interfering with our original ch3:sctp device, we have allocated another SCTP-UDP style socket for handling RDMA connections. We propose the following two ways to manage iWARP QPs:

• Static
The idea of static connection management is to allocate a QP for each pair of endpoints during initialization. This is analogous to allocating N(N-1) file descriptors (QPs). Moreover, this is the approach of most MPI-RDMA implementations and is acceptable for small groups of processes. However, as the group of processes becomes larger, the initialization time can be significant.
Recall that each MPI process is assigned a global rank during initialization. In order to prevent connection collisions, we only allow each MPI process to connect to peers that have a higher rank than itself (as shown in Figure 3.8).
Hence, each MPI process will be both connecting and accepting during initialization. Barrier calls are inserted to distinguish between the different connection phases.
As we have mentioned, SCTP provides both UDP and TCP style socket semantics for backward compatibility with legacy applications. It is also possible to use one style of socket to connect to the other and vice versa. This allows the user to piggyback application level data on an initialization request. Our connect-all mechanism requires specific use of SCTP's API that is not available in TCP. Therefore, the implementation itself is not general, but the concept is still applicable.
SCTP has a specific call, sctp_peeloff, that allows an association to be peeled off from an SCTP-UDP style socket to form a separate SCTP-TCP style connection. This operation does not require any coordination between endpoints. Hence, we can directly register the newly peeled off association with the iWARP layer through our new QP connection extension call. In the connection phase, the initiator sends a message to the target's SCTP-UDP style listener; the message consists of the initiator's rank and the information about its locally allocated 32 Kbytes of memory. The target endpoint receives the message, peels off the association, sends its own local memory information back to the sender, and registers with the iWARP layer. The initiator then receives the accept-ACK, peels off the association and registers the connection with the iWARP layer. In the end, a one-to-one connection is formed between every pair of endpoints.
It is worth noting that this static approach is different from the connection management in our ch3:sctp, which supports dynamic connection by default. As a result, we are proposing a dynamic approach for QP management as well.

• Dynamic
Dynamic connection management allows the middleware to allocate QPs on demand.
This approach, similar to ch3:sctp, can reduce the number of file descriptors (QPs) being allocated. However, it will increase the communication cost of the first MPI message transfer.
The main challenge in dynamic connection management for QPs is handling collisions. In the case of two processes trying to connect to each other at the same time, we use the MPI globally unique rank in MPI_COMM_WORLD to break ties (Figure 3.9). Notice that collisions do not occur in ch3:sctp because it is based on the SCTP-UDP model and connection is done on a per uni-directional stream basis.
Our implementation requires a pair of endpoints to perform an application level 2-way handshake to exchange connection packets in order to ensure that the one-to-one connection is properly formed and ready for iWARP operations. Since we are directing iWARP connection traffic to another SCTP-UDP style socket, we need to periodically poll the socket for connection requests; this code block is added to the progress engine.

[Figure 3.8: Static connect-all routine.]

Dynamic connection management can help to utilize the hardware resources more efficiently. However, it also adds complexity to the internal design because pending outgoing messages need to be queued while the QP is not yet connected. Moreover, it does not need to be implemented within MPI and can be part of an external connection management module. On the other hand, the static connect-all approach is fairly easy to implement and the MPI messages can simply be posted to the iWARP layer. Overall, the static approach has less state to manage, at the cost of some resources during initialization.

[Figure 3.9: Dynamic connection setup for QPs. Sequence diagram over time between two ranks. Recovered labels: Connect (send connection packet); Connect Request Accepted; Connect Request Discarded; Peeloff, register with iWARP; Connection ACK; App-level Connection formed.]

3.4.3 Progress Engine Extension

Recall that our ch3:sctp device uses pipelining and an application level event mechanism for message progression. An extension to support the iWARP layer is implemented by adding another stage to the progress engine. The pipelining stage for iWARP performs a poll (iwarp_cq_poll) and dequeues an iWARP level completion entry. Currently in our implementation, all QPs share a global completion queue. This is a simple approach that allows the progress engine to have only one CQ to poll for completion. Each completion entry contains the following information: QP entry, type of operation, and amount of data sent or received. Therefore, it can easily be mapped to an application level event.

Our application level event handling routine is also extended to de-multiplex iWARP level events based on the QP and the iWARP request type. Since iWARP denotes endpoints by QP, a hash table was added to map a QP to a VC; the table is populated during connection setup.

Adding another stage to the progress engine imposes a performance penalty. Our implementation tries to minimize the impact as much as possible. One possible technique is to dynamically activate or deactivate the added stage by closely monitoring whether any MPI-2 operations are being executed.

3.4.4 MPI-2 RMA semantic mapping

In this section, we discuss how the MPI-2 one-sided semantics map to iWARP and the corresponding implementation details. MPI-2 defines get, put and accumulate operations for remote memory access (RMA). These operations involve data movement that can be initiated by the action of one process, hence the name remote memory operations (one-sided).
Moreover, the MPI-2 RMA specification does not guarantee any performance improvement over send and receive. It merely serves as an extension to the MPI specification to support RDMA capable hardware.

The MPI-2 calls that we will focus on are MPI_Win_create/free for window creation/destruction, MPI_Put/Get, and the MPI_Win_fence call to synchronize window access (Table 3.2). MPI_Put allows an MPI process to directly put specific data into the target's window memory without any intervention, whereas MPI_Get allows an MPI process to retrieve data from local or remote memory without the target endpoint's attention. We did not implement all of MPI-2's one-sided communication primitives. For instance, we do not cover MPI_Win_start/complete and MPI_Win_post/wait, with which users can perform more scalable window synchronization.

Our implementation is based on MPICH2's one-sided functions with some hooks added to intercept the data transmission. Even though it is not a complete rewrite of the RMA functions, it is sufficient to demonstrate how different communication and internal code paths can be used to explicitly separate MPI-1's point-to-point operations from MPI-2's one-sided routines.

Table 3.2: MPI-2 RMA function prototypes

    int MPI_Put(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)
    int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)
    int MPI_Win_fence(int assert, MPI_Win win)

Before RMA operations can take place, a window object that consists of all the distributed window information must be created with MPI_Win_create. In order to utilize DDP's tagged model, window memory needs to be registered with the iWARP layer during the window creation process.
Window creation is a collective operation in which all the participants exchange information about their local memory window, if there is any. The information typically consists of the buffer location, length, and offset. In our case, we also include the steering tag of the local memory window for future use. At the end of the process, each participant has knowledge of the others' memory windows, which together form a set of remote distributed memory.

A memory window also needs to be destroyed properly and collectively with MPI_Win_free. During the process of destroying a window object, the pending RMA operations are completed before exiting. Moreover, each participant needs to unregister its local memory window with the iWARP layer.

MPICH2 adopted a delayed approach to the window synchronization call, MPI_Win_fence. For instance, MPICH2 tries to queue up all the put/get operations until a later fence call occurs. This method is shown to be more latency tolerant than using simple barrier calls; the delayed approach is more appropriate for TCP and SCTP, where latency is relatively high [22]. However, the performance of the window synchronization can be highly dependent on the underlying hardware. For instance, there is shared-memory hardware that is capable of performing barrier calls more efficiently than the reduce-scatter collective call that is used in the delayed approach.

The window fence call is a collective operation for all participants in the window creation. During the first stage, an MPI process gathers the total number of window operations for all the participants via an MPI collective call. The total count of RMA operations is important because it is used for allocating MPI requests (read/write) and for fence termination. Next, it processes all of the queued put, get and accumulate operations; it is at this point that we intercept the calls to perform iWARP specific operations.
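The counting done in the first fence stage can be illustrated with a small sketch in plain C. It is not MPICH2 code: the function name incoming_ops_for and the flat row-major matrix layout are assumptions made for illustration, and the loop only models the result that a reduce-scatter style collective would deliver to one process, namely how many queued RMA operations target its window.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: ops is an nprocs x nprocs matrix stored row by row,
 * where ops[src * nprocs + dst] is the number of put/get operations that
 * process src has queued for the window of process dst since the last fence.
 * The return value is the column sum for dst, i.e. the number of operations
 * that process dst must see completed before its fence can terminate.
 */
int incoming_ops_for(size_t nprocs, const int *ops, size_t dst)
{
    assert(ops != NULL && dst < nprocs);

    int total = 0;
    for (size_t src = 0; src < nprocs; src++)
        total += ops[src * nprocs + dst];
    return total;
}
```

For example, with two processes where rank 0 has queued three operations for rank 1 and rank 1 has queued five operations for rank 0, the function returns 5 for dst = 0 and 3 for dst = 1.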
The total count of RMA operations is updated throughout the process. The final stage is for the progress engine to wait until all of the RMA operations are completed and the allocated resources are freed accordingly.

MPICH2 performs MPI_Put by bundling an envelope with the corresponding data, and this MPI message is then transmitted during the later fence call. Though this can easily be supported by our ch3:sctp with its point-to-point primitives, there is a mismatch between the application level semantics and the lower level primitives. As a result, we wish to provide a more direct mapping for MPI_Put by utilizing the iWARP verbs.

Recall that each MPI message has an envelope attached to the actual data. In the case of MPI_Put, the message envelope contains information about the target window's buffer location, offset and length. In order to avoid copying the envelope to the target's memory window, and to utilize both DDP's tagged and untagged models, we decided to decouple the envelope from the data. As a result, the data is sent by using the tagged model, which provides a direct semantic mapping by posting an RDMA_Write request to iWARP (Figure 3.10). Recall that the tagged model requires the use of a steering tag; this information can be retrieved from the local window object. After a data send request is completed, the envelope is sent to the target by using the untagged model, and it serves as a notification to the target that its memory has been updated and is ready for access. The completion notification is necessary because RDMA_Write is completely transparent to the target; this is the intended behavior in the RDMA specification.

[Figure 3.10: MPI_Put via RDMA_Write during fence call. Sequence diagram over time between Rank 0 and Rank 1 at the MPI and iWARP levels. Recovered labels: MPI_Put msg (PUT OP, Data); DONE!]

In the case of MPI_Get, a local process directly reads data from a local/remote window, and this can be mapped to an RDMA_Read or an RDMA_Write request depending on which side is responsible for carrying out the operation. There is a minor complication: the local buffer that is to receive the incoming data may be a different buffer from the local memory window. Therefore, before we issue an RDMA_Read/Write request, we need to register the local buffer memory whenever the buffers differ.

We now discuss how MPI_Get can be implemented with either RDMA_Read or RDMA_Write:

• RDMA_Write
A straightforward way to perform MPI_Get via RDMA_Write is to send the MPI envelope to the target, with the steering tag as the application data, by using the untagged model. The steering tag is needed to inform the target endpoint where to perform the RDMA_Write, because the MPI_Get buffer is not necessarily the same as the window. Upon receiving the application metadata at the target endpoint, the receiver can perform an RDMA_Write back to the sender by using the information from the global memory window and the metadata message. After completing the RDMA_Write request, it sends an MPI_Get response envelope back to the sender to serve as a completion notice (Figure 3.11).

[Figure 3.11: MPI_Get via RDMA_Write during fence call. Sequence diagram over time between Rank 0 and Rank 1 at the MPI and iWARP levels. Recovered labels: MPI_Get msg (GET OP); Send GET OP; Use Metadata to find the target buffer; MPI_Get Resp. (GET Done); Send; iWARP Notification; DONE!]

• RDMA_Read
RDMA_Read requests require both the source and target steering tags to be specified. In the underlying design, the iWARP stack handles RDMA_Read by first issuing an untagged message containing the request's metadata to the target (Figure 3.12).
The target endpoint's iWARP stack then issues a tagged message transfer of the requested data directly back to the source buffer. After the completion of the RDMA_Read request, we use the same approach as for MPI_Put: we send the MPI envelope as an untagged message to the target to serve as a completion notice.

[Figure 3.12: MPI_Get via RDMA_Read during fence call. Sequence diagram over time between Rank 0 and Rank 1 at the MPI and iWARP levels. Recovered labels: MPI_Get msg (GET OP); RDMA Read Metadata (Src, Dest stag); Use Metadata to find the target buffer; Notification; DONE!]

The only difference between these two approaches is that, for RDMA_Read, the first untagged message transfer does not notify the MPI middleware because the iWARP stack handles it internally. In the case of RDMA_Write, the MPI layer is responsible for handling metadata messages and performing the RDMA write. As a result, there is a shift in which side is responsible for carrying out the actual message transfer. The number of messages transferred at the iWARP level for MPI_Get is the same for both approaches.

3.4.5 Limitations

The main problem with MPI-2 remote operations is that they are not completely one-sided. They require a global synchronization for each memory window. Moreover, one is not allowed to touch the remote memory before the synchronization call, and this adds significant overhead.

Recall that we integrated the iWARP poll routines into the progress engine, where they perform a nonblocking poll to prevent deadlock with the other progress engine stages. This is not optimal as it wastes CPU cycles and imposes locking overhead within the kernel. A better solution is to set the blocking mode of the iWARP poll dynamically according to the status of MPI-1 requests. However, this might be a nontrivial condition to detect.
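The dynamic choice of poll mode suggested above could look like the following sketch. All names in it (poll_mode, choose_poll_mode and the two counters) are hypothetical and exist in neither OSC-iWARP nor MPICH2. The sketch assumes only that the progress engine can count its outstanding MPI-1 requests and its outstanding iWARP work requests, and it combines the idea with the stage deactivation mentioned in Section 3.4.3.

```c
#include <assert.h>

/* Hypothetical poll modes for the iWARP stage of the progress engine. */
enum poll_mode { POLL_SKIP, POLL_NONBLOCKING, POLL_BLOCKING };

/*
 * Decide how the iWARP stage should poll its completion queue.
 * pending_mpi1_requests:  outstanding point-to-point (MPI-1) requests.
 * pending_iwarp_requests: outstanding iWARP work requests (MPI-2 RMA).
 */
enum poll_mode choose_poll_mode(int pending_mpi1_requests,
                                int pending_iwarp_requests)
{
    assert(pending_mpi1_requests >= 0 && pending_iwarp_requests >= 0);

    if (pending_iwarp_requests == 0)
        return POLL_SKIP;         /* no RMA work: deactivate the stage */
    if (pending_mpi1_requests > 0)
        return POLL_NONBLOCKING;  /* other stages still need service */
    return POLL_BLOCKING;         /* only iWARP work is outstanding */
}
```

The difficulty noted in the text remains: a process may have no pending MPI-1 requests of its own and still need to service unexpected incoming MPI-1 traffic, so a real implementation would need a stronger condition than these two counters before blocking.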
Our current implementation of MPI_Put/Get does not support the use of complex I/O vectors for user defined datatypes due to lack of support in the iWARP software. We can provide some preliminary support for I/O vectors by converting each I/O vector element to have the proper offset into the target buffer, but there still exist issues with packing/unpacking complex datatypes, which are outside the scope of this work. As a result, we only support the use of pre-defined datatypes, and this prevents us from running complex MPI-2 applications. The experiments conducted in Section 4.2 only make use of primitive types to perform some base evaluations of the MPI-2 one-sided operations.

We are only working with a subset of MPI-2 one-sided operations and there are more synchronization calls that have not been covered in this work. There are also proposals [45] for the MPI-2 RMA specification to be less restrictive in order to allow local completion. In addition, there are often many ways to implement MPI-2 calls, but they are hardware dependent. Hence, performance is a difficult problem to tackle as it involves all of these variables.

Chapter 4

Evaluation

In this chapter, we evaluate some of the components in our hybrid design and provide some preliminary results.

4.1 Experiment setup

Most of the experiments were conducted on an IBM eServer x306 cluster with four nodes, each node with a 3.2 GHz Pentium-4 processor, 1 gigabyte of RAM, and three GbE interfaces. Of the three interfaces, two private interfaces were on separate VLANs connected to a common Baystack 5510 48T switch, and the third (public) interface was connected to a separate switch, similar to the setup shown in Figure 4.1. Jumbo frames of size 9000 bytes are also used in the CMT experiments to exploit the potential GbE performance.
We are using FreeBSD 6.2 with the SCTP stack installed as a patch to the kernel (footnote 1); this is the most feature rich stack, and it includes the CMT extension. In addition, this is also the most actively developed SCTP stack and it has been ported to many other operating systems such as Mac OS. Starting from FreeBSD 7, SCTP will officially be part of the OS distribution. Other implementations exist on Linux (footnote 2) and Windows platforms, and a user level stack (footnote 3) is also available.

Footnote 1: A patched version of the stack (http://www.sctp.org), compiled as a kernel option.
Footnote 2: lksctp: http://lksctp.sourceforge.net/
Footnote 3: User level SCTP: http://www.sctp.de/sctp-download.html

For the tests with SCTP and CMT we used the standard MPICH2 v1.0.5, which comes with the point-to-point channel (ch3:sctp), extended to include the hybrid channel (ch3:hybrid). We tested MPICH2 with and without CMT enabled and compared it to TCP from the same release of MPICH2 (ch3:sock). We also tested Open MPI v1.2.1, which supports message striping across multiple IP interfaces [53]. Open MPI was run with one and with two TCP Byte-Transfer-Layer (BTL) modules to provide a comparison between using one interface and using two interfaces. For all tests we used a socket buffer size of 233 Kbytes, the default size used by ch3:sctp, and we set the MPI rendezvous threshold to 64 Kbytes, matching the default of ch3:sctp and Open MPI.

[Figure 4.1: Cluster Configuration]

4.2 MPI_Put/Get performance

In this section, we test the MPI_Put/Get performance over iWARP and compare it to ch3:sctp. The test program runs between two nodes and measures the time MPI_Win_fence takes to synchronize the one-sided operations, for message sizes ranging from 2 bytes to 4 Mbytes. We measure MPI_Win_fence because it is where the majority of the MPI-2 one-sided time is spent. The result is shown in Figure 4.2.
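To give a sense of the size of such a sweep, the sketch below counts the message sizes visited between a minimum and a maximum. The text does not state how the sizes are stepped, so doubling at each step is an assumption, as is reading 4 Mbytes as 4 x 1024 x 1024 bytes; the function name count_sweep_sizes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the message sizes in a sweep that starts at min_bytes and doubles
 * until the next size would exceed max_bytes (assumed stepping rule).
 */
size_t count_sweep_sizes(size_t min_bytes, size_t max_bytes)
{
    assert(min_bytes > 0 && min_bytes <= max_bytes);

    size_t count = 0;
    for (size_t size = min_bytes; ; size *= 2) {
        count++;
        if (size > max_bytes / 2)  /* doubling again would pass max_bytes */
            break;
    }
    return count;
}
```

Under these assumptions, a sweep from 2 bytes to 4 Mbytes visits 22 sizes, so each curve in such a test would have 22 data points.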
[Figure 4.2: MPI_Put/MPI_Get latency test. x-axis: Message Size (Bytes).]

The first observation is that the SCTP and iWARP performance for both MPI_Get and MPI_Put share a similar trend. Moreover, our software implementation of iWARP adds an overhead ranging from 2% to 8%. Notice also that the software stack performs considerably worse when the message size increases into the Mbytes range. We believe that this is because the DDP layer is fragmenting messages at 16 Kbytes (footnote 4). Even though this number is adjustable, it also depends on the send and receive buffer sizes of the iWARP endpoints. In addition, we could allow DDP to fragment at 64 Kbytes, but the receiver may not be able to receive at the same rate. In fact, we did see a performance decrease when the DDP fragment size was set to 32 Kbytes. Another reason for the difference in performance when switching between threads is that our ch3:sctp performs nonblocking operations by default; consequently, when requests are not finished, the progress engine is polling. Polling is necessary because the engine needs to progress through the iWARP level's requests. However, it also adds overhead when going through the other progress engine stages that are basically idle. We believe that the software iWARP performance can be improved further by adopting a separate thread for reading incoming iWARP level messages. In addition, notice that MPI_Get is typically slower than MPI_Put because of the additional metadata messages being exchanged.

Footnote 4: This adjustable value was chosen to be 16 Kbytes because it gives the best result in our setup.

Our next test is a synthetic benchmark, written from scratch, that combines the use of both MPI-1 and MPI-2 functions. The aim is to show that, by separating these two independent MPI APIs internally, we can increase efficiency. The comparison is between our ch3:sctp and ch3:hybrid.
The benchmark consists of a series of MPI_Isend messages of various sizes together with MPI_Put and MPI_Get calls. MPI_Waitall/Waitany functions and MPI_Win_fence are placed in specific places to allow overlap between MPI-1 and MPI-2 message transfers. Moreover, this test program performs a bi-directional flow of data in which both endpoints are sending to and receiving from each other. The result is summarized in Figure 4.3.

The result shows that, under mixed use of MPI-1 and MPI-2 calls, the hybrid channel can perform better than our original SCTP channel. This is mainly because MPI-2 requests are handled separately by a different component, namely iWARP, in the hybrid channel. In contrast, the SCTP channel treats every type of MPI message the same, regardless of the fact that MPI-1 and MPI-2 messages are independent. As a result, MPI-1 and MPI-2 messages are queued together and can potentially block each other. Moreover, by allowing the iWARP module to handle the MPI-2 messages, we explicitly have two progress engines that are working concurrently, thus increasing the overlap within the middleware.

[Figure 4.3: MPI-1/2 Synthetic test. x-axis: Middleware Configuration (HYBRID, SCTP). Bar labels recovered from the figure: 4.51761 and 3.820722.]

4.3 CMT Related Experiments

In this section, we evaluate the performance when using SCTP with CMT to enable multirailing support. As most computers have multiple network interfaces, it would be advantageous to be able to use all available resources to maximize performance. We performed experiments to test both the performance and the fault tolerance aspects of the design.

The first test we conducted concerned the effect of SCTP's multihoming feature on user applications. Recall that multihoming allows an application to fail over transparently when one of the links is down. The failover process is performed by switching the network traffic from the primary path to the secondary path in the event of link failure. Since CMT is based on multihoming, it inherits the failover property while also providing performance improvements in bandwidth. To simulate link failures, we created firewall rules (footnote 5) within FreeBSD to block specific link traffic at the IP level. The SCTP stack is also tuned to recover within 5 seconds (this can be even shorter, but we chose 5 seconds to distinguish the failover and recovery phases in the result). We also tested a new extension called CMT-PF (Potential Failure for CMT) [35], which includes a new destination state called the "Potentially Failed" (PF) state and enhances CMT's retransmission policies to include the PF state. The test demonstrates how CMT-PF can recover more smoothly than CMT during failures, which can be beneficial for cluster computing.

The program that we use is a modified version of Iperf (footnote 6), a bandwidth measurement tool, that supports SCTP and has been extended to support multiple interfaces. Iperf reports bandwidth values every second and the test runs for 120 seconds in total. During the test, we inject link failures on the server side at specific time intervals while the application is still running; the result is shown in Figure 4.4. One thing to note is that the secondary link is not as fast as the primary link due to differences in hardware; this is also a good indication in the graph of where the switching actually takes place. Another observation is that multihoming switches back to the primary path when it recovers; this is the intended behavior of SCTP, which gives preference to the primary path. In the case of link failures, the bandwidth for multihoming drops to zero for about 2-3 s, while CMT transmits data in short bursts (about 62 Mbps), and CMT-PF can smoothly adapt to link failures by predicting the upcoming failures and switching to the secondary path faster. The average bandwidths over the time period for Multihoming, CMT, and CMT-PF are 911 Mbps, 1.27 Gbps and 1.32 Gbps respectively. Overall, the result is very encouraging: it shows how CMT and CMT-PF can provide both performance and fault tolerance.

Footnote 5: FreeBSD firewall: http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/firewalls-ipfw.html
Footnote 6: Iperf: http://dast.nlanr.net/Projects/Iperf/

[Figure 4.4: Iperf SCTP test. x-axis: Time in Seconds (0 to 120). Annotated events: t=15, primary down; t=35, primary up; t=55, secondary down; t=75, secondary up. Legend entries recovered: Multihoming, CMT-PF.]

The next test determined whether our implementation of MPI over SCTP with CMT was able to exploit the available bandwidth from multiple interfaces. We used the OSU MPI microbenchmarks [36] to test bandwidth and latency with varying message sizes. We tested with MTUs of 1500 and 9000 bytes and found that using a 9000 byte MTU improved bandwidth utilization for both TCP and SCTP. Figure 4.5 reports our results with MTU 9000.

Since there is no multirailing support for MPICH2-ch3:sock, we have decided to include some test results from Open MPI, another popular MPI implementation, for comparison. Open MPI performs application striping at the middleware level regardless of the underlying hardware, therefore allowing us to compare and contrast the different design approaches. Notice that when comparing two different MPI middleware implementations, it is difficult to analyze the results and draw a definite conclusion. We tried to normalize as many parameters as possible, such as socket buffer sizes and the large message cutoff (footnote 7), but the difference in behavior may also be due to differences in the middleware implementation. Consequently, we only provide basic observations on the experiment results.
[Figure 4.5: OSU Bandwidth test. x-axis: Message Size (Bytes).]

Footnote 7: Because the network paths were not identical due to differences in hardware, we set network characteristics in Open MPI's mca-params.conf file.

As shown in Figure 4.5, MPICH2-ch3:sctp with CMT starts to take advantage of the additional bandwidth when the message size exceeds 2 Kbytes. At a message size of 1 Mbyte, MPICH2-ch3:sctp with CMT achieves almost twice the bandwidth of a single interface (1.7 versus 0.98 Gbps), and it outperforms Open MPI (1.56 Gbps), which schedules messages in user space across two interfaces within its middleware. All single path configurations are approximately the same, with a maximum of 0.98 Gbps. For configurations using two networks, Open MPI is the best for message sizes from 512 bytes to 2 Kbytes, but once the message size exceeds 4 Kbytes, MPICH2-ch3:sctp with CMT performs better.

One potential drawback of CMT concerns linear scalability with the number of network interfaces. In the experiments, we only show the potential benefits of using two interfaces. However, with two or more interfaces, the host processor may be saturated with interrupts and checksum calculations. There are ways to deal with these challenges, such as enabling interrupt coalescing (which will increase packet latency), using jumbo frames, and acquiring hardware that performs checksum offload for SCTP. The other requirement for CMT is to design the network topology so that there is no shared bottleneck between network paths, which is essential if CMT is to achieve maximum performance and fault tolerance. Nonetheless, SCTP with CMT allows user applications to enable transport layer striping with very little effort, and this can be very useful for applications that require large bandwidth. More CMT results can be found in our previous work [13].
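The bandwidth figures quoted in this section can be put in perspective with a short calculation. The helpers below are an illustrative sketch with hypothetical names; the only inputs taken from the text are the measured values at a message size of 1 Mbyte (0.98 Gbps for one interface, 1.7 Gbps for CMT and 1.56 Gbps for Open MPI over two interfaces).

```c
#include <assert.h>

/* Ratio of multi-interface bandwidth to single-interface bandwidth. */
double speedup(double multi_gbps, double single_gbps)
{
    assert(single_gbps > 0.0);
    return multi_gbps / single_gbps;
}

/*
 * Fraction of the ideal linear speedup that is achieved, where the ideal
 * is one full single-interface bandwidth per interface.
 */
double scaling_efficiency(double multi_gbps, double single_gbps,
                          int n_interfaces)
{
    assert(n_interfaces > 0);
    return speedup(multi_gbps, single_gbps) / (double)n_interfaces;
}
```

With the values above, CMT reaches a speedup of about 1.73 (roughly 87% of the ideal factor of two), while Open MPI reaches about 1.59 (roughly 80%).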
4.4 General Benchmarks

In this section, we show some of the results for ch3:sctp in comparison to ch3:sock (the TCP implementation of CH3) with a series of MPI-1 micro/macrobenchmarks. In some of the experiments, we also compare our hybrid design with its parent design, ch3:sctp. Recall that ch3:hybrid is based on ch3:sctp and, as a result, it should not impose any major overhead, as it shares the same MPI-1 code paths as ch3:sctp.

4.4.1 Microbenchmarks

The first test shown is a basic MPI latency test. The latency test is performed in a ping-pong fashion in which the sender sends a message with a certain data size to the receiver and waits for a reply from the receiver. The receiver receives the message from the sender and sends back a reply with the same data size. Many iterations of this ping-pong test were carried out and average one-way latency numbers were obtained (Figure 4.6). The MPI calls used in this test are the blocking versions of the MPI functions (MPI_Send and MPI_Recv).

In the next experiment, we measured the bi-directional bandwidth of the MPI middleware. The bi-directional bandwidth test is similar to the bandwidth test in Section 4.3, except that both nodes send a fixed number of back-to-back messages and wait for the reply. In the end, this test measures the maximum sustainable aggregate bandwidth of two nodes; the result is shown in Figure 4.7. This test used the non-blocking versions of the MPI functions (MPI_Isend and MPI_Irecv).

In summary, for these two microbenchmarks, our implementation performs similarly to ch3:sock in both the latency and the bi-directional tests. However, the TCP implementation seems to perform slightly better when the message size is small, as both ch3:sctp and ch3:hybrid suffer slightly higher latency. Overall, the difference is fairly insignificant and is within 5% of each other.
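The two quantities used in this summary, the average one-way latency of a ping-pong test and the "within 5%" comparison, can be written down as a small sketch. The function names are hypothetical, and the sketch assumes that the total elapsed time covers a whole number of round trips and that the percentage is taken relative to a chosen reference value.

```c
#include <assert.h>

/*
 * Average one-way latency of a ping-pong test: each iteration is one round
 * trip, i.e. two one-way transfers of the same message size.
 */
double one_way_latency_us(double total_elapsed_us, int iterations)
{
    assert(iterations > 0 && total_elapsed_us >= 0.0);
    return total_elapsed_us / (2.0 * (double)iterations);
}

/*
 * Returns 1 if value differs from reference by at most percent percent of
 * reference, and 0 otherwise. Written without division so that the
 * comparison is also well defined for very small reference values.
 */
int within_percent(double value, double reference, double percent)
{
    assert(reference > 0.0 && percent >= 0.0);
    double diff = value > reference ? value - reference : reference - value;
    return diff * 100.0 <= percent * reference;
}
```

For instance, 10 round trips that take 2000 microseconds in total give an average one-way latency of 100 microseconds, and a measurement of 104 against a reference of 100 is within 5% while 110 is not.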
Since our hybrid design is based on ch3:sctp, our extensions to the design do impose some minor overhead with all of the integration described in Section 3.4, but we believe this is mainly due to an additional stage being added to the progress engine. Notice that there are drops in both graphs at around 64 Kbytes, which are due to the rendezvous protocol taking place for large messages.

Figure 4.6: OSU Latency test (Less is better) (x-axis: Message Size (Bytes))

Figure 4.7: OSU Bi-directional test (More is better) (x-axis: Message Size (Bytes); series: MPICH2-SCTP, MPICH2-TCP)

4.4.2 Real applications

In this section, we conducted a series of macrobenchmarks that compare the performance of ch3:sctp, ch3:hybrid and ch3:sock. We used the NAS Benchmarks (NPB version 3.2) [15]. These benchmarks approximate the performance that can be expected from a portable parallel application. The suite consists of eight programs written in Fortran, and we evaluate six of them (footnote 8): BT (Block Tridiagonal ADI), EP (Embarrassingly Parallel), LU (LU Factorization), IS (Integer Sort), MG (Multi-Grid Method), and SP (Scalar Pentadiagonal ADI). Furthermore, we focus only on the small and medium data sizes, namely classes A and B. The experiment was run with four nodes and the results are summarized in Figure 4.8.

Footnote 8: The remaining two tests, FT (Fourier Transform) and CG (Conjugate Gradient), had compilation problems with the GNU Fortran 77 compiler.

Figure 4.8: NAS Parallel MPI Benchmarks (Less is better) (one panel per benchmark; bars for TCP, SCTP and HYBRID, each for Class A and Class B)

As shown in Figure 4.8, MPICH2-ch3:sctp is fairly competitive with the TCP channel. In general, TCP performs slightly better than ch3:sctp and the difference is typically within 25%. Our hybrid design does impose some minor overhead on ch3:sctp, which is within 5%. Surprisingly, there are also tests in which our implementations perform better (e.g., MG and IS). In general, the NAS benchmarks in classes A and B are dominated by short messages, and each test uses a specific communication topology [49] (Chain, Token Ring, etc.). For instance, the average MPI message size for the class A BT test is only 957 bytes. In addition to the small message size, most of the tests are not communication intensive. These results do demonstrate that our implementation does not handle short messages as well as TCP. This should be investigated further to determine whether this is due to SCTP or an artifact of our design.

4.5 Summary

Our current software iWARP has a performance issue when message sizes increase to the Mbytes range, and this needs to be investigated further. The possible shortcomings have been listed in Sections 3.3.6 and 3.4.5. Nonetheless, we have shown that by designing the middleware to handle independent MPI communication primitives with different components, application efficiency can be improved.

Over the course of our experiments, the performance of SCTP often varied. In part, this is because the stack is continually under development and we always used the latest version for the most complete features. Hence, it is hard to quantitatively measure the performance difference between TCP and SCTP. We encountered phases in which the stack was undergoing extensive code restructuring.
There were also times when errors were discovered that needed to be fixed, and we worked with the stack developers to sort out the issues. Nonetheless, we have highlighted the implementation limitations of our design in Section 3.2.6. In contrast, TCP has been extensively tested and optimized. Overall, we are fairly satisfied with our ch3:sctp implementation, as it can support most of the MPI-1/2 features and is competitive with TCP.

Chapter 5

Related Work

In this chapter, we discuss various projects and technologies that share some of the same concepts as our hybrid design.

5.1 Hybrid Design Related

Open MPI (footnote 1) is another popular open source MPI implementation that is developed and maintained by a consortium of academic, research, and industry partners. Similar to MPICH2, its design uses a layered approach to separate the MPI semantics from message delivery. As a result, it can support different types of network devices at runtime. One unique feature of Open MPI is its ability to selectively bind to multiple different network components at runtime, unlike MPICH2, in which network device information is determined during the configuration step of installation. This allows Open MPI to schedule fragmented messages across multiple heterogeneous network devices [54]. From a component point of view, the lowest module for message transfer is the Byte Transfer Layer (BTL), and each network device has its own implementation. A BTL component is attached to a BTL management layer (BML) that is managed by the PML (point-to-point management layer), and these are all part of the MPI Component Architecture (MCA) in Open MPI. The BTL itself serves as a byte mover and has knowledge of neither the types of message it is transmitting nor the MPI functions.

Footnote 1: Open MPI project site: http://www.open-mpi.org/
This is different from our hybrid design, in which we define a boundary between MPI function requirements and data striping. We also utilize SCTP as much as we can to defer some of the complexity in the MPI middleware to the transport layer, such as connection management and multirail support for both performance and fault tolerance. In addition, the modular design with dynamic binding in Open MPI allows an MPI program to be compiled once regardless of changes to the middleware.

MVAPICH2 is an RDMA device extension to the MPICH2 implementation from the Ohio State University [32]. This implementation (footnote 2) supports all major RDMA capable devices, such as InfiniBand and iWARP, via various RDMA verbs. The extension is designed to optimize MPI operations for RDMA capable devices in order to deliver performance, scalability, fault tolerance and multirail support. It also includes several features such as RDMA fast path [31], utilizing RDMA operations for small message transfer, and RDMA-based one-sided operations [27]. Moreover, it has been used in many high performance cluster environments.

Hybrid design often refers to the combined usage of MPI and OpenMP [43]. OpenMP is another popular parallel programming model that supports shared memory multiprocessing, and it is an industry standard that is being promoted by Intel for their multi-core architecture. It is based on the idea of multithreading, which enables the runtime environment to maximize parallelism by allocating threads to different processors. A set of preprocessor directives is inserted within the code to indicate the portion that should be executed in parallel, and these are translated by a compiler to produce explicitly parallel code. This is completely different from MPI, in which the user needs to be aware of message passing in order to exploit parallelism.
The limitation of OpenMP is often its scalability, which is constrained by the memory architecture, and it only performs well in a shared memory multiprocessor environment. OpenMP uses preprocessor directives that allow incremental parallelism to be applied to only a portion of the code, and the majority of the complexity is hidden from the users. On the other hand, MPI has been shown to be efficient in exploiting parallelism across systems. However, the performance of this kind of hybrid design has been shown to be highly dependent on the programming models and hardware architectures, such as vector type systems [6, 17].

Footnote 2: Project website: http://mvapich.cse.ohio-state.edu/

5.2 CMT related

The CMT extension is also related to projects such as NewMadeleine [9], MVAPICH [32], and Open MPI that support multihomed or multirailed systems in the middleware. However, these projects focus on low latency network architectures such as InfiniBand, and not just IP networks. NewMadeleine and Open MPI both support heterogeneous networks; CMT can be used to combine all IP interfaces, which in turn can be used in combination with these heterogeneous network solutions. MuniCluster [33] and RI2N/UDP [37] support multirailing for IP networks at the application layer and use UDP at the transport layer. Our approach is significantly different in that CMT supports multihoming (or multirailing) at the transport layer, as part of the kernel, rather than at the application layer in user space. These projects often require network conditions such as latency and bandwidth to be determined statically beforehand, and the tests are primarily done with basic networking benchmark programs. The results are then statically fed in as input to the systems for making scheduling decisions.
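To make the static approach concrete, here is a minimal sketch of scheduling driven by pre-measured path bandwidths. It is an illustration only and is not taken from MuniCluster or RI2N/UDP: the function name static_stripe and the bandwidth values are hypothetical, and real systems would also account for latency.

```python
def static_stripe(message_len, path_bandwidths_mbps):
    """Split message_len bytes across paths in proportion to bandwidth.

    path_bandwidths_mbps holds positive integer bandwidths measured once,
    ahead of time; they are never updated while the application runs.
    """
    total_bw = sum(path_bandwidths_mbps)
    shares = [message_len * bw // total_bw for bw in path_bandwidths_mbps]
    # Integer division can leave a few bytes unassigned; give them to
    # the fastest path so the shares always sum to message_len.
    fastest = path_bandwidths_mbps.index(max(path_bandwidths_mbps))
    shares[fastest] += message_len - sum(shares)
    return shares


# Hypothetical pre-measured bandwidths for two Ethernet paths (Mbps).
measured = [980, 490]
print(static_stripe(3000, measured))  # prints [2000, 1000]
```

Because the weights are fixed before the run, a path that slows down at runtime keeps receiving the same share of every message.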
On the other hand, CMT can determine the bandwidth and latency characteristics of the network dynamically and adaptively as part of the transport protocol.

Fault tolerance is an important issue in commodity cluster environments, where network failures due to switches or software (mis)configuration do occur. SCTP has a mechanism for automatic failover in the case of link or path failure. Similar mechanisms have been implemented in MPI middleware [51], but these are unnecessary for SCTP-based middleware. SCTP uses special control chunks that act as heartbeat probes to test network reachability. Heartbeat frequency is user-controllable and can be tuned according to expectations for a given network. However, at the currently recommended setting [14], it takes approximately 15 seconds to fail over, which is a long time for an MPI application. Since CMT sends data to all destinations concurrently, a CMT sender has more information about all network paths and can leverage this information to detect network failures sooner. CMT is thus able to reduce network path failure detection time to one second [35]. This improvement with CMT is significant, and future research will investigate whether this time can be further reduced for cluster environments.

5.3 RDMA related

An interesting extension for RDMA capable devices is the Sockets Direct Protocol (SDP) defined by the Software Working Group (SWG) [25]. It was defined to serve two purposes: to maintain backward compatibility for socket based applications and to deliver the performance of the underlying RDMA devices. This extension allows socket applications to run on top of RDMA devices transparently by exporting a pseudo socket-like interface (Figure 5.1). SDP provides high performance sockets, thus allowing applications to utilize the network capabilities without any modifications.
The current SDP standard specifies implementation details for both InfiniBand and iWARP. However, these two implementations differ significantly in their connection setup. Internally, SDP is responsible for managing all the user level buffers and flow control in order to exploit possible use of zero copying, and this is done by exchanging SDP level control messages. Moreover, it currently supports only TCP/IP style sockets and provides all the necessary features and mappings for managing TCP port space, IP addressing, the connecting/accepting model, out-of-band data, and common socket options. The current SDP design often performs copy-based communication for small messages and zero-copy communication for large messages. SDP-iWARP holds more promise, as it traverses the traditional IP infrastructure and thus provides support for wide area networks. Many studies have shown that SDP can provide significant speed improvements in various workloads [11]. A similar approach from Myricom is the Sockets-GM module [34], a middleware layer that mimics socket semantics and replaces the traditional Ethernet protocol to allow for low latency, high speed data transfers.

Figure 5.1: Layer view of SDP (layers shown: Socket Application; Sockets API; user/kernel boundary; TCP/IP Sockets Provider; TCP/IP Transport Driver; Driver; Sockets Direct Protocol (SDP); Kernel Bypass; RDMA Semantics; NIC)

One other interesting device is Etherfabric, a network fabric being developed by SolarFlare [5]. This device serves as an Ethernet accelerator by using "onload" technology, which runs transport layer protocols within user space to avoid unnecessary kernel crossings, thus avoiding copying and reducing latency. This technology has been shown to improve latency substantially over traditional Ethernet network adapters with a kernel based TCP/IP stack.
Socket based programs can run without any modification; this is done by hijacking the linking process at runtime to use a special set of Etherfabric libraries. Each process is exposed to a virtual interface that is directly connected to the user level transport layer protocol (currently only TCP and UDP are supported).

Figure 5.2: TCP Latency test with NetPIPE (Less is better) (x-axis: Message Size (Bytes))

We had some experience with the 1 GbE Ethernet Accelerator, as we were trying to port SCTP to run over the device. Figure 5.2 shows the output of running NetPIPE over TCP (footnote 3) with and without Ethernet acceleration. We only show message sizes from 2 bytes to 8 Kbytes, since latency is important for small messages. The results show that the Ethernet Accelerator can improve latency by almost 40% for short messages. Compatibility with existing Ethernet devices is the main advantage of its design. This is a completely opposite approach to iWARP and InfiniBand, as it tries to redesign and improve the existing Ethernet infrastructure.

Footnote 3: NetPIPE website: http://www.scl.ameslab.gov/netpipe

Chapter 6

Future work and Conclusion

In this thesis, we have implemented a hybrid design of MPI middleware as a proof of concept model. Our model exploits one type of device, namely Ethernet based adapters, and is shown with only one kind of functional decomposition. In the future, we are interested in exploring more functional mappings with respect to more network devices and their capabilities. For instance, MPI collectives and MPI long message transfers are both good candidates.

The hybrid design also suggests future research in designing the MPI middleware with multiple internal threads. Though a thread model for MPI middleware may be overly complicated, it may allow more overlap between the internal components and thus improve latency.
We have also shown how SCTP can be used to support both socket and iWARP semantics. In the future, we would like to explore the fault tolerance aspect of the protocol with respect to MPI. For instance, we can monitor the association's status via notifications. This might help the middleware become more reactive to events indicating abnormal link conditions.

We are currently exploring the possible use of CMT to improve the latency of MPI messages [13]. CMT was originally designed to improve bandwidth. Reducing latency would require SCTP to make more dynamic scheduling decisions than round robin across the available links. Moreover, we also want CMT to be dynamic and adaptive, so that it can switch between these two modes of operation at runtime.

Even though traditional Ethernet devices along with transport layer protocols have been criticized for lack of performance, we have explored different techniques to improve their overall efficiency in many areas: the application layer, the transport layer, and the hardware layer with multirailing support. Moreover, there exist manufacturing roadmaps [7] for 40 and 100 GbE devices that will target the backbone infrastructure in large scale data centers, such as the type of clusters deployed at Google. With their low cost, availability, and the deployment of IPv6, we believe that Ethernet based devices will continue to evolve and become more prevalent in the near future.

Bibliography

[1] Myricom. Myrinet. http://www.myri.com.

[2] OpenFabrics Alliance. http://www.openfabrics.org/.

[3] Quadrics. http://www.quadrics.com.

[4] RDMA Consortium. Architectural specifications for RDMA over TCP/IP. http://www.rdmaconsortium.org.

[5] Solarflare Communication (Ethernet Accelerator). http://www.solarflare.com.

[6] Laksono Adhianto and Barbara Chapman. Performance Modeling of Communication and Computation in Hybrid MPI and OpenMP Applications. The 12th International Conference on Parallel and Distributed Systems (ICPADS), 2006.

[7] Ethernet Alliance. Overview of Requirements and Applications for 40 Gigabit and 100 Gigabit Ethernet. http://www.ethernetalliance.org/technology/white_papers/Overview_and°/„_Applications2.pdf.

[8] InfiniBand Trade Association. InfiniBand Architecture Specification. http://www.infinibandta.org/specs, October 2004.

[9] Olivier Aumage, Elisabeth Brunet, Guillaume Mercier, and Raymond Namyst. High-Performance Multi-Rail Support with the NewMadeleine Communication Library. In The Sixteenth International Heterogeneity in Computing Workshop (HCW 2007), workshop held in conjunction with IPDPS 2007, March 2007.

[10] S. Bailey and T. Talpey. The Architecture of Direct Data Placement (DDP) and Remote Direct Memory Access (RDMA) on Internet Protocols. ftp://ftp.rfc-editor.org/in-notes/rfc4296.txt, December 2005.

[11] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. Panda. Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2004.

[12] C. Bestler and R. Stewart. Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation. http://www.ietf.org/internet-drafts/draft-ietf-rddp-sctp-07.txt, September 2006.

[13] Janardhan Iyengar, Brad Penoff, Mike Tsai and Alan Wagner. Using CMT in SCTP-based MPI to exploit multiple interfaces in cluster nodes. In Proceedings, 14th European PVM/MPI Users' Group Meeting, Paris, France, September 2007.

[14] A. Caro. End-to-End Fault Tolerance Using Transport Layer Multihoming. PhD thesis, Computer Science Dept., University of Delaware, 2005.

[15] NASA Ames Research Center. Numerical aerodynamic simulation (NAS) parallel benchmark (NPB) benchmarks. http://www.nas.nasa.gov/Software/NPB/.

[16] Ananth Devulapalli, Dennis Dalessandro and Pete Wyckoff. Design and Implementation of the iWarp Protocol in Software. In Parallel and Distributed Computing and Systems (PDCS), November 2005.

[17] Nikolaos Drosinos and Nectarios Koziris. Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters. The 18th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2004.

[18] MPI Forum. MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org, March 1994.

[19] G. Burns, R. Daoud and J. Vaigl. LAM: An Open Cluster Environment for MPI. In Supercomputing Symposium '94, Toronto, Canada, June 1994.

[20] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.

[21] Patrick Geoffray. A Critique of RDMA. http://www.hpcwire.com/hpc/815242.html.

[22] William D. Gropp and Rajeev Thakur. An Evaluation of Implementation Options for MPI One-Sided Communication. In Proceedings, 12th European PVM/MPI Users' Group Meeting, pages 415-424, Sorrento, Italy, 2005.

[23] Renato Recio, Hemal Shah, James Pinkerton and Paul Culley. Direct data placement over reliable transports. http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-04.txt, February 2005.

[24] Janardhan R. Iyengar, Keyur C. Shah, Paul D. Amer, and Randall Stewart. Concurrent multipath transfer using SCTP multihoming. In SPECTS 2004, San Jose, July 2004.

[25] Ellen Deleganes, James Pinkerton and Michael Krause. Socket Direct Protocol (SDP) for iWARP over TCP (v1.0). http://www.rdmaconsortium.org/home/draft-pinkerton-iwarp-sdp-v1.0.pdf, October 2003.

[26] Jim Pinkerton, Jeff Hilland, Paul Culley and Renato Recio. RDMA Protocol Verbs Specification. http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf, April 2003.

[27] W. Jiang, J. Liu, H. Jin, D. Panda, W. Gropp, and R. Thakur. High performance MPI-2 one-sided communication over InfiniBand. In Proceedings of 4th IEEE/ACM Int'l Symp. on Cluster Computing and the Grid, April 2004.

[28] Humaira Kamal, Brad Penoff, Mike Tsai, Edith Vong, and Alan Wagner. Using SCTP to hide latency in MPI programs. In Heterogeneous Computing Workshop: Proceedings of the 2006 IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2006.

[29] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP versus TCP for MPI. In Supercomputing '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, Washington, DC, USA, 2005. IEEE Computer Society.

[30] Sourabh Ladha and Paul Amer. Improving multiple file transfers using SCTP multistreaming. In Proceedings IPCCC, April 2004.

[31] J. Liu, J. Wu, S. Kini, P. Wyckoff, and D. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In 11th Annual ACM International Conference on Supercomputing (ICS '03), June 2003.

[32] Jiuxing Liu, Abhinav Vishnu, and Dhabaleswar K. Panda. Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 33, Washington, DC, USA, 2004. IEEE Computer Society.

[33] Nader Mohamed, Jameela Al-Jaroodi, Hong Jiang, and David R. Swanson. High-performance message striping over reliable transport protocols. The Journal of Supercomputing, 38(3):261-278, 2006.

[34] Myrinet. Sockets-GM Overview and Performance. http://www.myri.com/myrinet/performance/Sockets-GM/.

[35] Preethi Natarajan, Janardhan Iyengar, Paul D. Amer, and Randall Stewart. Concurrent multipath transfer using transport layer multihoming: Performance under network failures. In MILCOM, Washington, DC, USA, October 2006.

[36] Ohio State University. OSU MPI Benchmarks, 2007. http://mvapich.cse.ohio-state.edu.

[37] Takayuki Okamoto, Shin'ichi Miura, Taisuke Boku, Mitsuhisa Sato, and Daisuke Takahashi. RI2N/UDP: High bandwidth and fault-tolerant network for PC-cluster based on multi-link Ethernet. In 21st International Parallel and Distributed Processing Symposium (IPDPS'07), Long Beach, California, March 2007. Workshop on Communication Architecture for Clusters.

[38] K. Vaidyanathan, P. Balaji, H. W. Jin and D. K. Panda. Supporting iWARP Compatibility and Features for Regular Network Adapters. In the proceedings of the workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies (RAIT), September 2005.

[39] S. Bhagvat, D. K. Panda, R. Thakur, P. Balaji, W. Feng and W. Gropp. Analyzing the Impact of Supporting Out-of-Order Communication on In-order Performance with iWARP. In the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2007.

[40] R. Recio, S. Bailey, P. Culley, U. Elzur and J. Carrier. Marker PDU Aligned Framing for TCP Specification. http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt, October 2006.

[41] Brad Penoff and Alan Wagner. Towards MPI progression layer elimination with TCP and SCTP. In HIPS 2006/IPDPS 2006, April 2006.

[42] D. Garcia, J. Hilland, R. Recio, P. Culley and B. Metzler. An RDMA protocol specification. http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt, April 2005.

[43] Rolf Rabenseifner. Hybrid Parallel Programming: Performance Problems and Chances. In proceedings of the 45th Cray User Group Conference (CUG), Columbus, Ohio, May 2003.

[44] R. Rajamani, S. Kumar, and N. Gupta. SCTP versus TCP: Comparing the performance of transport protocols for web traffic. Technical report, University of Wisconsin-Madison, May 2002.

[45] Matthew L. Curry, Ron Brightwell. The Case for an RDMA Extension to MPI. In ACM/IEEE International Conference on High-Performance Computing, Networking, and Storage (SC05), November 2005.

[46] A. Vishnu, G. Santhanaraman, S. Narravula, A. R. Mamidala and D. K. Panda. High Performance MPI over iWARP: Early Experiences. In Int'l Conference on Parallel Processing, September 2007.

[47] Michael Scharf and Sebastian Kiesel. Head-of-line Blocking in TCP and SCTP: Analysis and Measurements. In GLOBECOM '06, IEEE, pp. 1-5, November 2006.

[48] Randall R. Stewart and Qiaobing Xie. Stream Control Transmission Protocol (SCTP): A Reference Guide. Addison-Wesley Longman Publishing Co., Inc., 2002.

[49] Jaspal Subhlok, Shreenivasa Venkataramaiah, and Amitoj Singh. Characterizing NAS benchmark performance on shared heterogeneous networks. In IPDPS '02: Proceedings of the 16th International Parallel and Distributed Processing Symposium, page 91, Washington, DC, USA, 2002. IEEE Computer Society.

[50] University of Mannheim, University of Tennessee, NERSC/LBNL. Top 500 Computer Sites, 2007. http://www.top500.org/.

[51] Abhinav Vishnu, Prachi Gupta, Amith R. Mamidala, and Dhabaleswar K. Panda. A software based approach for providing network fault tolerance in clusters with uDAPL interface: MPI level design and performance evaluation. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 85, New York, NY, USA, 2006. ACM Press.

[52] W. Gropp and E. Lusk. MPICH working note: Creating a new MPICH device using the channel interface. Technical Report ANL/MCS-TM-213, Argonne National Laboratory, July 1996.

[53] T.S. Woodall, R.L. Graham, R.H. Castain, D.J. Daniel, M.W. Sukalski, G.E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, J.J. Dongarra, J.M. Squyres, V. Sahay, P. Kambadur, B. Barrett, and A. Lumsdaine. Open MPI's TEG point-to-point communications methodology: Comparison to existing implementations. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 105-111, Budapest, Hungary, September 2004.

[54] T.S. Woodall, R.L. Graham, R.H. Castain, D.J. Daniel, M.W. Sukalski, G.E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, J.J. Dongarra, J.M. Squyres, V. Sahay, P. Kambadur, B. Barrett, and A. Lumsdaine. TEG: A high-performance, scalable, multi-network point-to-point communications methodology. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.

