UBC Theses and Dissertations

Implementation of DAFS on the Linux platform Xu, Yue 2002

Implementation of DAFS on the Linux Platform

by

Yue Xu

B.Eng., Beijing University of Posts and Telecommunications, 1996
M.Eng., Beijing University of Posts and Telecommunications, 1999

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

August 2002

© Yue Xu, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada
September 2002

Abstract

The rapid growth of the Internet drives research and industry in the field of network computing systems. Various such systems have been developed to provide better performance for Internet services. With the introduction of Virtual Interface (VI) and other new interconnect technologies, a new file access protocol, the Direct Access File System (DAFS), has been proposed to leverage these technologies in a new kind of network computing system. DAFS is based on NFS Version 4. However, since it targets a local file sharing environment and is built on new interconnection technologies, it differs from NFS Version 4 in certain aspects. By eliminating memory copies and operating system involvement during data transfer, DAFS has been shown to have much better performance than NFS Version 4.
Building a DAFS server in the Linux kernel is the essential motivation behind this thesis. By implementing the basic file operations in DAFS, we are able to compare DAFS performance results with those from NFS systems. Focusing on the development of the file locking mechanism in DAFS, we are interested to see how the two new locks introduced by DAFS, Persist Locks and AutoRecovery Locks, work in network file systems. In this thesis, we describe the VI technologies, the DAFS protocol, and the issues related to the design and implementation of a DAFS server in the Linux kernel. We also describe the file locking mechanism in DAFS and the issues related to its implementation.

Table of Contents

Abstract
Table of Contents
List of Figures
Acknowledgements

Chapter 1 Introduction
  1.1 Why DAFS?
  1.2 Related Work
  1.3 Thesis Outline

Chapter 2 Background
  2.1 Virtual Interface
    2.1.1 VI Architecture Components
      VI Consumer
      VI Provider
      Virtual Interface
      Completion Queue
    2.1.2 VI Operations
      VI Creation and Destruction
      Memory Management
      Connection Management
      Data Transfer
  2.2 DAFS
    2.2.1 Communication Model
      Session Management
      Message Handling
    2.2.2 File System Operations
      NFS-Derived Operations
      Data Transfer Operations
      DAFS Chaining
    2.2.3 Message Formats
      Request/Response Header
      Procedure-specific Components
      Encoding

Chapter 3 File Locking
  3.1 Linux File Locking
  3.2 NFS Locking
    3.2.1 Clientid & Stateid
    3.2.2 Lease
    3.2.3 Failure Recovery
  3.3 DAFS Locking
    3.3.1 PERSIST Locks
    3.3.2 AUTORECOVERY Locks

Chapter 4 Design & Implementation
  4.1 User Level vs. Kernel Level
  4.2 KVIPL
  4.3 VI Management
  4.4 DAFS & VFS
  4.5 DAFS Operations
  4.6 DAFS Locking & POSIX Locking
  4.7 Locking States
  4.8 PERSIST Locks
  4.9 AUTORECOVERY Locks

Chapter 5 Performance
  5.1 VI Performance
  5.2 DAFS Performance
  5.3 Overhead of DAFS Locks

Chapter 6 Conclusion

Bibliography

List of Figures

Figure 1-1 A Typical NAS System Configuration
Figure 1-2 File Access Process in NFS
Figure 1-3 File Access Process in DAFS
Figure 2-1 VI Architecture
Figure 2-2 Virtual Interface
Figure 2-3 Message coding in DAFS
Figure 3-1 POSIX locks in the Linux Kernel
Figure 3-2 Lock overlapping for new lock request
Figure 3-3 Lock overlapping for unlock request
Figure 4-1 DAFS client & DAFS server
Figure 4-2 Work Queue & Completion Queue
Figure 4-3 DAFS & VFS & Local File System
Figure 4-4 DAFS Locking & POSIX Locking
Figure 4-5 Data Structures used in a DAFS server
Figure 4-6 Two designs for AUTORECOVERY Locks
Figure 4-7 Buffer Lists in a DAFS server
Figure 5-1 cLAN vs. Gigabit Ethernet: Throughput
Figure 5-2 cLAN vs. Gigabit Ethernet: Latency
Figure 5-3 DAFS Throughput
Figure 5-4 DAFS CPU load
Figure 5-5 DAFS vs. NFS V4: Throughput
Figure 5-6 DAFS vs. NFS V4: CPU usage
Figure 5-8 Throughput: With & Without Lock

Acknowledgements

I would like to thank my supervisor, Norm Hutchinson, who both led and motivated this work. Norm was an unfailing resource whenever I had a problem in my research. This work would not have been possible without him. I would also like to thank Mike Feeley for his valuable comments, and for all the things I learned from him in our DSG seminars that greatly helped this research work. My thanks also go to Steve Matthews, who carefully proofread my thesis, and to the other people in DSG who helped me during my thesis research.
Chapter 1 Introduction

Network Attached Storage (NAS) is a proven technology in the storage system industry. Using file access protocols, NAS has unique features like fine grain data management, heterogeneous support and easy configuration that are essential to its success in today's market. However, the overhead introduced by traditional file access protocols like NFS and CIFS prevents a NAS system from getting the same performance as a Storage Area Network (SAN) system can get. With the emergence of a new file access protocol called the Direct Access File System (DAFS), a NAS system can now avoid the overhead in traditional NAS systems. DAFS deploys recent memory-to-memory network technologies like Virtual Interface (VI) that dramatically reduce CPU computing load and latency in a network architecture.

One motivation for this thesis is to develop DAFS on a Linux platform. Linux has been successful as a server operating system in recent years, and will play a more important role in the future server market. Implementing DAFS on Linux is meaningful, as DAFS promises to improve the performance of current Linux servers. Since no one has done this before, we are interested to see what problems arise when we develop DAFS on a Linux platform. How much can DAFS improve the performance of current Linux NAS boxes?

File locking is very important in a network file system. DAFS file locking is based on the file locking in NFS Version 4. However, DAFS includes two new kinds of locks: PERSIST locks and AUTORECOVERY locks. All these locks are proposed to achieve better data integrity in a network file system. While it is always interesting to see how the DAFS lock mechanism works in our implementation, a more interesting question is how the lock mechanism affects performance. All these things are what we are trying to discover in this thesis.

1.1 Why DAFS?

A typical NAS system configuration is shown in Figure 1-1.
Figure 1-1 A Typical NAS System Configuration

In Figure 1-1, application servers access the data in storage systems through a file access protocol. A file server runs on the storage systems to serve the requests from application servers, which act as clients. Traditional file access protocols like NFS require the involvement of operating systems, which usually means extra computing cycles and memory copies. Figure 1-2 shows the process of handling file access requests in an NFS client.

[Figure 1-2 shows the path from application buffers through UFS/NFS, the buffer cache, TCP/IP and the driver down to the NIC.]

Figure 1-2 File Access Process in NFS

Figure 1-2 shows that there is one data copy between the buffer cache and application buffers. To access a network interface card, an application needs to go through the file system layer and the networking layer, which adds extra CPU computing cycles. The use of network protocols like TCP/IP implies some extra data to be transferred, and some restrictions like a maximum transfer size are applied.

VI, which was proposed by Compaq, Intel and Microsoft in 1997, is a specification for a high-performance networking interface for computer clusters. VI has two new features that distinguish it from traditional network architectures: one is remote memory-to-memory copy, and the other is direct application access to a network interface card. Two computers using VI can transfer bulk data directly between appropriately aligned memory buffers without going through network protocol processing: all message fragmentation, assembly and alignment are implemented in VI hardware. An application based on VI can access a network interface card without the involvement of operating systems.

DAFS was proposed by Network Appliance in 2000 to explore the new features of VI. DAFS is based on NFS Version 4, but is optimized for the local file-sharing environment to get the best performance and lowest latency. A DAFS client implementation based on VI is shown in Figure 1-3.
[Figure 1-3 shows a DAFS client accessing the NIC directly from application buffers.]

Figure 1-3 File Access Process in DAFS

In contrast to an NFS client, a DAFS client can read and write a file without any memory copy on the client side. Using "direct" file operations, READ and WRITE requests coming from a DAFS client will cause a DAFS server to issue remote DMA operations to that client, so the actual reading or writing operations won't cost any CPU cycles at the client side. The high performance of the VI interconnect architecture also reduces the response time for DAFS operations.

1.2 Related Work

The DAFS Collaborative [1] is a group of more than 75 companies who are interested in DAFS technologies. Currently its main objective is "to create, review and submit a proposal for the DAFS protocol specification". Several companies in the DAFS Collaborative are doing active research on DAFS.

Network Appliance is a key player in the DAFS Collaborative. It has been playing a leading role in both the protocol specification and implementation of DAFS. Network Appliance has implemented a kernel DAFS client on a Solaris-based database server running IBM DB2 [2]. The implementation is independent of the network transport, as it can run on both gigabit Ethernet host bus adapters and fibre channel host bus adapters. The database server software doesn't need to be changed, as it interacts with the DAFS client through standard I/O calls. Network Appliance's implementation shows that a NAS system can get high performance while keeping its unique features as well.

The research group at Fujitsu Prime Software Technologies Limited (Fujitsu PST) has been focusing on the direct data transfer mechanism, and they have developed a kernel-implemented DAFS/VI prototype [3].
Their tests on the prototype show that a kernel-implemented DAFS/VI has much better performance than NFS: the data transfer rate in DAFS is more than 4 times that of NFS, while the CPU usage in a DAFS client is less than 3.5 percent of that in an NFS client.

Some universities are also doing research on DAFS. The network storage & cluster research group at Duke University has implemented a DAFS client and a DAFS-compliant memory-based file server [4]. That DAFS client allows direct I/O from applications based on an external I/O toolkit (TPIE) developed at Duke. The file and storage system research group at Harvard University has completed an initial implementation of a DAFS reference server [5], and they are focusing on caching in a DAFS client. The CITI group at the University of Michigan has been working on their NFS Version 4 Open Source Reference Implementation Project [6]. They have developed an NFS Version 4 client and a server running in the Linux kernel. Currently they are working on delegation and async aspects of the client.

1.3 Thesis outline

In Chapter 2, we review VI and DAFS to provide the background for this thesis. We describe the file locking mechanisms used in Linux local file systems, NFS and DAFS in Chapter 3. Understanding these locking mechanisms is necessary to understand our own implementation. The issues related to design and implementation are discussed in Chapter 4, and in Chapter 5 we evaluate some performance results from experimental tests. Finally, we present our conclusion and give a description of future work in Chapter 6.

Chapter 2 Background

2.1 Virtual Interface

2.1.1 VI Architecture Components

The VI Architecture includes four components: Virtual Interfaces, Completion Queues, VI consumers and VI providers [7]. Figure 2-1 shows the relationship between these components in the VI architecture.
[Figure 2-1 shows a VI consumer (application, OS communication interface and VI User Agent) in user mode above a VI provider (VI Kernel Agent and VI Network Adapter), connected through VIs.]

Figure 2-1 VI Architecture

VI consumer

A VI consumer usually includes an application, an operating system communication interface and a VI User Agent. VI itself is a communication interface. However, applications built on other interfaces can still use the VI service. An operating system communication interface is provided so existing applications don't have to be rewritten. A VI User Agent hides the details of the VI and VI Kernel Agent from the application. Usually this agent is a Virtual Interface Provider Library provided by the VI hardware producer.

VI provider

A VI provider includes both hardware and software: a VI network adapter and a VI kernel agent. In the VI architecture, all message fragmentation, assembly and alignment are implemented in the VI network adapter. A VI kernel agent usually is the driver of the VI network adapter, which performs the setup and resource management functions in the operating system.

Virtual Interface

A VI consumer accesses a VI provider directly through a Virtual Interface. A Virtual Interface is composed of a pair of Work Queues: a send queue and a receive queue. By posting VI descriptors on the send queues and receive queues, a VI consumer can send/receive data to/from another VI consumer in a VI connection. A VI descriptor is a data structure that includes information about the data, like the data memory address, length, etc. To send or receive data, a VI consumer should specify the data information in a VI descriptor, post the VI descriptor in a suitable work queue, and then wait for the completion notification sent from the VI provider. Figure 2-2 illustrates a Virtual Interface.
[Figure 2-2 shows a VI's send and receive queues holding descriptors, with completion status flowing back from the VI network adapter.]

Figure 2-2 Virtual Interface

Completion Queue

There are two ways by which a VI consumer can know whether a data transfer is finished: one is to poll the status of the VI descriptor in a work queue to see if it is completed; the other is to use a completion queue. A VI consumer can create a completion queue and bind work queues to this completion queue. If the status of any descriptor in those work queues becomes "completed", the completion queue will get a notification from the VI provider. The VI consumer should have a process monitoring the status of the completion queue, so that whenever the completion queue gets a notification, it knows that a data transfer has completed.

2.1.2 VI Operations

The Virtual Interface Provider Library (VIPL) provides all the functions that a VI consumer needs to use the Virtual Interface. They can be arranged into the following four groups, according to their functions:

• Creation and destruction
• Memory management
• Connection management
• Data transfer

VI Creation and Destruction

A VI consumer should create a VI first in order to use it. VipCreateVi is the function used to create a VI. The VI consumer needs to provide a VI NIC handle to call this function, and will get a handle for the newly created VI instance if the operation is successful. All the following operations related to this VI instance will interact with this VI handle only. The VipDestroyVi function is used to destroy a VI instance created by the VI consumer.

Memory Management

A VI descriptor in a Work Queue has information about the memory that the VI consumer will use. All memory, though, must be registered before it can be used. This registration locks the allocated physical memory so it will not be swapped out by the operating system kernel later. Since the VI NIC performs DMA operations, VI should make sure that all memory addresses are mapped beforehand. VipRegisterMem is the function for memory registration.
A VI consumer should provide the starting address and length of a memory region when calling this function, and will get a handle for the memory if the operation is successful. VipDeregisterMem is the function for deregistering the memory.

Connection Management

Data transfer in the VI Architecture is connection-oriented, which means two VIs must build a connection before they can transfer data. Similar to socket programming, VIPL provides functions like VipConnectRequest, VipConnectWait, VipConnectAccept, VipConnectReject and VipDisconnect for connection management. In a client/server model, if a client wants to build a VI connection to a server, it will call the VipConnectRequest function. The server is made ready for receiving incoming connection requests by calling the VipConnectWait function. After getting a connection request, the server will decide whether to accept the connection request (VipConnectAccept) or not (VipConnectReject). The VipDisconnect function is used to disconnect an existing VI connection.

Data Transfer

There are two data-transfer models in the VI architecture: one is the Send/Receive messaging model used in traditional network architectures; the other is the Remote Direct Memory Access (RDMA) model. The first model is the one we are familiar with, while the second is proposed to reduce the CPU computing load for data transfer in VI. In the RDMA model, it is the CPU in the VI network adapter that writes the data to memory, not the CPU in the main machine. There are two RDMA operations: RDMA WRITE and RDMA READ. The VI specification specifies that all VI adapters should support RDMA WRITE, while support for RDMA READ is optional. These two transfer models, however, use the same VI operations. The difference is in the VI descriptors posted in the Work Queues, whose CONTROL SEGMENT specifies the data transfer model to use. VIPL provides more than 10 functions related to data transfer.
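The descriptor and queue mechanics that these functions operate on can be sketched with a small model. The following is an illustrative Python toy, not the VIPL C API: the class and method names are invented for the sketch, the `rdma` flag stands in for a descriptor's CONTROL SEGMENT, and `complete_next()` simulates the NIC finishing a transfer.

```python
from collections import deque

class Descriptor:
    """Models a VI descriptor: where the data lives, how long it is,
    and a status flag the provider flips when the transfer finishes."""
    def __init__(self, addr, length, rdma=False, remote_addr=None):
        self.addr, self.length = addr, length
        self.rdma = rdma                  # stands in for the CONTROL SEGMENT
        self.remote_addr = remote_addr    # only meaningful for RDMA descriptors
        self.done = False                 # status a consumer could poll

class WorkQueue:
    """A send or receive queue, optionally bound to a completion queue."""
    def __init__(self, cq=None):
        self.pending = deque()
        self.cq = cq
    def post(self, desc):                 # VipPostSend / VipPostRecv analogue
        self.pending.append(desc)
    def complete_next(self):              # simulates the NIC completing work
        desc = self.pending.popleft()
        desc.done = True
        if self.cq is not None:           # bound queues notify the CQ
            self.cq.notify(desc)
        return desc

class CompletionQueue:
    def __init__(self):
        self.notifications = deque()
    def notify(self, desc):
        self.notifications.append(desc)
    def poll(self):                       # non-blocking check, VipCQDone-style
        return self.notifications.popleft() if self.notifications else None

cq = CompletionQueue()
send_q = WorkQueue(cq)                    # a VI: a send/receive queue pair
recv_q = WorkQueue(cq)

send_q.post(Descriptor(addr=0x1000, length=4096))
recv_q.post(Descriptor(addr=0x2000, length=4096))

assert cq.poll() is None                  # nothing has finished yet
send_q.complete_next()                    # "NIC" finishes the send
done = cq.poll()
assert done is not None and done.done and done.addr == 0x1000
```

The two completion styles from Section 2.1.1 both appear here: a consumer may poll an individual descriptor's `done` flag (as VipSendDone would), or wait on the single completion queue that all bound work queues feed (as VipCQWait would).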
VipPostSend, VipSendDone and VipSendWait are used to send data on a VI connection. By calling the VipPostSend function, a VI consumer posts a VI descriptor in a Send Queue. The VI consumer can detect whether the operation has finished by checking the status of the descriptor. VipSendWait blocks the caller until the descriptor is completed and removed from the Send Queue, while the VipSendDone function immediately returns a value indicating whether the descriptor is completed or not. VipPostRecv, VipRecvDone and VipRecvWait are similar to the previous three functions except that they are used when receiving data. VipCQDone and VipCQWait are used to check the status of descriptors using a Completion Queue. As with VipSendWait and VipSendDone, VipCQWait blocks the caller while VipCQDone does not.

2.2 DAFS

"The Direct Access File System (DAFS) is a file access and management protocol designed for local file sharing or a clustered environment. It addresses two primary goals:

• Provides low latency, high throughput, and low overhead data movement that takes advantage of modern memory-to-memory networking technologies.

• Defines a set of file management and file access operations for local file sharing requirements. [8]"

DAFS is based on NFS Version 4. Most features of NFS Version 4 (like openness, good failure recovery, ACL support, and good interoperability) are essential to DAFS. However, since NFS Version 4 is designed for a wide file sharing environment like the Internet, DAFS adds some additional operations that are specifically designed for a local file sharing environment. To take advantage of the memory-to-memory interconnect capability, DAFS changes the message format and transport mechanism used in NFS Version 4 while keeping the NFS Version 4 operation semantics unchanged. The following sections describe the communication model, file system operations, and message format of DAFS.
Other high-performance inter-processor communication technologies like InfiniBand might also be used as DAFS' transport layer, but for now VI is the only feasible one. The following discussion assumes VI as the transport layer of DAFS.

2.2.1 Communication Model

The DAFS communication model includes session management and message handling. Session management involves setting up and destroying communication channels, while message handling involves sending and receiving messages in DAFS.

Session Management

DAFS is a session-based protocol. Before a DAFS client sends requests to a server, it should set up a session with the server. Setting up a DAFS session provides the following functions [8]:

• Authentication
• Negotiation of attributes related to message handling
• Linking low-level connections into a logical entity
• Providing context for recovery operations

Since DAFS is designed for a local file-sharing environment, the DAFS security model is based on a trusted client-server relationship. In the DAFS security model, there is an initial authentication process that usually authenticates a DAFS client to a server. If this initial authentication is successful, the DAFS client can pre-register one or more credentials and will use those credentials in subsequent operations. The GSS protocol is chosen to provide security services. However, DAFS only uses its authentication services.

A DAFS client can negotiate some session attributes like protocol version and maximum transfer size with the server during session initiation. These attributes provide the context for subsequent message handling. Once set, they are kept unchanged through the whole session.

In DAFS, a VI connection is called a communication channel. There are three types of communication channels in DAFS: the Operation channel, the Back-Control channel, and the RDMA Read channel. A DAFS session can have up to three communication channels, one of each type.
A DAFS session must have an operation channel, which is used for basic operations initiated by a DAFS client. A back-control channel is used for operations initiated by a DAFS server, while an RDMA read channel is used only for RDMA read operations initiated by a DAFS server.

A DAFS session also provides the context for file operations. A DAFS server is required to keep a response cache of recent state-modifying operations in a session. After a system failure, a DAFS client can do some recovery operations for the previous sessions by accessing the response cache in the DAFS server.

DAFS provides the following operations for session management:

• DAFS_PROC_CLIENT_CONNECT Create a new session.
• DAFS_PROC_CLIENT_AUTH Authenticate a client to a server.
• DAFS_PROC_SERVER_AUTH Authenticate a server to a client.
• DAFS_PROC_CLIENT_CONNECT_AUTH Create a new session and authenticate the client to the server.
• DAFS_PROC_CONNECT_BIND Bind a new connection to a session.
• DAFS_PROC_CLIENT_DISCONNECT Terminate a session.
• DAFS_PROC_SECINFO Get the security authentication methods that the server supports.
• DAFS_PROC_REGISTER_CRED Register the credentials.
• DAFS_PROC_RELEASE_CRED Release the credentials.
• DAFS_PROC_REREGISTER_CRED Reregister the credentials.
• DAFS_PROC_CHECK_RESPONSE Check the response of a previous request in the server's response cache.
• DAFS_PROC_FETCH_RESPONSE Fetch information from the server's response cache.
• DAFS_PROC_DISCARD_RESPONSE Discard the response cache for a disconnected session.

Message Handling

As we discussed in the previous section, there are two data transfer models in VI: the send/receive model and the RDMA model.
Correspondingly, message exchange in DAFS can be grouped into two categories: inline data transfer and direct data transfer. Inline data transfer adopts the send/receive model of VI, while direct data transfer adopts the RDMA model.

In inline data transfer mode, data always comes with a message header in one request or response message. For example, if a DAFS client wants to do a write operation in inline data transfer mode, it will post a VI send descriptor that specifies the memory addresses of both the message header and the data the client wants to write to a file. Knowing the memory addresses, VI will pack both the header and the data into one request message and send it to the server. At this time, the server should have a VI receive descriptor ready and waiting for this request message. When the request message arrives, it will be written to the memory address specified by the receive descriptor at the server side. In the same way, the server will send the reply of the write operation back to the client.

The message header and the data can be discontiguous in memory, and the data itself can belong to different memory regions. A VI send/receive descriptor includes a list of memory address pointers, each of which can point to a different memory area. VI packs/unpacks all the data in different memory areas into/from one message; this is called VI's scatter/gather capability. Using this capability, a receiver can put the data it receives in the right place and thus avoid unnecessary memory copies. The receiver can post a VI receive descriptor including two memory address pointers: one points to a memory area for the message header; the other points to the desired memory address for the data. Once the receive operation is done, the data receiver already has the data at the desired memory address.

Direct data transfer mode takes advantage of the RDMA capability of VI. In this transfer mode, data is transferred separately from the message header.
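The scatter/gather behaviour of inline transfer can be sketched as follows. This is a toy Python model, not VI hardware: buffers are bytearrays, a "wire message" is a byte string, and the segment lists play the role of a descriptor's list of memory address pointers.

```python
def gather(segments):
    """Pack a list of (buffer, offset, length) segments into one wire
    message, as the sender's VI hardware does for a send descriptor."""
    return b"".join(bytes(buf[off:off + ln]) for buf, off, ln in segments)

def scatter(message, segments):
    """Unpack one wire message into the receiver's segments in order, so
    the payload lands directly in the chosen buffers (no extra copy)."""
    pos = 0
    for buf, off, ln in segments:
        buf[off:off + ln] = message[pos:pos + ln]
        pos += ln

# Client side: the header and the data live in different memory regions.
header = bytearray(b"DAFSWRITEHDR")     # 12-byte pretend request header
data   = bytearray(b"hello, dafs!")     # 12-byte payload to write
wire = gather([(header, 0, len(header)), (data, 0, len(data))])

# Server side: one receive descriptor with two segments, one for the
# header area and one pointing straight at the destination file buffer.
hdr_buf  = bytearray(12)
file_buf = bytearray(12)
scatter(wire, [(hdr_buf, 0, 12), (file_buf, 0, 12)])

assert hdr_buf == b"DAFSWRITEHDR"       # header landed in its own area
assert file_buf == b"hello, dafs!"      # payload landed where it belongs
```

The point of the second descriptor segment is exactly the copy-avoidance described above: because the receive descriptor names the final destination buffer, the payload never passes through an intermediate buffer that would need to be copied out.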
For a DAFS client that wants to do a write operation, first it sends a message including the request message header to the server. The message also specifies the memory address where the data is located at the client side. Sending and receiving this message is just the same as in inline data transfer. Once the server gets this message, it will call the RDMA read function at the VI layer. The server posts a VI send descriptor specifying both the source and destination addresses of the data, and VI will do the RDMA read operation and take care of the rest. The DAFS client doesn't have to post a VI descriptor for this RDMA operation. If the RDMA read operation is successful, the server will have the data at its specified memory address. So a DAFS direct data transfer actually consists of one inline data transfer plus an RDMA operation.

Why doesn't the DAFS client call an RDMA write operation directly in the previous scenario? While that seems more straightforward, the DAFS client would need to get the server memory address before it performs the RDMA write operation. Depending on the memory management implementation at the server side, the server memory address may be invalidated during a DAFS session. For security and stability reasons, a DAFS server does not want clients to perform RDMA operations on invalid memory addresses.

As we mentioned earlier, a DAFS session can have up to three channels. If a DAFS session has an RDMA read channel, usually all RDMA read operations are performed in that channel. RDMA read operations, though, don't have remote gather capability. If a DAFS client specifies a set of noncontiguous memory buffers, the server has to do one RDMA read operation for each memory buffer.
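The server side of a direct write, including the lack of remote gather, can be sketched with a toy model. This is illustrative Python with invented names: registered client memory is modeled as a dict of regions, and `rdma_read` stands in for the server-issued RDMA read.

```python
class ClientMemory:
    """Registered client memory that the server may RDMA-read from."""
    def __init__(self):
        self.regions = {}                    # addr -> bytearray
    def register(self, addr, data):          # VipRegisterMem analogue
        self.regions[addr] = bytearray(data)
    def rdma_read(self, addr, length):       # server-initiated pull
        return bytes(self.regions[addr][:length])

def direct_write(server_file, client_mem, buffer_list):
    """Server side of a direct write: the inline request carried only the
    client's (addr, length) list; the server pulls each buffer with one
    RDMA read, since RDMA read has no remote gather capability."""
    reads = 0
    for addr, length in buffer_list:
        server_file += client_mem.rdma_read(addr, length)
        reads += 1
    return server_file, reads

mem = ClientMemory()
mem.register(0x1000, b"first ")
mem.register(0x8000, b"second")              # noncontiguous with the first
contents, nreads = direct_write(b"", mem, [(0x1000, 6), (0x8000, 6)])
assert contents == b"first second"
assert nreads == 2                           # one RDMA read per buffer
```

The final assertion is the cost noted above: two noncontiguous client buffers force two separate RDMA read operations, where a single gather would otherwise have sufficed.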
The two data transfer modes are complementary and both indispensable in DAFS: inline data transfer mode is efficient for small data transfers and necessary for environments where remote DMA operations are not desired, while for large data transfers, direct data transfer mode provides higher throughput and lower CPU usage.

2.2.2 File System Operations

DAFS includes about 40 file system operations. Some of these operations are derived from NFS Version 4, while others are new or have different functions in DAFS.

2.2.2.1 NFS-Derived Operations

Some file operations have the same semantics and functions in all file systems, although other aspects of these operations, like message format, may differ in different environments. These file operations include file creation, file deletion, directory operations, etc. The following operations in DAFS are derived from NFS Version 4:

• DAFS_PROC_NULL Standard NULL operation.
• DAFS_PROC_ACCESS Check an object's access rights.
• DAFS_PROC_CLOSE Close a file.
• DAFS_PROC_COMMIT Commit data cached on the server.
• DAFS_PROC_CREATE Create a non-regular file object.
• DAFS_PROC_DELEGPURGE Purge delegations awaiting recovery.
• DAFS_PROC_DELEGRETURN Return a delegation.
• DAFS_PROC_LINK Create a hard link to a file.
• DAFS_PROC_LOOKUP Look up a file object given its name.
• DAFS_PROC_LOOKUPP Look up the parent directory.
• DAFS_PROC_NVERIFY Verify difference in attributes.
• DAFS_PROC_OPEN Open a regular file.
• DAFS_PROC_OPENATTR Open a named attribute directory.
• DAFS_PROC_OPEN_DOWNGRADE Reduce open file access rights.
• DAFS_PROC_REMOVE Remove a file object.
• DAFS_PROC_RENAME Rename a directory entry.
• DAFS_PROC_VERIFY Verify equality of attributes.
• DAFS_PROC_BC_NULL NULL operation on the back-control channel.
• DAFS_PROC_BC_RECALL Recall an open delegation.
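A common way for a server to organize this many operations is a dispatch table from procedure number to handler, so that an incoming request header routes directly to the right implementation. The sketch below is illustrative Python, not the thesis' kernel code: the procedure numbers and the stand-in handlers are invented for the example.

```python
# Hypothetical procedure numbers; the DAFS specification assigns its own.
DAFS_PROC_LOOKUP, DAFS_PROC_OPEN, DAFS_PROC_CLOSE = 4, 7, 8

def do_lookup(args):                    # stand-in handlers for the sketch
    return ("filehandle", args["name"])

def do_open(args):
    return ("open-ok", args["fh"])

def do_close(args):
    return ("close-ok", args["fh"])

# Procedure number -> handler: one table covers all supported operations.
DISPATCH = {
    DAFS_PROC_LOOKUP: do_lookup,
    DAFS_PROC_OPEN:   do_open,
    DAFS_PROC_CLOSE:  do_close,
}

def handle_request(procedure, args):
    """Route a decoded request to its handler, or report an unsupported
    procedure, as a server would when building the response status."""
    handler = DISPATCH.get(procedure)
    if handler is None:
        return ("err-notsupp", procedure)
    return handler(args)

assert handle_request(DAFS_PROC_LOOKUP, {"name": "foo"}) == ("filehandle", "foo")
assert handle_request(99, {})[0] == "err-notsupp"
```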
Data Transfer Operations

There are two operations for inline data transfer in DAFS, and another two for direct data transfer:

• DAFS_PROC_READ_INLINE Read data from a file inline.
• DAFS_PROC_WRITE_INLINE Write data to a file inline.
• DAFS_PROC_READ_DIRECT Read data from a file using RDMA.
• DAFS_PROC_WRITE_DIRECT Write data to a file using RDMA.

DAFS provides batch I/O operations for data transfer. In the data transfer operations discussed previously, a DAFS client sends a single I/O request in each operation. In a batch I/O operation, however, a DAFS client can send a set of I/O requests at the same time. If a batch I/O operation is synchronous, a DAFS server should perform these I/O operations immediately and send the results back to the client in one response message. If a batch I/O operation is asynchronous, the DAFS client can set a time recommending when the server should finish the I/O operations. After a certain number of I/O operations are done, the server will send an acknowledgement back to the client through a back-control channel. When all the I/O operations are completed, the server will send the results in one response message back to the client through the operation channel. The operations related to batch I/O are:

• DAFS_PROC_BATCH_SUBMIT Submit a batch of I/O requests to the server.
• DAFS_PROC_BC_BATCH_COMPLETION Notify the client of completed batch I/O operations.

For meta-data transfer, DAFS provides 8 operations:

• DAFS_PROC_GETATTR_INLINE Get attributes of an object.
• DAFS_PROC_SETATTR_INLINE Set attributes of an object.
• DAFS_PROC_READDIR_INLINE Read the contents of a directory.
• DAFS_PROC_READLINK_INLINE Read the contents of a symbolic link.
• DAFS_PROC_GETATTR_DIRECT Get attributes of an object using RDMA operations.
• DAFS_PROC_SETATTR_DIRECT Set attributes of an object using RDMA operations.
• DAFS_PROC_READDIR_DIRECT Read the contents of a directory using RDMA operations.
• DAFS_PROC_READLINK_DIRECT Read the contents of a symbolic link using RDMA operations.

DAFS Chaining

The motivation behind chaining in DAFS is to reduce the number of round-trip messages.
Although the latencies in a VI network are usually low, the benefit of reducing round-trip messages is still considerable. Chaining in DAFS is similar to the COMPOUND procedure in NFS Version 4 [9], but they are not the same. In NFS Version 4, a COMPOUND procedure is used to combine more than one request into a single RPC request. The requests in a compound request lose their own identities, since there is only one request. In this sense, the COMPOUND procedure is similar to the batch I/O operation in DAFS. Requests in a chain, though, keep their own identities - the number of requests is unchanged whether or not they are in a chain. What chaining does is to set a chaining flag field in each request header before a DAFS client sends the requests to a server. When the server gets a request, it knows whether the request is in a chain or not by checking the chaining flag field.

2.2.3 Message Formats

All DAFS messages are either request messages or response messages. A request message starts with a request header, followed by a procedure-specific component that contains the arguments of the request. Similarly, a response message starts with a response header, followed by a procedure-specific component that contains the result of the operation. All fields in a DAFS message are encoded according to certain rules.

Request/Response Header

The DAFS request header data structure has the following fields:

• Header_magic is always a magic sequence - "D" "A" "F" "S" - which is used to identify DAFS packets in the network.
• Protocol_version is used to specify the supported protocol version.
• Desired_nreq is used for flow control.
• Chain_flags is for chaining in DAFS, as we discussed before.
• Analyzer is opaque to a DAFS server, which simply returns it in the response message.
• Stream_id and Seq_number are used to identify requests in a server response cache.
• Message_checksum is an optional checksum for data transfer.
• Cred_handle is used for authentication in DAFS, as we discussed before.
• Procedure is the procedure number for this request.
• Request_len is the length of the request message.

The DAFS response header data structure has the following fields:

• Header_magic is always a magic sequence - "D" "A" "F" "S" - which is used to identify DAFS packets in the network.
• Protocol_version is used to specify the supported protocol version.
• Target_nreq is used for flow control.
• Chain_flags is for chaining in DAFS, as we discussed before.
• Spec_cond is a flag for special conditions that a DAFS client should note and act upon.
• Analyzer is opaque to a DAFS server, which simply returns it in the response message.
• Stream_id and Seq_number are used to identify requests in a server response cache.
• Message_checksum is an optional checksum for data transfer.
• Status is the result code for the operation.
• Response_len is the length of the response message.
• Reserved is reserved for future use.

Procedure-specific Components

For a request message in DAFS, the procedure-specific components are the arguments of the file operation request; for a response message, they are the result of the operation. As the name suggests, different file operations have different components.

Encoding

A DAFS message includes two parts: all fixed-size fields are put in the first part of the message, while variable-sized fields are put in the second part. Each variable-sized field has an entry in the first part that specifies its actual offset into the second part.
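A minimal sketch of this layout for one variable-sized field may help (Python; the helper name is hypothetical, but the rule it implements - a 4-byte length prefix, then the data, padded so the overall field is a multiple of 8 bytes - is the counted-array rule from the DAFS encoding description; the byte order chosen here is an assumption):

```python
import struct

def encode_variable_field(data: bytes) -> bytes:
    """Encode one variable-sized field as a DAFS-style counted array:
    the first 4 bytes hold the length of the actual data, followed by
    the data itself, zero-padded so the whole field is a multiple of
    8 bytes in size."""
    body = struct.pack(">I", len(data)) + data  # big-endian length is assumed
    pad = (-len(body)) % 8
    return body + b"\x00" * pad
```

For example, a 5-byte payload yields a 16-byte field: 4 bytes of length, 5 bytes of data, and 7 bytes of padding.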
The variable-sized field in the second part is encoded as a counted array: the first 4 bytes represent the length of the data in this field, while the following bytes contain the actual data and padding to make the overall field a multiple of 8 bytes in size. Figure 2-3 shows how a message is encoded in DAFS.

Figure 2-3 Message coding in DAFS (a fixed-size section holding the fixed-size fields and the offsets of variable-sized fields, followed by a variable-size section holding each variable field's size and data)

Chapter 3 File Locking

File locking is important in both local file systems and network file systems. In a local file system, the operating system uses file locks to guarantee that multiple processes access the same file concurrently and correctly. In a network file system, a file server manages all file locks so that file operations originating from different clients are synchronized and the correct responses are returned to these clients.

File locking in a network file system is complicated compared to that of a local file system because of the existence of the network. Unlike local file locks, whose users and managers are in the same system, file locks in a network file system usually involve multiple clients and file servers on different machines. A file server needs to keep more information about a lock, since the lock can now be held by different processes on different machines. Usually some shorthand references are provided to make the exchange of this information more efficient. As all kinds of failures can occur in the network, special mechanisms are provided to tell which locks are still valid. In this chapter, we will discuss the file locking mechanisms used in the Linux local file system (ext2fs), NFS Version 4, and DAFS.

3.1 Linux File Locking

File locking in a local file system is implemented by system calls.
In traditional Unix systems, there are two system calls related to file locking, flock() and fcntl(), and thus two types of file locks created by these system calls: BSD locks and POSIX locks, respectively. BSD locks can only lock a whole file, while POSIX locks can lock any region of a file. Both kinds of locks can be advisory or mandatory [10].

Linux supports both BSD and POSIX locks. These two types of locks can coexist safely; neither one has any effect on the other. In the Linux kernel, the file locking data structure includes the following fields:

• fl_next: Each inode in Linux has a lock list that maintains all the locks placed on the file. fl_next is used for this singly linked list.
• fl_nextlink, fl_prevlink: Linux also puts all locks in a doubly linked list. fl_nextlink and fl_prevlink are used to maintain this list.
• fl_nextblock, fl_prevblock: When a process tries to get a lock held by another process, it is put into a circular list of blocked processes. fl_nextblock and fl_prevblock are used to maintain this list.
• fl_owner is used to identify a lock owner.
• fl_pid is the ID of the process that holds this lock.
• fl_file is used to identify the file in the Linux kernel that holds this lock.
• fl_flags is set to FL_POSIX or FL_FLOCK.
• fl_type is one of three values - F_RDLCK, F_WRLCK and F_UNLCK - which stand for requesting a read lock, requesting a write lock, and releasing a lock, respectively.
• fl_start and fl_end are used to specify the byte range of this lock.

A BSD lock can be regarded as a special POSIX lock whose range covers the whole file. As both NFS locking and DAFS locking require locking a range of a file, the following discussion will refer only to POSIX locks. As an example, a set of POSIX locks on a file in the Linux kernel is shown in Figure 3-1 [11].
Figure 3-1 POSIX locks in the Linux kernel (three locks hang off one inode's lock list and are linked into the kernel-wide doubly linked list: process 1 holds range 1..3, process 2 holds range 7..12, and process 2 is blocked waiting on range 2..4)

An implementation of a POSIX locking mechanism should be able to detect potential deadlocks. For example, in Figure 3-1 we have two processes holding locks on the same file. Process 1 has a lock on range 1 to 3, while process 2 has a lock on range 7 to 12. Process 2 is also trying to get a lock on range 2 to 4, but because of the active lock held by process 1, its request is blocked and it has to sleep until that active lock is released. If, at this time, process 1 asks for a lock on range 7 to 9 and this request is accepted, a deadlock appears, as processes 1 and 2 are each waiting for the other. A locking mechanism should detect this potential deadlock and reject the latter request made by process 1.

When a process' lock request is accepted, after the deadlock detection step, the locking mechanism already knows whether the requested lock is blocked by an active lock held by other processes. If it is not, the locking mechanism checks whether the requested lock overlaps with any existing locks held by the process itself. File locks in the Linux kernel are either read or write locks. If a new lock is of the same type as an existing lock, the handling of the overlap is quite simple: the two locks are combined into one. If they are of different types, however, things are a bit more complicated. To clarify this process, the five possible overlap cases are shown in Figure 3-2.
Figure 3-2 Lock overlapping for new lock request

Overlapping also happens when a process makes an unlock request, as shown in Figure 3-3. Since an unlock request is made to release an existing lock, the lock type specified in the request should be the same as that of the existing lock; otherwise the unlock request is rejected by the locking mechanism.

Figure 3-3 Lock overlapping for unlock request

We have the following five lock overlapping cases (see both Figure 3-2 and Figure 3-3):

1. The lock range in the new request is the same as that of an existing lock. For a lock request, the new lock replaces the existing lock. For an unlock request, the existing lock is released.
2. The lock range in the new request is a subset of an existing lock. For a lock request, the existing lock is broken into three locks, and the overlapping piece is replaced with the new lock. For an unlock request, the existing lock is broken into two locks, since the overlapping piece is released.
3. The lock range in the new request is a superset of an existing lock. This case is handled like the first case.
4. The lock range in the new request extends past the end of an existing lock. For a lock request, the existing lock is shortened and the new lock is added. For an unlock request, the existing lock is shortened.
5. The lock range in the new request extends into the beginning of an existing lock. This case is handled like the fourth case.

3.2 NFS Locking

File locking is not specified in the NFS version 2 or 3 protocols, but in two ancillary protocols related to NFS - the Network Lock Manager (NLM) and Network Status Monitor (NSM) protocols. In previous versions of NFS, an NFS server is supposed to be a stateless server that keeps no information about the status of its clients. Supporting file locking implies that an NFS server has to be a stateful server, so there is no file locking support specified in the NFS version 2 and 3 protocols.
However, file locking is essential in most environments, so the NLM and NSM protocols were proposed later to provide an additional file locking service for NFS. The protocol design is improved in NFS version 4: it assumes that an NFS server keeps status information about its clients, and it specifies file locking support in the protocol itself, so no additional protocol like NLM or NSM is required in NFS Version 4.

3.2.1 Clientid & Stateid

A lock holder is identified by a process ID in a local file system. In a network file system, however, things are different, since lock holders can be processes running on different NFS clients. To identify an NFS client, NFS version 4 uses the NFS_Client_ID data structure, which includes the following fields:

• Verifier is used to detect client reboots.
• Id is used to uniquely identify a client. This may be an IP address, for example.

Before requesting a lock, an NFS client needs to identify itself to an NFS server using the NFS_Client_ID described above. If this NFS_Client_ID is accepted, the server returns a unique 64-bit clientid, which is a shorthand reference to the NFS_Client_ID. This clientid is opaque to the NFS client, and from then on the NFS client uses the clientid as its identification in subsequent file operations.

NFS uses the nfs_lockowner data structure to identify a lock owner; its fields include the clientid and others such as a process ID. When an NFS client wants to open a file, it specifies a lock owner in the request using nfs_lockowner. If the operation succeeds, the server returns a 64-bit stateid data structure as a shorthand reference to the state information of this lock owner. The stateid is opaque to an NFS client, too. An NFS server, however, depends on the stateid to identify different lock owners. The stateid can be formed in any way, as long as the server can detect invalid and out-of-date lock owners.
For example, the stateid can have the following fields:

• A server verifier used to identify a particular server instantiation.
• An index used by the server as an entry into its locking-state table.
• A sequence value used to distinguish lock owners that share the same table index.

Having a stateid means that the NFS client has state built at the server. For subsequent operations like READ, WRITE, and LOCK, the client includes this state information in its requests. Combining the stateid with other information, a server can locate the right lock in its locking-state table. For those NFS clients that have not built any state but want to perform read or write operations, NFS version 4 provides two special stateids: the stateid of all bits 0 and the stateid of all bits 1. A server will serve a read or write request carrying the stateid of all bits 0 only if there are no conflicting locks. The lock checking procedure can be bypassed for a read request with a stateid of all bits 1, but not for a write request with a stateid of all bits 1.

3.2.2 Lease

In a network file system, a file lock held by a client becomes stale if that client has crashed or is unreachable. Such a lock should be detected and released by the server, since it prevents other clients from accessing the same file. To do this, NFS version 4 provides a lease for each lock. The idea is quite simple: once a client is granted a lock, it must tell the server within a certain period that it is still holding the lock - this is called lease renewal in NFS version 4. Otherwise, when the lease time expires, the server will assume that the client is dead and that all its locks are stale.

The implementation of lease renewal is important, since it affects the performance of the system. If each client sent a request to renew a lease in every lease period, the traffic in the network would increase significantly.
Since a server only needs to know that the client is still alive, in order to reduce the lease renewal overhead as much as possible NFS version 4 implicitly treats the following as lease renewal operations:

• An OPEN operation with a valid clientid.
• Any of the following operations with a valid stateid (not including the special stateids): CLOSE, DELEGRETURN, LOCK, LOCKU, OPEN, OPEN_CONFIRM, READ, RENEW, SETATTR, WRITE.

This approach makes lease renewal scale well. In the typical case the client does not need any extra RPC call for lease renewal; in the worst case, one RENEW request is required every lease period. In NFS Version 4, a lease renewal applies to all locks that a client currently holds, so a client only needs to perform one renewal per lease period.

3.2.3 Failure Recovery

The failures in a network file system include client crashes, server crashes, and network partitions. Each of these failures causes state on the server to become stale. To recover from a failure, it is important to know when the failure occurred and how to deal with it. Here we will discuss the recovery process for each of these kinds of failure.

As we know, a lease is used in NFS version 4 to detect a client crash. When any of a client's leases expires, the server will assume that client is dead. A client should use a new verifier in its client ID when it restarts; as a result, it will get a new clientid from the server. If a client restarts within a lease period and asks for a new clientid, the server also regards it as a new client. The failure recovery for a client crash is thus quite simple - the server releases all locking state related to that client, either right away or when a conflicting lock request is received.

A server crash usually means losing the locking state held in memory, so to support recovery the server should keep that state on some kind of stable storage.
Once a server restarts, the clients are given some time, known as the "grace period", to reclaim the locks they previously held. During the grace period, the server rejects all READ and WRITE operations, and all non-reclaim locking requests that conflict with an existing lock, until that lock is reclaimed. Clients that send such requests during the grace period get an NFS4ERR_GRACE error. If locks are not reclaimed within the grace period, the server assumes that the clients holding them are dead and releases the locks. Once a client's locking state is released, it cannot use its previous clientid and stateid in subsequent requests; otherwise the server will return NFS4ERR_STALE_CLIENTID and NFS4ERR_STALE_STATEID, respectively.

If a network partition occurs and persists beyond the lease period, the server cannot receive lease renewals from the clients within the lease period and will thus release all their locking state. Once the network is back, READ, WRITE and other state-related requests sent by a client are rejected, since its clientid and stateid are no longer valid, and the client gets NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID errors for those requests.

3.3 DAFS Locking

DAFS locking provides two new kinds of locks compared to NFS locking: PERSIST locks and AUTORECOVERY locks. PERSIST locks survive failures such as client crashes, server crashes and network partitions, while AUTORECOVERY locks provide a rollback capability. These two kinds of locks are provided to improve failure recovery in DAFS.

3.3.1 PERSIST Locks

Unlike ordinary locks, which are released by the server when a certain failure event happens, a PERSIST lock is kept by the server until it is reclaimed. To get a PERSIST lock, a client specifies the PERSIST lock type in the LOCK request. If a client crashes after receiving a PERSIST lock and does not come back up within the lease period, that PERSIST lock becomes broken.
At this point, the server has released all locking state for this client except the PERSIST locks it holds. Another client, or a new instance of this client, can reclaim those broken PERSIST locks and hold them. To reclaim a broken PERSIST lock, a client sends a LOCK request with the RECLAIM option set.

When a PERSIST lock is broken, it acts like a normal lock with respect to conflicting read and write operation requests: those operations get errors like DAFS_ERR_LOCKED or DAFS_ERR_STALE_STATEID. If a client tries to acquire a conflicting lock or to get information about the lock, it gets the DAFS_ERR_BROKEN error instead.

3.3.2 AUTORECOVERY Locks

AUTORECOVERY locks provide a limited UNDO, or rollback, recovery service. To get an AUTORECOVERY lock, a client sets the AUTORECOVERY option in the LOCK request.

If no failure event ever happens, AUTORECOVERY locks behave just like ordinary locks. If there is a failure event, however, a DAFS server guarantees that all modifications made to a file under the protection of AUTORECOVERY locks are undone before the locks are released. A DAFS client can also forcibly roll back an AUTORECOVERY lock by setting the ABORT option in a lock request. The recovery service is limited in that it is only provided for the AUTORECOVERY locks that exist when a failure occurs; for AUTORECOVERY locks completed before the failure event, and for other locks, the server provides no recovery service.

If a client sets both the PERSIST and AUTORECOVERY options in a LOCK request, the rollback associated with the lock is delayed when a failure event occurs. When the client restarts, it can reclaim the lock and continue its operation, or force the rollback.
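The failure-time behaviour of these lock types can be summarized with a small model (Python, purely illustrative; the function, field names and state strings are assumptions for the sketch, not DAFS protocol elements):

```python
# Illustrative model of what a DAFS server does with one lock when the
# client holding it fails. All names here are invented for the sketch.

def handle_client_failure(lock):
    """Apply the failure semantics described above to a single lock,
    represented as a dict with optional 'persist'/'autorecovery' flags."""
    if lock.get("persist") and lock.get("autorecovery"):
        lock["state"] = "broken"      # kept so it can be reclaimed later...
        lock["rollback"] = "delayed"  # ...with rollback deferred to reclaim/abort
    elif lock.get("persist"):
        lock["state"] = "broken"      # kept until another client reclaims it
    elif lock.get("autorecovery"):
        lock["rollback"] = "done"     # undo the protected modifications first...
        lock["state"] = "released"    # ...then release the lock
    else:
        lock["state"] = "released"    # ordinary lock: simply released
    return lock
```

The four branches correspond directly to the combinations of PERSIST and AUTORECOVERY discussed in this section.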
Chapter 4 Design & Implementation

The objective of this thesis work is to develop a DAFS server that provides basic file operation services and file locking services. Our design builds a DAFS layer on top of the VFS layer. This DAFS layer has all the functions necessary for providing the DAFS service, so we do not need to change the VFS layer. This layered design is clean and makes our implementation and testing relatively easy. The development of the DAFS server raises quite a few interesting issues, such as how to manage the memory resources allocated and registered to VI connections, the relationship between DAFS locking and POSIX locking in the Linux kernel, and different designs for AUTORECOVERY locks. The following discussion covers all of these issues.

4.1 User Level vs. Kernel Level

A DAFS server can be implemented in user space or in kernel space. The two implementations differ in various aspects. Generally speaking, it is easier to test and debug a user level application. However, since user level applications are built upon system APIs, it is sometimes difficult to implement certain functions in a user level application. For example, it is difficult to create a workable file handle in a user level DAFS server: if a file handle includes a device number and an inode number, the user level DAFS server has to convert this information to a file name. We could do this by adding a new system call, but that means changing the Linux kernel as well. A kernel DAFS server can get any information it wants from the kernel, but this also means it is easy to crash the machine if anything goes wrong in the server.

Since every system call made by a user application involves a context switch, a user level DAFS server has extra overhead compared to its counterpart in kernel space. Usually there is also an extra memory copy from user space to kernel space in a user level server such as an NFS server.
But this memory copy is eliminated in a DAFS server, thanks to the direct application access capability of the VI technology. So the performance of a kernel DAFS server should be somewhat better than that of a user level DAFS server, but the difference is not dramatic.

We first built a user level DAFS server that implements some basic file operation requests; then, before implementing advanced file operations, we ported the DAFS server to the Linux kernel. Figure 4-1 shows the components in a user level DAFS client and a kernel level DAFS server. Note that the DAFS server is built on the cLAN Kernel VI Provider Library (KVIPL), which talks to the cLAN device driver directly.

Figure 4-1 DAFS client & DAFS server (client side: DAFS client application and DAFS client library over the cLAN VIPL in user space; server side: DAFS server over the cLAN KVIPL in the Linux kernel; both sit on the cLAN device driver and cLAN host adapter)

4.2 KVIPL

While the cLAN Virtual Interface Provider Library (VIPL) is provided for user level applications, KVIPL is provided for kernel processes. As we know, a device is regarded as a special file in Linux. A device file usually provides an ioctl (short for input/output control) function through which user level applications can act on the device. Depending on the type of operation that an application wants to perform on the device, ioctl calls different functions in the device driver. The main difference between VIPL and KVIPL is that VIPL calls ioctl to interact with the device, while KVIPL calls the functions in the device driver directly.

Emulex provides a complete VIPL for its cLAN host adapter; however, the KVIPL used by its LAN emulation driver is not complete. Quite a few functions related to connection establishment, completion queue management and data transfer are missing. These functions are essential for developing a kernel DAFS server.
We developed the following functions in KVIPL:

• KvipCreatePtag - create a tag for memory protection.
• KvipDestroyPtag - destroy a memory protection tag.
• KvipConnectWait - wait for a connection request from another VI endpoint.
• KvipConnectAccept - accept a connection request from another VI endpoint.
• KvipConnectReject - reject a connection request from another VI endpoint.
• KvipDisconnect - release a previously established VI connection.
• KvipQueryVi - query the status of a VI descriptor.
• KvipCQWait - check whether any completed VI descriptors are linked to a completion queue.
• KvipSendWait - post a send VI descriptor on a VI and return when the data transfer is finished.
• KvipRecvWait - post a receive VI descriptor on a VI and return when the data transfer is finished.
• KvipInitSignals - initialize a cLAN fast signal.
• KvipSendFastSignal - send a fast signal over a VI connection.
• KvipWaitForSignal - wait for fast signals arriving over a VI connection.

KvipCQWait, KvipSendWait and KvipRecvWait call the gni_wait function in the cLAN driver, which waits for completion on a VI or a completion queue. In short, the logic in gni_wait is as follows: the process first checks the status of the VI descriptor or completion queue; if the status is "not complete", the schedule() function in the Linux kernel is called so the process can go to sleep. When the process runs again, it repeats the previous step until the status is "complete" or it times out. KvipWaitForSignal calls the gni_sigwait function, which implements a similar function for fast signals in cLAN. The gni_sigwait function, however, uses a different approach: the process calls the interruptible_sleep_on function to go to sleep, and when an interrupt is received it is woken up by the wake_up_interruptible call in the interrupt handling routine.
The problem is that it is possible for the wake_up_interruptible function to be called before the process goes to sleep, in which case the process will never be woken up. To applications built on KVIPL or VIPL, fast signals simply get lost in some cases. This bug in the cLAN driver is fixed by having the gni_sigwait function call the schedule function instead of the interruptible_sleep_on function to go to sleep, just as the gni_wait function does.

4.3 VI Management

VIs are expensive, since the memory buffers used by VI descriptors have to be registered. What this registration process actually does is lock the allocated memory pages so they will not be swapped out by the Linux kernel. The registration process is time consuming, since it requires the cLAN driver to maintain a table for memory mapping, so it is usually not wise to allocate and register these memory buffers dynamically. In our implementation, before the server accepts a VI connection, it pre-allocates and registers all memory buffers needed for that connection; later data transfers go through these pre-registered memory buffers.

This design implies that a DAFS server should detect dead VI connections from time to time, so that the memory resources allocated to them can be reused. In our implementation, whenever the DAFS server receives a VI connection request, it tests the currently existing VI connections. If there is a dead VI connection, the memory resources allocated to it are released and used for the new connection. If all VI connections are alive and there are still available memory resources, the new VI connection is accepted after the necessary memory buffers are allocated and registered. Otherwise, the VI connection is rejected.
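The acceptance policy just described can be modeled with a short sketch (Python; the class, its fields and its fixed buffer pool are illustrative assumptions, not the actual kernel server code):

```python
class ViConnectionManager:
    """Toy model of the VI connection policy above: reap dead connections
    first so their pre-registered buffers can be reused, then accept a new
    connection only if enough buffer resources remain."""

    def __init__(self, max_buffers, buffers_per_conn):
        self.free_buffers = max_buffers
        self.per_conn = buffers_per_conn
        self.conns = []  # each connection modeled as {"alive": bool, ...}

    def accept(self, peer):
        # Release resources held by dead connections.
        for conn in [c for c in self.conns if not c["alive"]]:
            self.conns.remove(conn)
            self.free_buffers += self.per_conn
        # Accept only if buffers can be allocated and registered.
        if self.free_buffers >= self.per_conn:
            self.free_buffers -= self.per_conn
            conn = {"alive": True, "peer": peer}
            self.conns.append(conn)
            return conn
        return None  # all connections alive and no resources left: reject
```

The real server performs the liveness test with KvipQueryVi, which, as noted below, costs only a few CPU cycles per connection.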
As the KvipQueryVi function used to test the status of a VI connection costs only a few CPU cycles, and as the number of VI connections a DAFS server handles is usually small because of the local file sharing environment, this design proves to be efficient.

In our implementation, all VI receive descriptors posted on the work queues are linked to a completion queue, and a thread in the DAFS server polls the state of the completion queue. Whenever one of these VI receive descriptors completes, a notification is placed in the completion queue. The polling thread detects this and removes the completed VI receive descriptor from its work queue. Figure 4-2 shows the relationship between the work queues and the completion queue.

Figure 4-2 Work Queue & Completion Queue

4.4 DAFS & VFS

Like NFS, DAFS is built on top of VFS. To perform file operations like READ and WRITE, DAFS calls the functions in the file_operations data structure provided by VFS. For directory operations like LOOKUP, DAFS uses the inode cache and dentry cache provided by VFS. VFS, eventually, calls the local file system to do the actual file operations. Figure 4-3 shows the relationship between DAFS, VFS, and a local file system such as the ext2 file system.

Figure 4-3 DAFS & VFS & Local File System (the DAFS server sits on VFS and its inode and dentry caches, which in turn sit on the local file system, e.g. ext2)

As in NFS, using the VFS interface in DAFS implies an extra memory copy, from the page cache to the VI memory in the DAFS server, for READ operations. This memory copy is not avoidable in the current implementation, for two reasons. First, VI requires that the memory it uses be registered; avoiding the memory copy would mean pre-registering the memory used for the page cache, which is not realistic because of the huge volume of memory involved.
Second, even if all memory were pre-registered and thus usable, VI requires that the memory used for a data transfer be address-contiguous, which is not true of the page cache. Similarly, for WRITE operations this memory copy is still unavoidable.

In a user level DAFS server, the VI memory is in user space, so the memory copy is performed between user space and kernel space (the page cache). In all other aspects, the process is just the same.

4.5 DAFS Operations

The following file operations are implemented in the current DAFS server:

• OPEN - Open a file.
• CLOSE - Close a file.
• GET_ROOT_HANDLE - Get the root file handle of the file system that the DAFS server exports.
• LOOKUP - Look up a file object given its name.
• LOOKUPP - Look up the parent directory.
• GET_FSATTR - Get the attributes of the file system where a file handle resides.
• GET_ATTR_INLINE - Get the attributes of an object; the attributes are returned inline.
• GET_ATTR_DIRECT - Get the attributes of an object; the attributes are written into client memory buffers directly using RDMA write.
• READDIR_INLINE - Read the contents of a directory; the contents are returned inline.
• READDIR_DIRECT - Read the contents of a directory; the contents are written into client memory buffers directly using RDMA write.
• READ_INLINE - Read data from a file; the data transfer is done inline.
• READ_DIRECT - Read data from a file; the data is written into client memory buffers directly using RDMA write.
• WRITE_INLINE - Write data to a file; the data transfer is done inline.
• LOCK - Create a lock on a file.
• LOCKT - Test a file lock.
• LOCKU - Remove an existing lock on a file.

The following operations related to session management are also implemented:

• CONNECT_AUTH - Create a DAFS session and authenticate the client.
• CONNECT_BIND - Bind a new VI connection to an existing DAFS connection.
• DISCONNECT - Terminate a DAFS session.
• REGISTER_CRED - Register the credentials that a DAFS client will use in the following operations.

In a DAFS server, the processing of each request from DAFS clients proceeds as follows: whenever the completion queue gets a completion notification, the server searches the VI connection list to find which connection has the completed VI descriptor. The server then removes this completed VI descriptor from the corresponding work queue. Using the completed VI descriptor, the DAFS server gets the data buffer that holds the received request message. After parsing the request message, the DAFS server knows which operation is requested and calls the right function to handle it; the response message is sent back after the function call completes.

4.6 DAFS Locking & POSIX Locking

As mentioned in the previous chapter, there are two new types of locks in DAFS: PERSIST locks and AUTORECOVERY locks. For the other types of locks, we can use POSIX locking in the Linux kernel to implement locking operations. The two new lock types require some special mechanisms in DAFS. We build these special mechanisms on top of POSIX locking, as Figure 4-4 shows.

Figure 4-4 DAFS Locking & POSIX Locking

In this design, a DAFS server handles basic lock requests by using the POSIX locking mechanism in the Linux kernel. For the special lock requests, some pre-handling is required before they are passed to the POSIX locking mechanism. An alternative approach is to integrate DAFS locking into the POSIX locking layer; however, as the overall design aims to build a DAFS layer separate from the VFS layer, we chose the first design in our implementation. The file_lock data structure in the Linux kernel already includes enough information for a basic lock in DAFS.
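A DAFS server accepts a basic lock request only if it does not conflict with an existing POSIX lock. The standard POSIX record-locking rule (two locks conflict when their byte ranges overlap, they belong to different owners, and at least one is a write lock) can be modeled with a minimal sketch; the types and names here are our own simplified model, not the kernel's posix_test_lock code.

```c
#include <assert.h>

enum { RD_LOCK, WR_LOCK };

/* simplified record lock: an owner, a type, and a byte range */
struct lk { int owner; int type; long start, end; };

/* POSIX conflict rule: overlapping ranges, different owners,
 * and at least one of the two locks is a write lock */
int lk_conflict(const struct lk *a, const struct lk *b) {
    if (a->owner == b->owner) return 0;                    /* same owner never conflicts */
    if (a->end < b->start || b->end < a->start) return 0;  /* disjoint ranges */
    return a->type == WR_LOCK || b->type == WR_LOCK;
}
```

A server-side test would walk the existing locks on the inode and reject the request as soon as this predicate returns true for any of them.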
To implement DAFS locking in a DAFS server, another data structure, dafs_palock, is used, which has the following fields:

• type: specifies whether this lock is a PERSIST lock, an AUTORECOVERY lock, or both.
• lock_type: specifies whether this lock is a READ lock or a WRITE lock.
• vfs_file: copied from the VFS layer.
• dev, ino, generation: these three attributes identify an inode in the Linux kernel.
• start, end: specify the lock range.
• wbuffers: used for AUTORECOVERY locks.
• clientid_len, clientid: used for PERSIST locks.
• The final two fields are defined as two doubly linked lists.

While a basic lock in a DAFS server owns a file_lock data structure in the Linux kernel, an AUTORECOVERY lock or a PERSIST lock owns a dafs_palock data structure in DAFS in addition to a file_lock data structure in the Linux kernel. We let these two special lock types hold POSIX locking status in the Linux kernel so that they are visible to other applications running on the server. This also has an additional benefit: it makes it easier to find the locks in the Linux kernel that conflict with a new lock request. When the DAFS server gets a lock request from a DAFS client, it calls the posix_test_lock function to test whether the request conflicts with an existing POSIX lock in the Linux kernel. If there is no conflict, the request is accepted and a new lock is created for this DAFS client; otherwise the request is rejected. Since each lock in DAFS now has a POSIX lock in the Linux kernel, this checking process works regardless of whether the request is for a special lock or a basic lock.

4.7 Locking States

The locking states are used by a DAFS server to manage all locks held by the DAFS clients. To "remember" the locking states of different DAFS clients, a DAFS server needs to keep a lot of information.
The most important information includes client information like the clientid used to identify different clients, lock owner information used to identify different lock owners within a DAFS client, and information about the files that a DAFS client has opened.

To understand how a DAFS server maintains the locking status of DAFS clients, a brief description of some important data structures used in our implementation is necessary:

• dafs_client: identifies a DAFS client.
• dafs_lockowner: identifies a lock owner. A DAFS client can have more than one lock owner, so each dafs_client includes a linked list of dafs_lockowner.
• dafs_file: identifies a file in a DAFS server.
• dafs_openfile: when a DAFS client opens a file in a DAFS server, this data structure is added to a linked list in dafs_file.

More than one DAFS client may be connected to a DAFS server at the same time, and each DAFS client may have more than one lock owner and open many files. To manage all this state information efficiently, our implementation of the DAFS server makes wide use of hash tables and linked lists to facilitate searching, adding, and deleting the data structures related to state information. The relationships between these data structures are shown in Figure 4-5.

Figure 4-5 Data Structures used in a DAFS server

A new instance of dafs_client is created when a client connects to a server, while new instances of the other data structures (dafs_file, dafs_openfile, and dafs_lockowner) are created by the OPEN operation. As the figure shows, the dafs_openfile data structure is essential as it holds pointers to the other data structures, and the stateid field is also included in this data structure. As discussed above, each lock-related operation provides the value of stateid.
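The stateid-keyed lookup can be pictured with a toy hash table; the bucket count, the hash function, and the struct layout here are illustrative assumptions, not the thesis code.

```c
#include <assert.h>
#include <stddef.h>

#define OF_BUCKETS 16  /* arbitrary bucket count for the sketch */

/* reduced dafs_openfile: just the stateid key and a bucket chain link */
struct dafs_openfile { unsigned int stateid; struct dafs_openfile *hnext; };

static struct dafs_openfile *of_table[OF_BUCKETS];

static unsigned int of_hash(unsigned int stateid) { return stateid % OF_BUCKETS; }

/* link an open file at the head of its bucket chain */
void of_insert(struct dafs_openfile *of) {
    unsigned int h = of_hash(of->stateid);
    of->hnext = of_table[h];
    of_table[h] = of;
}

/* locate the dafs_openfile for a stateid carried in a lock request */
struct dafs_openfile *of_find(unsigned int stateid) {
    for (struct dafs_openfile *of = of_table[of_hash(stateid)]; of; of = of->hnext)
        if (of->stateid == stateid) return of;
    return NULL;
}
```

In the real server the entry would carry pointers to dafs_client, dafs_file, and dafs_lockowner; the lookup discipline is the same.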
When a DAFS server gets the request, it can locate the dafs_openfile structure in the hash table by checking the value of stateid. Then, using a pointer to the inode or to the dafs_palock data structure, the DAFS server can get the locks that a client is currently holding.

4.8 PERSIST Locks

In DAFS, a PERSIST lock survives failure events in a network file system. Usually, if a DAFS server regards a client as dead, either because the client actually crashed or because a network partition occurred, it releases all locks held by that client. However, if a DAFS client was holding PERSIST locks before it died, the DAFS server still needs to keep those broken PERSIST locks. The DAFS client can reclaim these locks when it restarts later; other clients can also reclaim the PERSIST locks left by a dead client as long as they provide the correct authentication information.

The dafs_palock data structure includes all the information that a DAFS client needs to reclaim a broken PERSIST lock. The combination of five attributes (dev, ino, generation, start, and end) uniquely identifies a lock. Other attributes like clientid and clientid_len can be used for authentication when necessary. When the DAFS server releases a client's locks, it moves the dafs_palock data structures representing that client's PERSIST locks into a hash table. The hash value is generated from those five attributes: dev, ino, generation, start, and end. When a client wants to reclaim these broken PERSIST locks, it provides the same attribute values. By using the hash table, the DAFS server can quickly find the PERSIST lock that the DAFS client wants to reclaim.

Another issue needs to be considered when a DAFS server releases locking states. If a DAFS server grants a PERSIST lock to a client, a POSIX lock is implicitly created. The owner of this POSIX lock is that client.
If the client dies and that PERSIST lock becomes broken, the DAFS server has to find a living lock owner for the POSIX lock. In our implementation, the DAFS server has a default lock owner for such POSIX locks. This also solves another issue: if another client wants to get a lock that conflicts with a broken PERSIST lock, the server can detect this by checking the lock owner of the POSIX lock related to the PERSIST lock. If the lock owner is the default lock owner, the DAFS server returns DAFSERR_LOCK_BROKEN to the client.

4.9 AUTORECOVERY Locks

To implement AUTORECOVERY locks in a DAFS server, we considered two designs, as shown in Figure 4-6.

Figure 4-6 Two designs for AUTORECOVERY Locks

As an example, assume in Figure 4-6 that the whole file is under the protection of an AUTORECOVERY lock. In the first design, if a DAFS server receives two WRITE operation requests from a DAFS client to modify parts 1 and 2 of the file, it does not write the data to the file immediately; instead it keeps the data in a linked list, so the file is untouched. When the server later receives a READ operation request, besides reading the data from the file, it also checks the linked list, which reflects the recent modifications to the file. Only when the client releases the AUTORECOVERY lock does the server commit the modifications to the file. If the lease of the AUTORECOVERY lock expires, the server simply releases the linked list, and the file remains untouched.

In the second design, a DAFS server commits the modification immediately when it receives a WRITE operation request from a client. However, at the same time, the server keeps the original data (parts 1 and 2) in a linked list. For the following READ operation requests, the server only needs to read the file, since the modifications have been committed.
When the client releases the AUTORECOVERY lock, the server discards all the data in this list, since all modifications have been committed. If the lease of the AUTORECOVERY lock expires, however, the server rewrites all the data in the linked list back to the file to undo the modifications made by the previous WRITE operation requests.

All file operations go through the buffer cache, so a READ or WRITE operation does not necessarily mean an actual I/O. The first design implies extra CPU cycles and a memory copy for each READ operation request, so it is suitable for a write-often server. The second design, however, implies an extra READ operation for every WRITE operation request, so it is ideal for a read-often server. The first design also has the potential drawback that the modifications are not visible to other applications running on the server until the AUTORECOVERY lock is released. We chose the second design in our implementation.

The dafs_wbuffer data structure is used in our implementation to hold the original data read from the file. It has the following fields:

• offset: specifies where the data starts in a file.
• count: the number of bytes of data.
• buffer: points to a buffer that stores the data.
• The final field defines a linked list of dafs_wbuffer.

Figure 4-7 shows dafs_wbuffer lists attached to AUTORECOVERY locks in a DAFS server.

Figure 4-7 Buffer Lists in a DAFS server

When a DAFS server receives a WRITE operation request that is under the protection of an AUTORECOVERY lock, a new dafs_wbuffer data structure is added to the head of the linked list. When the server does the rollback recovery, it begins the rewrite process from the head of the linked list. Once the data held in a dafs_wbuffer data structure is written to the file, that data structure is released.
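The write-and-save path and the rollback walk from the list head can be simulated in user space. The in-memory file array, the fixed-size buffer, and the function names below are our own stand-ins for the kernel code, which works against the buffer cache.

```c
#include <assert.h>
#include <string.h>

/* reduced dafs_wbuffer: where the saved bytes came from, how many,
 * the original data, and the undo-list link */
struct dafs_wbuffer {
    long offset;
    long count;
    char buffer[32];             /* fixed size for the sketch only */
    struct dafs_wbuffer *next;
};

/* WRITE under an AUTORECOVERY lock (second design): save the bytes
 * about to be overwritten at the head of the undo list, then commit */
void ar_write(char *file, struct dafs_wbuffer **undo, struct dafs_wbuffer *wb,
              long off, const char *data, long n) {
    wb->offset = off;
    wb->count = n;
    memcpy(wb->buffer, file + off, n);   /* keep the original data */
    wb->next = *undo;
    *undo = wb;                          /* add at the head of the list */
    memcpy(file + off, data, n);         /* commit immediately */
}

/* lease expired: rewrite the saved data starting from the list head,
 * undoing the writes in reverse order of arrival */
void ar_rollback(char *file, struct dafs_wbuffer *undo) {
    for (struct dafs_wbuffer *wb = undo; wb; wb = wb->next)
        memcpy(file + wb->offset, wb->buffer, wb->count);
}
```

Walking from the head first undoes the most recent write, so even overlapping writes within one session unwind back to the original file contents.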
When a DAFS client releases an AUTORECOVERY lock, it means that all write operations performed before are valid, and all dafs_wbuffer data structures belonging to this lock are simply released.

Chapter 5 Performance

This thesis considers three performance tests: VI performance tests, DAFS file operation performance tests, and tests of the file locking overhead in a DAFS server. These tests are performed on two Linux machines in a client/server architecture. Each machine includes a 1 GHz Pentium CPU with 512 MB ECC RDF memory and one cLAN 1000 host adapter.

5.1 VI Performance

The two most important performance features of a cLAN network are throughput and latency, and viptest can be used to measure them. Viptest is a utility program whose source code comes with the cLAN driver source code and is commonly used to test VI performance. The idea behind viptest is quite simple: a requester sends a message to a responder using the VI interface, and the responder sends an echo message back. Viptest counts the elapsed time between sending out messages and receiving echo messages, which is the latency of the cLAN network. Knowing the size of the messages, viptest can also compute the cLAN throughput.

For comparison, we also tested the throughput and latency of a Gigabit Ethernet network using NetTraf. NetTraf tests a Gigabit Ethernet network in a manner comparable to what viptest does on a cLAN network. Figure 5-1 and Figure 5-2 show the results.

Figure 5-1 cLAN vs. Gigabit Ethernet: Throughput

Figure 5-1 shows that when the message size is larger than 1 KB, the throughput of a cLAN network remains stable at 112 MB/sec, which is close to cLAN's one-way bandwidth (1.25 Gb/s).
The throughput of a Gigabit Ethernet remains at about 68 MB/sec once the packet size is larger than 32 KB.

Figure 5-2 cLAN vs. Gigabit Ethernet: Latency

The latency results shown in Figure 5-2 are consistent with Figure 5-1. We observed that with a small packet size, the latency in a Gigabit Ethernet network is 5 times larger than that in a cLAN network. This confirms our expectation that the effect of TCP/IP overhead is significant in small data transfers.

5.2 DAFS Performance

The DAFS performance tests focus on the read and write file operations. Both the throughput and the CPU load of the DAFS client are measured for each operation. Our tests are sequential I/O tests, which means that the DAFS client sends out a request only after the response to its previous request is received. To reduce the effects of other factors on the test results, all read and write operations are performed in memory, so there is no disk I/O operation involved. The following DAFS operations are tested for performance: READ_INLINE, READ_DIRECT, and WRITE_INLINE. For each of these operations, packet sizes varying from 4 KB to 32 KB are used in the tests, and for each test, 10 MB are sent over the network. Figure 5-3 shows the throughput of these DAFS operations, and Figure 5-4 shows the CPU load in a DAFS client.

Figure 5-3 DAFS Throughput

Figure 5-4 DAFS CPU load

As can be seen from Figure 5-3, the throughput increases as the packet size increases. Before sending out the next request, a DAFS client has to wait for the reply message of the previous request it sent.
This is the round trip delay in a DAFS system. Since a larger packet size means fewer round trip messages in a cLAN network, the throughput is therefore increased. This also applies to Figure 5-4, which shows that the CPU load decreases as the packet size increases. So, in order to get better performance, it is necessary to adopt a large packet size for VI communication. However, a large packet size means a bigger memory buffer capacity that has to be pre-registered for VI. Ultimately this is a trade-off between performance and the available resources.

Both figures show that the RDMA write capability of VI gives the READ_DIRECT operation the best performance in DAFS. When the packet size is 32 KB, the CPU load of a DAFS client for READ_DIRECT is only one third of that for READ_INLINE.

The CITI group at the University of Michigan has implemented NFS Version 4 on the Linux platform, and Fujitsu PST has tested the sequential I/O performance of this NFS Version 4 reference implementation on a cLAN network [3]. A comparison of those test results and the ones we obtained is offered here. Figure 5-5 shows that using the same hardware and the same platform, the read throughput in NFS Version 4 is about one-sixth of that in DAFS, while the write throughput in NFS Version 4 is about one-fifth of that in DAFS. These operations are performed from memory to memory, and there is also no disk I/O involved. The results confirm what was expected: bypassing the TCP/IP protocol stack and operating system involvement greatly improves the performance of a file system.

Figure 5-5 DAFS vs. NFS V4: Throughput

Figure 5-6 shows the CPU usage of an NFS Version 4 client and a DAFS client. It took about 150 milliseconds for an NFS Version 4 client to read a 10 MB file, while it only took 10 milliseconds for a DAFS client.
For writing a 10 MB file in DAFS, the CPU usage is about one-eleventh of that in NFS Version 4.

Figure 5-6 DAFS vs. NFS V4: CPU usage

We should note that both our tests and the NFS V4 tests are sequential I/O tests. We are interested in sequential I/O tests because they best reflect the actual performance difference between DAFS and NFS. The sequential I/O test results, however, are not necessarily the read and write throughput we would get at the client side. As we know, mechanisms like client caching, read-ahead, and delayed write are implemented in NFS clients to improve their throughput performance. These mechanisms are not involved in our tests, and we believe the throughput of a DAFS client could also be improved if these mechanisms were implemented in DAFS.

5.3 Overhead of DAFS Locks

From the previous discussion, the AUTORECOVERY locks in DAFS are the most expensive locks. Once a DAFS client gets an AUTORECOVERY lock, its write requests are under the protection of this lock, which usually means extra memory copies at the server side. To show the impact of this overhead on file system performance, the throughput of the WRITE_INLINE operation was measured both with and without the protection of AUTORECOVERY locks.

Figure 5-8 Throughput: With & Without Lock

Figure 5-8 shows that the overhead of AUTORECOVERY locks reduces the throughput of the WRITE_INLINE operation by about 20 percent. Obviously, when the data cache for AUTORECOVERY locks becomes large and there is not much memory available in the system, the performance of a DAFS server supporting AUTORECOVERY locks will become worse.

Chapter 6 Conclusion

DAFS is a new file system protocol proposed by the DAFS Collaborative. Taking advantage of new interconnection architectures like VI and InfiniBand, DAFS is designed to achieve the best performance in a NAS system.
This thesis contributes the design and development of a DAFS server in the Linux kernel. Most of the basic file operations in DAFS are implemented, and tests were run for these operations to show that DAFS has much better performance than NFS version 4. The file locking mechanism in DAFS was also implemented. The PERSIST locks in DAFS survive client crashes, and the AUTORECOVERY locks protect data integrity in the event of a failure. Our research shows that implementing DAFS on a Linux platform is both feasible and beneficial.

Building a new network file system in the Linux kernel is challenging work, especially a file system like DAFS that exploits VI technologies. Emulex's cLAN product has better performance than other VI products in today's market; however, this product is not well supported by Emulex. The lack of a complete KVIPL in cLAN means that extra work needs to be carried out to develop a kernel DAFS server, and later research shows that there are still some inaccuracies in the current cLAN driver. Unlike TCP, which provides flow control, failure detection, and correction mechanisms, VI requires that applications provide such mechanisms themselves. This complicates the building of a VI application. More work is still needed in the VI layer of the current DAFS implementation to make the server more robust.

The PERSIST locks in the current implementation don't survive server crashes. If a DAFS server crashes, all the locking states it keeps are lost. To make the PERSIST locks survive server crashes, a DAFS server should log locking states to hard disk periodically. Once the DAFS server restarts after a crash, it can rebuild these locking states from the log files on the hard disk, and thus the clients' PERSIST locks survive. DAFS inherits some file system features from NFSv4, like file system migration and replication. Implementation of these features is also part of the future work.
Bibliography

[1] http://www.dafscollaborative.org
[2] http://www.dafscollaborative.org/events/demos/NetappDAFS.pdf
[3] http://www.pst.fujitsu.com/english/dafsdemo/index.html
[4] http://www.cs.duke.edu/ari/dafs/
[5] http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
[6] http://www.citi.umich.edu/projects/nfsv4/index.html
[7] Compaq, Intel, Microsoft. Virtual Interface Architecture Specification, Version 1.0, December 16, 1997.
[8] Network Appliance. DAFS: Direct Access File System Protocol, Version 1.0, September 1, 2001.
[9] Spencer Shepler, Brent Callaghan, David Robinson, Robert Thurlow, Carl Beame, Michael Eisler, David Noveck. NFS Version 4 Protocol.
[10] Daniel P. Bovet, Marco Cesati. "Understanding the Linux Kernel". O'Reilly & Associates, Inc.
[11] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman. "The Design and Implementation of the 4.4BSD Operating System". Addison-Wesley Inc.

