Improving Disk Read Performance through Block-Level Replication into Free Space

by

Andrei Lifchits
B.Math, The University of Waterloo, 2005

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia
January, 2008
© Andrei Lifchits 2008

Abstract

Disk performance for random access is significantly worse than for sequential access. The time required to transfer random blocks to or from disk is dominated by seeking and rotational delay. To improve throughput and reduce latency, one can apply techniques that increase the sequentiality of disk accesses, such as block rearrangement and replication. We introduce an approach that improves read performance by replicating blocks into file system free space at the block level. This makes the replication module independent of the file system and therefore easier to implement and verify. A solution that requires no changes to the file system is also easier to adopt. Supporting a new file system is a matter of writing a user-space component that understands its free block data structures. We implemented a prototype as a stacked device driver for Linux and evaluated its performance on a number of workloads.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Related Work
   2.1 File System Layout
   2.2 Adaptive Block Rearrangement
   2.3 Replication
   2.4 Block Address Indirection
   2.5 I/O scheduling
   2.6 Utilization of Free Space
3 Design
   3.1 Architecture
   3.2 Replication
   3.3 Using Replicas
   3.4 Replica Reclamation
4 Implementation
   4.1 Block Layer
      4.1.1 Major Data Structures
      4.1.2 Request Completion Handler
      4.1.3 Operation
      4.1.4 Conflicting I/O
   4.2 Free Space Extractor
5 Evaluation
   5.1 Test Environment
   5.2 Best Case Scenario
      5.2.1 Multi-threaded workload
      5.2.2 Replica reclamation
   5.3 Patterns with Noise
   5.4 Overhead
   5.5 Git
   5.6 Summary
6 Future Work
7 Conclusion
Bibliography

List of Tables

4.1 Combinations of new and in-flight requests

List of Figures

1.1 Block group allocation in ext2fs
3.1 Example of multiple available replicas
3.2 Example of LRU replica ordering
3.3 Example of reclamation
5.1 Postmark iterations with and without replication
5.2 Postmark iterations on read-only file system
5.3 Multi-process replication performance
5.4 Postmark iterations on full disk
5.5 Noisy pattern-based workload
5.6 Performance of Bonnie++ benchmark (sequential)
5.7 Performance of Bonnie++ benchmark (random)
5.8 Git status command

Chapter 1

Introduction

Disk read performance is largely determined by how well the data layout matches actual disk access patterns. Due to the mechanical nature of a hard disk, random accesses are substantially slower than sequential ones; jumping from one location on disk to another introduces a seek delay while the disk head is repositioned to the new track, and a rotational delay as the correct sector rotates under the head. These two delays dominate the total service time in random-access I/O.
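To make the cost gap concrete, a back-of-the-envelope service-time comparison for a single 4 KB read can be computed from typical drive parameters; the numbers below are illustrative assumptions rather than measurements from this thesis.

    # Illustrative figures only (not measurements from this thesis).
    AVG_SEEK_MS = 9.0        # assumed average seek time
    RPM = 7200               # assumed spindle speed
    TRANSFER_MB_S = 60.0     # assumed sustained transfer rate
    BLOCK_KB = 4

    rotational_ms = 0.5 * (60_000 / RPM)                    # ~4.17 ms (half a revolution on average)
    transfer_ms = BLOCK_KB / 1024 / TRANSFER_MB_S * 1000    # ~0.065 ms for one 4 KB block

    random_ms = AVG_SEEK_MS + rotational_ms + transfer_ms   # ~13 ms per random block
    sequential_ms = transfer_ms                             # head already in position

    print(f"random read:     {random_ms:.2f} ms/block")
    print(f"sequential read: {sequential_ms:.3f} ms/block")
    print(f"ratio:           {random_ms / sequential_ms:.0f}x")

Under these assumptions a random 4 KB read costs roughly two orders of magnitude more than a sequential one, which is the gap the techniques in this thesis try to close.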
File systems try to minimize non-sequential accesses by allocating physically contiguous blocks for individual files, placing metadata close to the corresponding file data, and clustering files from the same directory together on disk. Despite the file system's efforts, the static layout will not always be optimal. The ext2 file system [19], for example, starts to scatter big files or heavily populated directories across the entire disk with its quadratic hash block allocation function when the original block group fills up, thus breaking the sequentiality of files or file blocks. Figure 1.1 illustrates this behaviour: it shows the number of blocks allocated (non-free) in each block group after we sequentially create 4 GB worth of files in an initially empty 80 GB ext2 partition. We can see that the file system started by placing files in the middle of the disk, but as the middle block groups filled, more and more files were scattered outside the center. Thus, some files that were created together ended up very far away from each other.

[Figure 1.1: Block group allocation in ext2fs containing one directory filled with 572,000 files distributed across 100 subdirectories and totalling 4 GB. The 80 GB file system has 596 block groups (32,768 blocks each). The plot shows blocks allocated per block group number.]

Shared libraries pose another challenge to data layout. It may be difficult to cluster all applications together with all the shared libraries they use even initially, let alone incrementally over a long period of time. Yet having application files far from library files will result in continuous seeking back and forth between multiple locations on disk [11]. Finally, copy-on-write systems such as Parallax [28] cannot by their nature ensure sequentiality of all blocks for all files at the same time, introducing seeks even for sequential file reads.

One solution to the problem of poor layout is to rearrange blocks on disk and move related blocks together. Disk defragmentation is one example of this approach. As mentioned above, file systems try to arrange logically consecutive file blocks sequentially on disk for the same reason. However, if a set of blocks is involved in multiple different access patterns, rearrangement will not work. Other problems require adaptive block rewriting to improve access locality. For example, as copies are modified in copy-on-write file systems, the files become more and more fragmented. The sequentiality of evolving copy-on-write systems can only be restored by dynamically replicating related blocks close together. Adaptive rewriting is also required if read patterns change or evolve with time.

One promising technique for optimizing the layout is to adaptively rewrite blocks on the read path when reads are detected to be inefficient. The new blocks can either replace or replicate the existing blocks. Adaptive rewriting is more effective than static rewriting when read patterns are unknown ahead of time or can change. Replicating the same blocks in a different order might not only minimize seek latency, but also reduce the rotational delay.

One way to dynamically replicate disk blocks is to modify the file system, as was done in FS2 [11]. The FS2 file system is an extension of ext2fs that introduces the notion of a "hot region" on disk where most of the disk activity is currently observed.
FS2 replicates requests for blocks outside of the hot region into free space within the current hot region, in the same order . This reduces seeks to blocks outside the hot region as well as removes rotational delay if the same access pattern reoccurs. In this thesis we take a different approach . We perform replication to the free space at the block level . The information necessary to adaptively regroup blocks, namely the file system read patterns, are available at the block level . We can also distinguish the I/O submitter thread at the block level so that we can separate independent request streams . Replicating at the block level has several advantages . First, it is a more convenient modularization of 3  Chapter 1 . Introduction the change ; it is simpler to implement because instead of injecting changes in the complex file system code, the changes are encapsulated in a separate module . Second, the solution is file system-agnostic and will work with any file system on top of it . The adoption of such a system is easier because the file system need not be tracked and patched with every new release . Finally, replication becomes transparent to the user and the module can be packaged together with the disk or the driver so long as there is a corresponding file-system-specific component available. There is a number of challenges associated with replicating at the block level . Working below the file system, we cannot know which blocks are free and which ones are live . In turn, the file system is ignorant of replication, and is free to write over valid replica blocks at any time . Being at the block level also means that cache hits will obscure the actual application request patterns. To validate the approach, we have built a prototype system in Linux in the form of a stacked device driver on top of a real block device . We have observed that workloads exhibiting repeating I/O patterns can substantially gain from block replication, while random workloads that do not benefit from replication can be prevented from incurring undue overheads .  4  Chapter 2 Related Work Various approaches can be taken to improve disk performance . A bigger cache would absorb more reads, and fast persistent memory such as flash memory can be used as disk write cache to speed up writes . Better I/O scheduling reduces the amount of seeking and the average latency. Careful static layout by the file system can greatly improve access times, while dynamic rearrangement or replication can adapt the layout to the workload's access patterns on the fly.  2 .1 File System Layout The Unix Fast File System (FFS) [15], and its descendants, the ext2 [19] and ext3 [27] file systems, take performance into consideration when managing disk layout . The FFS family file systems try to place metadata closer to the data blocks by subdividing the disk into equal-sized block groups and placing files and the corresponding file metadata in the same block group. Blocks containing indirect data block pointers are interleaved with the data blocks to avoid seeking . In general, file systems try to place file blocks sequentially and cluster files from the same directory together . If sequentiality of file blocks is broken, which can happen as the file system ages, disk defragmentation tools can restore it. To further improve static layout, Ganger et al [10] suggest embedding 5  Chapter 2. 
Related Work entire inodes into directory entries (which contain only the file name and inode number in ext2 and ext3 file systems) as well as grouping data blocks of small files from the same directory together as one unit. These approaches aim at optimizing the initial layout of the file system, but they cannot address performance problems when the access patterns differ from the expected behaviour, e .g ., when files are not read sequentially. The Sprite Log-based File System (LFS) [21] improves disk write performance by writing all blocks sequentially in a log and indexing and garbage collecting later . For operations involving small files, it can greatly improve the write throughput . However, when many files are large or when the file system is becomes more full, the advantages are reduced [25] . Moreover, most real-life workloads are read-intensive [11] and optimizing write performance may not be beneficial.  2 .2 Adaptive Block Rearrangement Adaptive block rearrangement has been explored by many researchers [22, 4, 3] . In [22], cylinders are shuffled at the end of each day with the most frequently used ones arranged using the organ pipe heuristic . Carson [4] compares cylinder shuffling with block shuffling using disk simulation and finds that though more resource-intensive, block shuffling is more effective. Akyurek and Salem [3] implement block shuffling, monitoring block usage and copying more frequently used blocks to a reserved area in the middle of disk, arranging them using the organ pipe heuristic . Subsequent reads and writes to hot blocks are routed to the copy in the center using the in-memory block map . As blocks cool off, they are evicted from the central area and new hot blocks are brought in . To save memory, only a fraction of all blocks 6  Chapter 2. Related Work are tracked for usage ; the less-frequently used blocks are ignored. This approach rearranges blocks, moving heavily-used blocks together. One potential problem that can arise is that moving blocks can break the sequentiality of files or directories . Replicating blocks gives us both statically and dynamically optimized block layouts . Also, the above approaches require reserving a portion of the disk, which is intrusive and visible to the user space . Replicating into free space presents no extra storage requirements and is completely transparent to the user.  2 .3 Replication Instead of shuffling the data on disk in response to the observed I/O patterns, accesses can be replicated . Hierarchical File System [1] and Smart File System [26] copy small files that are frequently used ("hot") to a reserved area on disk so that the disk head can spend most of its time in that area. Replicating at the block level has several advantages over this approach: metadata blocks can be moved as well as data blocks ; blocks are replicated in the access order, which reduces the rotational delay when the access pattern is encountered again ; working at the block granularity allows big files to benefit from replication as well as small files. One approach to reduce rotational delay is to replicate blocks in multiple locations within the same track (e .g ., one diametrically opposite of the other), as suggested in [29] . Once the head is positioned on the right track, the closest replica can be accessed, provided the exact rotational position of the disk is known or is accurately predicted . 
However, knowing exact rotational position (or the disk head position) is not possible outside the firmware without hardware support, and predicting such information 7  Chapter 2 . Related Work  reliably is not easy. FS2 replicates blocks into free space at the file system level . The disk is split into regions and requests are monitored to see which region is currently "hot", i .e ., where the disk head spends a disproportionate amount of time. Any read request for a block outside of the hot region is then copied into free blocks inside the hot region . When the hot region fills up, replication stops until the hot region changes to one that still has free blocks . When replicating, destination free blocks are actually allocated by the file system, so the disk space visibly diminishes as more replicas are created . When the free space reaches a low watermark, some replicas are discarded in batches until the free space increases to an acceptable level. As pointed out earlier, FS2 modifies the ext2 file system in order to have access to the free block information . Instead of modifying the file system, we work at the block level . This avoids changing the complex and mission-critical file system code, as well as avoiding the associated difficulty of correctness verification . Adoption also becomes easier when the stable file system code is not disturbed . It also permits us to replicate under any file system as long as we create a simple module to obtain free block information from its on-disk structures . Because we do not have a concept of hot regions, we avoid the problem of clogging a region while there still is more free space available . We simply replicate anywhere on disk while there is free space, and deal with the low free space situations by reclaiming old replicas . Not splitting the disk into regions also avoids the problem of having to estimate the seek penalty from logical block numbers, which may not be accurate if we are dealing with a device that is not a single disk .  8  Chapter 2 . Related Work  2 .4 Block Address Indirection Another approach to block layout optimization is to introduce a layer of indirection so that the logical block numbers do not directly specify physical locations ; blocks can then be shuffled around on disk without disturbing the upper layer by requiring updates to file system indices . In Logical Disk [6], this is done in software, which gives more flexibility for layout optimizations. Any scheme for block rearrangement can be implemented transparently to the file system during disk idle periods . In Loge [9], the block map resides in the disk controller, where more accurate physical information about the disk is available . It uses the current disk head position to determine the optimal location for the current write request, thus improving the write performance to near-sequential . Because read patterns will often follow write patterns, read performance will not necessarily suffer in this approach. These techniques can optimize the layout to match the application I/O patterns, but if there are multiple access patterns for the same set of blocks, optimizing for one may degrade the performance for others . Only replication would be able to optimize all patterns at the same time.  2 .5 I/O scheduling Disk throughput can be improved with better I/O scheduling, which strives to minimize disk head movement and average request latency . 
Much work has been done in this area : algorithms such as SSTF, SCAN [7], C-SCAN [18] and LOOK [16] aim to reduce seeking . Rotational delay has been addressed in [13, 20, 24, 12] . The default Anticipatory scheduler [14] in Linux introduces explicit delay before carrying out the requests in case more re-  9  Chapter 2 . Related Work quests for nearby blocks arrive in the meantime . I/O scheduling is orthogonal to block layout optimizations, and we can benefit from better scheduling whether it is done before or after replication.  2 .6 Utilization of Free Space Free disk space is a resource that goes unutilized most of the time . Because file systems are not usually full [8, 11], there is a non-trivial amount of space to be exploited, as was done in FS2. The Transparent File System (TFS) [5] provides unreliable contributory storage to large distributed applications such as SETI@home by harnessing users' free disk space . TFS strives to make contributory storage transparent not only in terms of capacity, but also performance, so that the bandwidth is not visibly reduced for the user's own applications . Unfortunately, it requires the user to first migrate to TFS. The Elephant File System [23] uses free space to store old versions of files automatically without any user intervention . A cleaner is used to remove less important versions of files, while keeping Landmark versions that cannot be discarded to free disk space. We use free space in a completely transparent manner : even good replicas are not guaranteed to persist, but the file system allocation behaviour is not affected by replication either .  10  Chapter 3  Design The continually increasing design complexity and competition pushes hard drive manufacturers to hide more and more information about the internals of their products . Early disks had straightforward geometry expressed in the number of cylinders, tracks per cylinder, and sectors per track . Having such simple and accurate low-level information facilitated layout optimizations and enabled I/O scheduling algorithms based on the precise knowledge of the disk head position at all times . However, modern hard drives do not have a simple organization, and sophisticated on-board electronics manages the disk operations, remapping block addresses transparently and even scheduling I/O on their own . These design trends make assumptions about disk layout and head position at best tenuous. Taking the above trends into account, we try to infer the minimum amount of information from logical block numbers . Specifically, we assume that sequential block numbers generally map to adjacent physical blocks (or at least to blocks that can be accessed together without a seek penalty) and that a relatively small difference between two block numbers usually implies a small seek penalty. If the difference between two block numbers is large, on the other hand, we make no assumptions about the cost of jumping between the two blocks . This makes our design also suitable for multi-disk storage systems where logical-to-physical mapping may not be linear .  11  Chapter 3 .  Design  Thus, we put the emphasis on replicating non-adjacent but temporally close accesses sequentially with the expectation that the same pattern of reads will repeat in the future, at which time we would be able to read it sequentially. Instead of trying to determine the "hot region" where the disk head spends the most time and replicating within the hot region, we replicate anywhere on disk . 
If a replicated pattern of n requests reoccurs, it allows us to replace up to n seeks (in the best case) with just 1 seek to the beginning of the replica chain . This approach also enables us to replicate patterns liberally without worrying about filling up the free space of only a small region on the disk.  3.1 Architecture The Linux block layer is found just below the Virtual File System, which is an abstraction decoupling specific file systems from the block layer . It includes an I/O scheduler and the drivers for individual devices . We introduce our module just below the Virtual File System, and hence we neither see any file system meta-information nor know which blocks on disk are free and which are live . A separate component in charge of periodically scanning the file system's free block structures supplies this information to the in-kernel replicator . Alternatively, a file-system-agnostic way of finding free blocks could be employed by allocating a big balloon file and writing into its blocks. When the file system frees a block, we do not know about it until we scan the file system's free block maps again . Delayed information about freed blocks reduces our ability to benefit from them, but does not compromise correctness . On the other hand, when the file system allocates a block and 12  Chapter 3 . Design  writes into it, we immediately see that the block is no longer free r . Even if the write to the newly-allocated block is buffered and we continue to use the block for replication (unaware of the allocation), correctness is not compromised because the file system write has not reached the disk and the new contents cannot be overwritten. I/O Scheduler We place our replicator above the I/O scheduler . Implementing replication upstream from the scheduler as opposed to below or within a scheduler is a trade-off . Not having to extend a scheduler has an obvious advantage of avoiding the difficulty of modifying complex code . Modern I/O schedulers employ highly tuned algorithms which would have to be perturbed by such a major feature . Being decoupled from a specific I/O scheduler also enables us to work with any scheduler that is appropriate for a particular workload. If we tried working below the scheduler (e .g ., in the device driver), we would not be able to see the natural order of requests by the applications and would only see the scheduled ordering from all processes merged together — which can fluctuate with time even if workload access patterns were highly regular due to non-deterministic task scheduling . Finally, being upstream from the scheduler means we can still benefit from the I/O scheduling ourselves. When we issue replication write requests to the underlying block device, we need not concern ourselves with manually scheduling and prioritizing the writes to avoid jumping back and forth between original blocks and the current replication region ; the scheduler will take care of that in the normal course of its operation. This assumes the file system does not issue writes into unallocated blocks, of course, which it has no reason to do .  13  Chapter 3 .  Design  At the same time, working above the I/O scheduler means that we need to make decisions based on incomplete or inaccurate information . In order to decide whether to use replicas or replicate original blocks, we need to consult the history of recent I/O requests, and this history will not be the actual sequence of request the driver will see ; the scheduler will delay, rearrange and merge some requests . 
It will store requests in its queues for an unknown amount of time before issuing them to the driver, which means that we cannot know whether a request has already started or not . We might also potentially do redundant work if we try to optimize a sequence of requests by replicating which could be trivially optimized by scheduling . In such cases, we could benefit from knowing the internal scheduler state, being aware of its queue contents and working symbiotically by replicating only those patterns that could not be sufficiently improved through scheduling alone . However, we judge that the benefits of replicating upstream from the I/O scheduler outweigh the problems. Prefetcher and Buffer Cache Working below the Linux Virtual File System means being below the readahead subsystem and the buffer cache, which are located above the block layer . This again is a trade-off . If we modified the read-ahead subsystem, we could prefetch non-sequential blocks that are part of the currently used replica chain instead of prefetching only sequential blocks and stopping readahead altogether when non-sequential access is detected . On the other hand, the prefetcher is a complex and evolving piece of software and decoupling from it simplifies the implementation . Read-aheads of sequential blocks do not pose a problem because sequential reads are ignored by the replicator  14  Chapter 3 . Design anyway. Being below the buffer cache prevents us from seeing the original application access patterns, as the cache absorbs some of the more frequent accesses . As a result, we see less locality than there actually is . One can certainly imagine advantages of having a replication-aware cache, but we have not explored this direction.  3.2 Replication The two principal questions we need to answer about replication is where to replicate, and when.  Since one of our design decisions is to replicate  anywhere on disk as long as read requests are replicated sequentially and in the same order, we simply ensure that all blocks from temporally-related accesses are replicated contiguously. The second question, when to replicate, is not a trivial one . On the one hand, we want to minimize replication overhead and the associated interference ; replicating involves seeking to the replication location and back, and uses up the free space with each new replica . On the other hand, it would make no sense to replicate a single block that happens to be out of the way with respect to the current request stream to another (random) location using our scheme, because the replica might end up even farther . If we replicate an occasional out-of-the-way request together with the subsequent requests that are sequential, the benefit of having such a replica chain may not outweigh the cost of creating it. Ideally, we would like to replicate only those I/O patterns that would recur in the future : otherwise, replicating would only cause overhead without producing any benefits . Since we cannot know the future, we will not know 15  Chapter 3 . Design how useful new replicas will be at the time of replication. Of course, sequentiality of requests does not have to be defined in terms of individual requests . If we limit ourselves to looking only at the current request to determine whether it is sequential, we will miss situations where requests come in pairs of sequential reads, for example, but the pairs are not sequential with respect to each other . 
In general, there can be streaks of an arbitrary number of sequential requests that are not sequential relative to each other . Depending on the scope of our access pattern analysis, we can replicate in streaks of up to n sequential read requests . Increasing n causes heavier replication activity, but also increases potential benefits, for cases where the accesses are continuously jumping between 2 or more locations.  Monitoring Request Stream We keep the history of the most recent disk accesses and compare the current request to the previous requests to form decisions about replicating or using replicas . If the current request is already close to a recent request(s), it is considered a "good", sequential access . To determine sequentiality, we can compare the current request either with just the last (most recent) one or with all requests in the history, optionally giving more weight to the more recent entries . We can also define "sequential" as having at least k > 1 close requests in the history. Looking only at the most recent request when determining sequentiality causes more aggressive replication (with potentially higher payoff), but also makes us more susceptible to noise . The value of k similarly affects the definition of sequentiality (and hence the rate of replication) : the lower the value, the more conservative the replication will be. In our implementation, we look at the entire history and use k >  1,  giving  16  Chapter 3 . Design all requests an equal weight. Similarly, we can include only read requests in the history because write requests are not affected by replication and are often unrelated to reads. This causes replication to be more conservative (more sequentiality will be observed) . In our implementation, we do not include write requests in the history. Delaying Replication Decisions We know that we do not want to replicate blocks read sequentially ; such accesses will not be improved by creating sequential replicas . When the read patterns are not sequential, replicating could be beneficial (if the same blocks will be read again in the same order) . If reads patterns are a mix of sequential and random accesses, it might be logical to try to replicate only the non-sequential portions . However, if the sequential and non-sequential stretches are all short, this introduces a problem of creating temporallydiscontinuous, unrelated replicas . If the same read pattern reoccurs, the disk head would be jumping between two clusters of blocks : the subset of the original blocks that are read sequentially and the replica region that contains copies of blocks from all non-sequential requests . While this would still be an improvement over the no-replication scenario, we run into a problem if some of the replicas are invalidated . In that case, a third replica chain would be started to optimize the out-of-the-way blocks that used to have replicas (which became invalidated) . If replicas continue to be invalidated for any reason at a non-negligible rate, this approach would cause the number of clusters for the same pattern to increase until the replica fragmentation negates the benefit of having replicas in the first place . To prevent this  17  Chapter 3 . Design problem, we want to avoid replicating too few temporally-sequential requests at a time . In other words, we want to replicate for at least some minimum amount of time once we do start replicating. 
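As an illustration of the history-based test described above, the following fragment classifies an incoming read as sequential when it lies within a closeness threshold of at least k entries in the per-process history, weighting all entries equally and recording reads only. The parameter values here are placeholders (the prototype's concrete constants appear in Chapter 4), and the class name is ours.

    from collections import deque

    class RequestHistory:
        """Per-process history of recent read requests (block numbers only)."""
        def __init__(self, size=64, close_blocks=1000, k=1):
            # size, close_blocks and k are placeholders here; the prototype's
            # concrete constants are given in Chapter 4.
            self.reads = deque(maxlen=size)
            self.close_blocks = close_blocks
            self.k = k

        def is_sequential(self, block):
            """Sequential = close to at least k entries anywhere in the history,
            with every entry weighted equally."""
            close = sum(1 for past in self.reads
                        if abs(block - past) < self.close_blocks)
            return close >= self.k

        def record_read(self, block):
            # Writes are not recorded: they are unaffected by replication and are
            # often unrelated to the read patterns.
            self.reads.append(block)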
Since we want to replicate only requests that are out of the way relative to the majority of the current requests, but at the same time avoid replicating if the number of out-of-the-way requests is too small (as discussed above), we choose to accumulate replica creation (write) requests without actually issuing them until we can see how sparse they are and whether they are indeed worth creating. If the non-sequential requests are frequent enough, we commit to replicating and replicate all subsequent requests for some time even if they are sequential.

If the workload is truly random and the access patterns never repeat, we will never be able to benefit from replication. However, it is not easy to determine at any point in time whether the workload access patterns are random or not. A process can have an arbitrarily long cycle where access patterns are all different before they start to repeat. The accessing process might never repeat its access patterns, but another process could benefit from replication at the same or a later time. Therefore, if a process is not benefiting from its own replication, it does not necessarily mean it is creating useless replicas; either another process or the same process might start using them after some more time passes. However, it might still be beneficial to curb replication of a process that has not been benefiting from replication. In our implementation, when free space is low, processes are prevented from replicating until they start to use more replicas than they create.

3.3 Using Replicas

If we have replicas for non-sequential reads, it is a matter of finding the appropriate replica chain and diverting read requests to replica blocks. If we have a choice between multiple replica chains, however, we need to choose one of them. We only see one request at a time, of course, and do not even know what kind of I/O pattern will follow. Ideally, we would like to select a replica chain that will cover the maximal number of following requests. However, we cannot simply hold requests and wait until we see more of them, because this would cause a deadlock with processes performing blocking I/O: the next request will not be issued until the current one is completed. Figure 3.1 illustrates the issue. Suppose we encounter a non-sequential request for block A, and we see that there are 5 replicas of A we can pick from. Only the fourth one will be optimal given the subsequent requests (which we don't see yet). To maximize the chance of picking the best replica chain, we could use additional information about replicas, such as their usage counter or LRU position. The prototype does not currently use any additional information and simply picks the replica with the lowest block number.

[Figure 3.1: Example of multiple replica chains. The request stream is A B C D E F. Five sequential replica chains start with a copy of A: A B X Y Z; A B C K L M; A D E F G H; A B C D E (best); A P Q R S (worst). When we see the request for block A, we can choose among any of the 5 replicas available. Given the subsequent requests, the fourth replica would be the best because that replica chain covers the most requests.]
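The chain-selection step of Figure 3.1 can be sketched as follows. The first function reflects the prototype's actual policy (lowest starting block); the second is a hypothetical oracle shown only for contrast, since it assumes the upcoming requests are known. The data model and names are illustrative.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ReplicaChain:
        start_block: int            # disk block where the chain's copies begin
        original_blocks: List[int]  # the original blocks it copies, in order

    def pick_lowest(chains: List[ReplicaChain]) -> ReplicaChain:
        """Prototype policy: of the chains whose first copy is the requested
        block, simply take the one stored at the lowest block number."""
        return min(chains, key=lambda c: c.start_block)

    def pick_best_coverage(chains: List[ReplicaChain], upcoming: List[int]) -> ReplicaChain:
        """Oracle policy for comparison only: assumes the upcoming requests are
        known and prefers the chain covering the longest prefix of them
        (cf. Figure 3.1)."""
        def coverage(c: ReplicaChain) -> int:
            n = 0
            for want, have in zip(upcoming, c.original_blocks):
                if want != have:
                    break
                n += 1
            return n
        return max(chains, key=coverage)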
3.4 Replica Reclamation

Sooner or later, if we keep replicating into free space, we will run out of free blocks. At that point, we cannot continue replicating unless we discard some of the existing replicas. Whether we should discard old replicas to make room for new ones depends on how well existing replicas are used. If all existing replicas are being actively used, then reclaiming them to create new replicas is unlikely to improve anything: in the best case, we would save time serving a new request pattern, but the gain would be offset by losing the replicas for an old pattern. Moreover, we will soon want to re-replicate the pattern that lost its replicas because that pattern still keeps occurring. On the other hand, if some replicas are not used much, or have not been used in a long time, it would make sense to reclaim their blocks for new patterns. Otherwise, these replicas are just taking up blocks and are no more useful than free space. Thus, we want to reclaim replicas that are not being used, and keep those that are being used. To find good candidates for reclamation, we can keep track of replica usage and reclaim either least frequently used or least recently used ones. We choose to reclaim least recently used (LRU) replicas because new replicas will always be the least frequently used replicas, and we do not want to reclaim freshly created replicas. Also, most frequently used replicas can become obsolete as patterns shift, and we would not be able to tell that just from the usage counters.

Problems with LRU Reclamation

Reclaiming old replicas strictly in the LRU order will not work, however: the LRU order will not necessarily coincide with the physically sequential ordering of replica blocks, as illustrated in Figure 3.2. If we reclaim replica blocks in the LRU order and write new replicas in the same order, we might end up replicating into non-sequential blocks, which would defeat the purpose of our replication scheme: rewriting non-sequential blocks sequentially. Therefore, we must ensure that we reclaim sequential chunks of blocks. Combining the two requirements, the goal becomes to find a sequence of adjacent blocks such that the average rate of utilization of replicas in this sequence is the lowest among all the sequences that we could reclaim.

[Figure 3.2: Example of LRU replica order with respect to the physical block order.]

Since it would be impractical to examine the set of all possible sequences of replica blocks, we approximate it as follows. Suppose we want to reclaim N sequential replicas at a time. Taking some fixed bottom fraction of the LRU replica chain, we sort that set of replicas by block number and scan through them sequentially, looking for the largest number of replicas that fall within a range of size N. This is illustrated in Figure 3.3.

[Figure 3.3: Example of trying to reclaim N blocks by examining replicas from the bottom of the LRU chain in sorted order. The sorted bottom-LRU replicas sit at blocks k, k+1, k+2, k+6, k+7, ..., k+N-1, k+N, k+N+5, and a sequence of N adjacent blocks is selected for reclamation.]

It is possible that the LRU order destroys the sequentiality of the replicas too much. In the worst case, every replica in the bottom portion B of the LRU chain is adjacent to replicas outside of B (i.e., higher up the LRU chain). This would mean that for all replica sequences that we could possibly reclaim using our scheme, only one replica in each sequence is actually at the bottom of the LRU chain. Reclaiming any such sequence would mean discarding replicas that might still be useful simply because there is one replica among them that was not well used.
To avoid accidentally discarding good replicas, we require some minimum number of replicas from B to be present in the reclaimed sequence (currently 50%). If no such sequence is found, we do not reclaim any blocks at all and wait until more replicas fall out of use.

The relative position of old replicas within a reclamation sequence may also be significant; if old replicas (in B) are intermixed with newer ones within a sequence, it makes the new replicas less useful (because they are less sequential) than if they were all clustered together. Thus, we might want to have a lower threshold for the proportion of replicas from B in a sequence for reclamation if there is such intermixing. The prototype does not currently consider this case.

With only the LRU ordering information, we might not be able to notice when some replicas are no longer used, because the relative ordering of replicas at the bottom of the LRU list will not change at all if they are not being used. To address this problem, we add age information for each replica; every time we fail to reclaim anything, we increment the ages of all replicas. The next time we try to reclaim, we consider all replicas from the bottom fraction B of the LRU chain, plus all replicas beyond B that are older than the threshold age (if any). We then search for the maximum sequential set as described above. Every time a replica is used, its age is reset back to zero.

Reclaiming sequential replicas with the majority of them from the bottom of the LRU chain ensures the availability of adjacent blocks for replication, but it still does not preclude us from reclaiming good replicas. If all replicas are being used, including those at the bottom B of the LRU chain, and the LRU order has not fragmented the sequentiality of replicas too much, we will not even notice that we are reclaiming good replicas under our scheme. This can happen if the rate of reclamation exceeds the rate of replicas falling into disuse. However, reclaiming active replicas can be dangerous. If replicas are active, that means the corresponding read patterns keep reoccurring, and if we discard their replicas, we will likely have to re-replicate those patterns again. If we reclaim still more replicas to re-replicate discarded ones, we might get into a feedback loop where we need to replicate more and more patterns as we are discarding more and more replicas to accommodate the new replications, which triggers yet more replication. Eventually, this can lead to thrashing, where we keep replicating most of the requests and reclaim fresh replicas without having had a chance to use them, in order to accommodate the new ones. We have observed this condition in practice when the active set of access patterns of the workload is larger than the amount of free space.

To make sure that we do not reclaim recently-used replicas, we can increment the ages of the bottom-LRU replicas while looking for reclamation candidates, and not touch replicas from the bottom past the first replica with age zero. This would only work, however, if we come back for another reclamation attempt after a sufficient amount of time. Otherwise, there would not be enough time to access all active replicas and reset their ages. This is essentially a trade-off between waiting sufficiently long to avoid reclaiming good replicas and the system's responsiveness in reclaiming free blocks when they are needed. It is not implemented in our prototype.
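The candidate search described in this section (sort the bottom-of-LRU and over-age replicas by block number, then look for the densest window of adjacent blocks, subject to a minimum share of candidates) can be summarized in a short sketch. The function below is illustrative; the prototype's concrete constants are given in Chapter 4.

    import bisect

    def find_reclaim_window(candidate_blocks, window_size, min_candidates):
        """Find the window of `window_size` adjacent blocks containing the most
        candidate replicas (bottom-of-LRU or over-age). Returns (start_block,
        candidate_count), or None if no window holds at least `min_candidates`."""
        blocks = sorted(candidate_blocks)        # the prototype bucket-sorts in place
        best_start, best_count = None, 0
        for idx, start in enumerate(blocks):
            # Number of candidates in [start, start + window_size).
            end_idx = bisect.bisect_left(blocks, start + window_size, idx)
            count = end_idx - idx
            if count > best_count:
                best_start, best_count = start, count
        if best_count >= min_candidates:
            return best_start, best_count
        return None   # wait; age the replicas and try again later

With the prototype's constants from Chapter 4, this would be called as find_reclaim_window(candidates, 1024, 512); the replicas adjacent to the winning candidates inside the window are reclaimed along with them.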
25  Chapter 4  Implementation We have implemented a prototype replication system for the Linux 2 .6 kernel consisting of the replication kernel module and the user-space free block list extractor for the ext2 file system, which is simply a python script.  4 .1 Block Layer The replication module is implemented as a stacked device driver which obtains exclusive access to the underlying block device and exports a virtual device to the kernel through the standard block device interface : the replication is completely transparent to the user . The driver receives I/O requests from the kernel, processes them and passes them (modified or unmodified) on to the actual driver of the underlying block device . Processing requests may involve updating replication data structures, determining whether to divert the request to another area on disk (i .e ., to read replica blocks instead of the original blocks), or whether to replicate the blocks being accessed. Hard drives are subdivided into sectors, normally 512 bytes in size, which is the smallest addressable unit when transferring data to or from disk. Most file systems work in units of blocks, which is an integral number of sectors, normally a power of 2 . Thus, file systems such as ext2 and ext3 treat the disk space as a simple one-dimensional array of blocks . The file system block size is frequently an adjustable parameter that can be specified 26  Chapter 4 .  Implementation  when formatting a volume . A common block size is 4 KB, or 8 sectors. All normal file operations involve an integral number of blocks, unless the file system is accessing meta-information, such as reading the superblock during mounting . We work in units of blocks rather than hardware sectors because all block requests in the normal course of file system operation are block-aligned and comprise an integral number of blocks . The coarser block granularity reduces the amount of memory needed for replication data structures . Any requests that are not block-aligned or do not consist of an integral number of blocks are ignored by the replicator and are passed through to the underlying device unchanged. When analyzing distances between blocks to determine whether two requests are close (needed for decisions on creating and using replicas), we use the cut-off value of 1000 blocks : any two blocks (and only those) whose difference in logical addresses is less than 1000 are considered close to each other . As discussed in Section 3, the cut-off value should be small enough to avoid having to make unfounded assumptions about disk geometry, but we also do not want to make the value too restrictive and replicate blocks that are already very close to each other . Since an 80 GB file system will have over 20 million 4 KB blocks, 1000 is a relatively small number. 4 .1 .1 Major Data Structures The kernel module maintains the following major data structures : perprocess recent request history, lookup tree of replica blocks, lookup tree of replicated blocks, lookup tree of free block extents, and replica LRU list.  27  Chapter 4 . Implementation Per-process Information The module records the recent I/O requests (both reads and writes) that it has seen in its per-process request history . Each request will consist of 1 or more blocks (requests with non-integral number of blocks are ignored) . Requests are recorded in the order they come in, even though the I/O scheduler will reorder and merge them before issuing the requests to the driver . 
The request history is maintained for each process separately so as to avoid nondeterministic interleaving of I/O that confounds real application patterns. The prototype implementation has a hard-coded limit for the maximum number of processes and recycles the history structure of the least recently used process for each new process. The number of requests in the history is set to 64, which is comparable to the average number of requests queued in a typical I/O scheduler (the actual number of queued requests will depend on the scheduler's algorithms, maximum queue size, and the disk utilization rate). This size is motivated by our usage of the history; we look at previous requests to determine whether the current request is close to any of them. If it is, we expect that it can be merged with a recent request that is still in the I/O scheduler queue, or that it can be issued immediately while the disk head is still close to the target blocks. In contrast, if a previous request is too old and has already been serviced, the disk head might already have moved on. Knowing that the current request is close to this old request is not useful, as the I/O scheduler will not be able to take advantage of the fact. As an alternative to a hard-coded value for the history size, we could maintain only those recent requests that have not yet completed.

Other per-process information includes the block number where the last replica was written, which allows the replicator to continue writing new replicas sequentially if possible.

Replica Information

In our implementation, a replica is always a single block that contains a copy of one file system (data or metadata) block. It is conceivable to have a notion of a replica chain as an entity instead, which would mean 1 or more adjacent replica blocks holding copies of the corresponding number of original blocks. This would reduce the amount of memory required to maintain the replica information, but it would increase the complexity of maintenance and of the replica finding algorithm, as well as reduce the flexibility of using replicas that happen to be non-sequential but close enough. We choose to keep one "replica" structure for each replica block. Each replica structure contains the block number where the replica resides, the replica age, and the "in use" flag, which is set while the replica is being read (see Section 4.1.4). All replicas of the same original block are connected into a single linked list in sorted order. The replicas are stored in a radix tree for efficient lookups, which we need to perform to detect when a file-system-originating write is overwriting (and invalidating) a replica.

We also store one "original block" structure for each replicated block. These structures are stored in another radix tree for efficient lookups needed to detect file-system-originating writes that update the contents of the replicated block and thus invalidate its replicas. Each "original block" structure contains the block number, a pointer to the list of replicas of the block, and a list of incomplete (in-flight) replicas (see Section 4.1.4).

An LRU list of all replicas is maintained for replica reclamation purposes. Each time a replica is used (read), it is moved to the tail of the LRU list and its age is reset to zero. When reclaiming, we look at the replicas from the head of the list.
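A user-space model of this bookkeeping (per-replica records, per-original-block chains kept in sorted order, and a global LRU of replicas with invalidation on file system writes) might look like the following. The kernel prototype uses radix trees and intrusive linked lists rather than Python dictionaries, so this is only a structural sketch with invented names.

    from collections import OrderedDict
    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Replica:
        replica_block: int       # where the copy lives on disk
        original_block: int      # the block it duplicates
        age: int = 0
        in_use: bool = False     # set while the replica is being read

    class ReplicaIndex:
        def __init__(self):
            self.by_replica_block: Dict[int, Replica] = {}         # stands in for one radix tree
            self.by_original_block: Dict[int, List[Replica]] = {}  # stands in for the other
            self.lru: "OrderedDict[int, Replica]" = OrderedDict()  # head = least recently used

        def add(self, rep: Replica) -> None:
            self.by_replica_block[rep.replica_block] = rep
            chain = self.by_original_block.setdefault(rep.original_block, [])
            chain.append(rep)
            chain.sort(key=lambda r: r.replica_block)   # chain kept in sorted order
            self.lru[rep.replica_block] = rep           # newest at the tail

        def mark_used(self, rep: Replica) -> None:
            rep.age = 0
            self.lru.move_to_end(rep.replica_block)     # tail = most recently used

        def filesystem_write(self, block: int) -> None:
            # A write to a replicated original block invalidates all of its replicas.
            for rep in self.by_original_block.pop(block, []):
                self._drop(rep)
            # A write into a block we used as a replica (an allocation by the
            # file system) invalidates that one replica.
            rep = self.by_replica_block.get(block)
            if rep is not None:
                chain = self.by_original_block.get(rep.original_block, [])
                if rep in chain:
                    chain.remove(rep)
                self._drop(rep)

        def _drop(self, rep: Replica) -> None:
            self.by_replica_block.pop(rep.replica_block, None)
            self.lru.pop(rep.replica_block, None)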
Each replica structure takes 24 bytes and each "original block" structure needs 20 bytes, which means that 80 GB worth of replicas will require roughly 1 GB of RAM (or less) for the data structures . When the replicator estimates that the memory is running low, it stops creating new replicas to prevent more memory allocations for new replicas . The effect of running out of memory is therefore the same as the effect of running out of free space. Free Block List The module stores free block information as extents of free blocks, which is a more compact representation of the free disk space . File systems normally try to avoid fragmenting free space, which is advantageous to this representation . The extents are stored in two red-black search trees simultaneously: once by extent start block, and once by extent length . We maintain the search-tree-by-start-block to efficiently detect free block allocations by the file system, which causes us to lose the free blocks . The search-tree-by-length is maintained in order to efficiently find an extent that is sufficiently large for a particular replication request . When the current extent of free blocks has been filled with replicas and we need to find another region with free space, we pick the currently largest extent such that it is not too close to the blocks we are replicating (to avoid creating a replica close to the original block) . We do not replicate into extents that are smaller than the minimum reclamation amount (1024 blocks). Each free block extent structure is 32 bytes . The memory requirements  30  Chapter 4. Implementation for storing free block information will depend on how fragmented the free space is.  4.1 .2 Request Completion Handler Because disk transfers complete asynchronously with request submissions, we have a separate kernel thread to handle completions of relevant requests. There are several types of requests whose completions need our attention. When a request that has been diverted to replica blocks finishes, we need to update the state of the corresponding replicas to clear their "in use" flag. When a read request that we decided to replicate completes, we need to initiate replication . Finally, when the replica write request completes, we need to augment the replica data structures with the new replicas . Each time a request is signaled as complete, the completion handler wakes up and processes the completed requests.  4.1 .3 Operation We process I/O requests as they are submitted to the block layer. Write requests . We must check the destination of all write requests : if it overlaps with blocks that have replicas (the file system is overwriting existing blocks), we must invalidate the corresponding replicas ; if it overlaps with blocks that are replicas (the file system is allocating free blocks), we invalidate those replicas ; if it overlaps with blocks on our free list (the file system is allocating free blocks), we remove the blocks from the free list. Also, if the write conflicts with an in-flight replica I/O, we need to handle it as described in Section 4 .1 .4. Read requests . When we see a read request, we can divert it to replica  31  Chapter 4. Implementation blocks if possible, or replicate the accessed blocks . We begin by counting how many previous sequential requests this request is close to . If it is sequential with at least 8 recent accesses, we pass it unmodified with the expectation that it will be efficient . 
The number 8 is the cut-off value ; we assume that sequential streaks of 8 or more disk accesses are efficient enough and do not require replication . If the current request is sequential to fewer than 8 previous ones in the history, we look for replicas . We check each replica chain (if any) that covers the entire request and take the first one that is close to one of the requests in the history . If there is no such replica chain (e .g ., no replica chain covers all blocks in the request, or no replica chain is close), we conclude that this request is a candidate for replication. Replicating In order to replicate a request, we first need to wait until the data has been read from disk ; a replica write request cannot be issued until the data is in memory. Therefore, we alter the request to be notified of the request completion and pass it to the underlying device . When the request completes, we duplicate the original read request, copy the block data pages, substitute the target disk address with the new replica location, and change the direction of I/O to write . If we did not copy the data pages by value and instead used pointers, we would have to hold off notifying the original read request issuer until the replica has been written . Otherwise, the data pages might be released and their contents clobbered before replica writing is over . Since we queue and delay our replica write decisions until we see more read requests, we cannot delay completion notifications for the reason described in Section 3 .3 .  32  Chapter 4 . Implementation When we deem that a read request should be replicated, we do not issue a replica write request right away . As described in Section 3 .2, we want to avoid replicating a single request because there is no sense in having a replica chain one request long . Thus, we queue the replica writes and wait to see how sequential the following read requests turn out . If the current request turns out to be the only one that is non-sequential, followed by sequential reads, replication will be of dubious value . Hence, we wait until the queue has 8 (tentative) replica writes and count the number of sequential requests (each relative to the preceding request) among them. If there are 4 or more sequential requests out of the 8 enqueued, we elect not to replicate and cancel the enqueued replica writes . Otherwise, we initiate replica writes . The small numbers (8 and 4) should be sufficient to prevent noise from causing unnecessary replication ; larger numbers might be necessary if a more involved algorithm is used for replication decisions. If we do not get 8 replica writes in the queue before the timeout, set at 2 seconds, we cancel the enqueued replica writes ; it is pointless to try to improve disk performance if the application is not I/O-bound . Short term delays in replica writing poses no problem, since if the application issues a read for the same blocks too soon, they will likely be fetched from the cache anyway. Once we start replicating, we keep replicating at least the next 64 requests, whether they are sequential or not . We do this precisely because we do not want to create uselessly short replica chains . At the same time, we do not want to replicate unconditionally for too long in case the accesses become sequential. Sometimes replica writes may be cancelled due to technical reasons . For example, memory allocation for a new data structure might fail, or the write 33  Chapter 4 . Implementation queue of the underlying device might be congested . 
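Putting the thresholds above together (a decision window of 8 tentative replica writes, cancellation when 4 or more of them are sequential, the 2-second timeout, and the 64-request commitment), the decision logic can be sketched as below in simplified form; congestion handling, described next, is omitted, and the structure and names are ours rather than the prototype's.

    import time

    WINDOW = 8             # tentative replica writes gathered before deciding
    MAX_SEQUENTIAL = 3     # cancel if 4 or more of the 8 are sequential
    TIMEOUT_S = 2.0        # cancel if the window does not fill within 2 seconds
    COMMIT_REQUESTS = 64   # once committed, replicate at least this many requests

    class ReplicationPolicy:
        def __init__(self):
            self.pending = []        # (block, is_sequential) awaiting a decision
            self.window_opened = None
            self.commit_left = 0

        def on_read(self, block, is_sequential, issue_replica_write):
            if self.commit_left > 0:             # committed: replicate unconditionally
                self.commit_left -= 1
                issue_replica_write(block)
                return
            if is_sequential and not self.pending:
                return                           # efficient read, leave it alone
            if not self.pending:
                self.window_opened = time.monotonic()
            elif time.monotonic() - self.window_opened > TIMEOUT_S:
                self.pending.clear()             # application is not I/O-bound
                return
            self.pending.append((block, is_sequential))
            if len(self.pending) < WINDOW:
                return
            if sum(1 for _, seq in self.pending if seq) <= MAX_SEQUENTIAL:
                for blk, _ in self.pending:      # commit: write the queued replicas
                    issue_replica_write(blk)
                self.commit_left = COMMIT_REQUESTS
            self.pending.clear()                 # otherwise cancel the window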
When we detect write queue congestion, we wait for 256 reads without attempting to create new replicas in order to allow the write queue to clear up.

Replica Reclamation

When replicating, we write replicas sequentially into free blocks until we fill the entire contiguous extent. We then choose another free block extent that is at least 1024 blocks long. When there are no such extents left, we look for old replicas to reclaim. To do that, we take the bottom 10% of the replicas in the LRU order (or all replicas older than age 250, whichever is greater) and sort them using an in-place O(n) bucket sort. Once the list of candidate replicas is sorted, we find a 1024-block-long range that has the most replicas from the bottom of the LRU chain. To achieve this, for each replica in the list of candidates for reclamation (say, starting at block i) we count the number of replicas in the list that fall within the range (i, i+1024), which can be done in O(n). If a 1024-block range with the maximum number M of replicas from the candidate list is such that M >= 512, we reclaim the entire consecutive range (M replicas from the candidate list and 1024 - M victim replicas from outside the candidate list that happen to be adjacent to the best candidates). Checking whether replicas even in the bottom 10% of the LRU list are active is not implemented in the prototype. If M < 512, we do not reclaim anything and increase the ages of all replicas by 1. We then wait for N/250 reads (where N is the total number of replicas) without trying to reclaim replicas, to allow the LRU order to change. If the subsequent reads do not cause the LRU order to change significantly, no reclamation will occur.

4.1.4 Conflicting I/O

Because replication is not an atomic process, it is possible that while we are waiting for the replica access to finish, the file system issues a request that conflicts with it in some way. For example, if the file system overwrites the replicated block, we need to invalidate the replica that has not even completed yet. To systematically examine all these possibilities, we need to consider what happens when the file system issues an I/O request for a replicated block or a replica block while an I/O for a related (or the same) block is in flight. Note that the file system will never read replica blocks, because they reside in free, unallocated blocks. Table 4.1 lists the possible combinations of accesses.

                                       new request
    in-flight request       R (original)   W (original)   W (replica)
    R (original)            OK             RC             INV
    R (replica)             OK             RC             RR
    W (original)            RC             RC             -
    W (replica)             OK             INV            B

Table 4.1: Combinations of new and in-flight requests

From the table, we can see that if the in-flight request and the new request are both reads, there is no problem (marked as "OK"). If we are in the process of writing a replica (replicating a file system block) and the file system issues a read for the original block, this creates no conflict either.

If a write to a file system block is in progress and another read comes in for the same block, it introduces a race condition (marked as "RC"), which originates from above the block layer (we never issue writes to file system blocks). If the file system or direct-I/O user allows this case, it must be aware of it and prepared to deal with the race condition, because the block level provides no ordering guarantees (unless barrier requests are used). Therefore, this case is not a problem for us.
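The dispositions in Table 4.1 can be encoded directly as a lookup keyed by the in-flight and new request types; the INV, RR and B entries referenced here are explained in the remaining paragraphs of this section.

    # (in-flight request, new request) -> disposition from Table 4.1
    # OK  = no conflict             RC = race condition above the block layer
    # INV = invalidate the replica  RR = re-read the original (not handled in prototype)
    # B   = block the file system write until the replica write completes
    # '-' = impossible combination
    CONFLICT = {
        ("read_original",  "read_original"):  "OK",
        ("read_original",  "write_original"): "RC",
        ("read_original",  "write_replica"):  "INV",
        ("read_replica",   "read_original"):  "OK",
        ("read_replica",   "write_original"): "RC",
        ("read_replica",   "write_replica"):  "RR",
        ("write_original", "read_original"):  "RC",
        ("write_original", "write_original"): "RC",
        ("write_original", "write_replica"):  "-",
        ("write_replica",  "read_original"):  "OK",
        ("write_replica",  "write_original"): "INV",
        ("write_replica",  "write_replica"):  "B",
    }

    def classify(in_flight, new_request):
        return CONFLICT[(in_flight, new_request)]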
If a write to a file system block comes in while a read or a write for the same block is in progress, this again is a race condition introduced above the block layer and is consequently the block layer user's responsibility, not ours. If a write to a file system block arrives as we are reading a replica of that block, it could potentially result in a different outcome than if we were reading the original block; but because the request ordering at the block layer is opaque to its users and no guarantees are given, this does not violate the semantics of block I/O.

If we are in the process of creating a replica and the file system issues a write to the original block, this effectively foils our replication attempt. In this case, we mark the replica write as invalid so that when it completes, we do not add the new replica to our replica list. This situation is marked as "INV".

If a read of a file system block is in progress and the file system allocates a block that contains a replica of the block being read, we simply need to invalidate the replica. This does not affect the reading of the original block.

If we are in the process of reading a replica of a block and the file system decides to allocate the block containing the replica at the same time, we need to either delay the write, or discard the replica read result and repeat the read, this time on the original block. This case (marked as "RR") is not currently handled in the prototype implementation; it has not arisen during testing and evaluation.

If we are in the process of writing a replica and the file system decides to allocate the same block, we need to delay the file system write to make sure the I/O scheduler does not reorder the writes and cause the file system write to be overwritten with our replica write. We handle this case (marked as "B") by not submitting the file system write to the underlying device until the replica write has completed. The prototype does not currently preserve barrier semantics in this case, which is required by journaling file systems.

Finally, if a file system write is in progress, we will have invalidated all replicas of the target block, so the file system cannot issue a write to a replica of the block being written. This case is marked as "-" and is impossible.

4.2 Free Space Extractor

The free space extractor for the ext2 file system is implemented as a simple user-space Python script that accesses the disk directly and reads the free block bitmaps in all block groups of the volume, collecting the free block information in extent format. The program output is then fed to the sysfs interface of the kernel module, which updates its free block extents accordingly.

Chapter 5

Evaluation

We have evaluated the performance of our replication implementation on synthetic workloads that exhibit repeated access patterns, as well as on a microbenchmark that shows the overhead of replication.

5.1 Test Environment

The tests were run on a server machine with one 3.2 GHz Pentium 4 CPU with 2 hyperthreads, 1024 MB of RAM, and two identical 80 GB SATA drives: one used by the operating system (for the root file system and the swap partition) and one for the data used by the test workloads, with no other data stored on the disk and no other processes accessing it. Each hard drive is a Western Digital WD800 Caviar SE with a 66.7 MB/s UDMA transfer speed, a 7200 RPM spindle speed, a 9.5 ms average seek time, an 8 MB cache, and Native Command Queuing (NCQ) support.
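These specifications give a sense of the gap that sequential replication targets. As a rough back-of-the-envelope estimate (derived only from the drive specifications above, not from any measurement in this thesis, and ignoring the effects of the drive cache and NCQ), a single random 4 KB read costs approximately

\[
t_{\mathrm{random}} \approx t_{\mathrm{seek}} + t_{\mathrm{rotation}} + t_{\mathrm{transfer}}
\approx 9.5\,\mathrm{ms} + \frac{60{,}000\,\mathrm{ms/min}}{2 \times 7200\,\mathrm{min}^{-1}} + \frac{4\,\mathrm{KB}}{66.7\,\mathrm{MB/s}}
\approx 9.5 + 4.2 + 0.06 \approx 13.8\,\mathrm{ms},
\]

or roughly 0.3 MB/s of random-read throughput, more than two orders of magnitude below the drive's 66.7 MB/s sequential transfer rate. This gap is what replacing random reads with sequential replica reads aims to close.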
5.2 Best Case Scenario

The first test presents the best-case scenario in order to determine how much improvement is possible. We run multiple iterations of the same sequence of random reads caused by repeatedly running Postmark. Postmark is a file system benchmark that tests basic operations: reads, writes, file creation, deletion, and appending. It strives to realistically simulate an Internet mail or USENET news system by performing "typical" operations on many small files (between 500 and 10,000 bytes in size). The benchmark runs by first creating all files, then running "transactions" on them (reading from or appending to a file, followed by creating or deleting another file), and finally deleting all of the files. We modified Postmark to only perform the "transactions" on existing files, without creating the files in the beginning or deleting them at the end. If files were deleted at the end and recreated in the next iteration, all replicas would become invalidated at each iteration, erasing all benefit of replication. We also fixed the random seed so that the "transactions" in each iteration would be identical. We used Postmark v1.51 with the following parameters: an initial set of 572,000 files (4 GB in total) spread over 100 subdirectories, no appending (only reading), and no creation or deletion of files, with a fixed random seed. All other parameters were left at their default values.

Each iteration of the modified Postmark without replication completes in just over 5,500 seconds, or about 1.5 hours, as shown in Figure 5.1. The first iteration takes slightly longer because the cache is initially cold. Running the iterations with replication results in an approximately 19x speedup by the seventh iteration. The first iteration takes about 18% longer because of the replication overhead: approximately 63% of the (active) replicas are created in the first iteration. Not all replicas are created in the first iteration because of the replication throttling that kicks in when write queue congestion is detected. In the subsequent iterations the rate of replication slows down significantly, but does not reach zero. The main reason for this is that as the files are being accessed, their access times change, and the file system issues writes that update the atime fields of the corresponding inodes. Because inode reads were part of the access patterns and were therefore replicated, the replicas of inode blocks are constantly invalidated as file access times are updated, causing re-replication in every iteration.

[Figure 5.1: Repeating modified Postmark iterations with and without replication. Time per iteration (s) versus iteration number; series: ext2, replication.]

This explanation is corroborated by running the same experiment on a file system mounted with the noatime option, which disables updates to file access times. The results are shown in Figure 5.2, and we can see that the second iteration is already as good as the subsequent ones. The replication activity in subsequent iterations is halved compared to the average iteration when atimes are updated.

Even with the test running on the file system mounted with the noatime option, each iteration still replicates about 0.2% of all reads after the first iteration.
The reason for this continuing replication is the non-determinism introduced by the buffer cache; because of caching, the block layer does not see identical patterns in each iteration, and the mismatch between what is replicated and what is actually requested causes additional replication.

[Figure 5.2: Repeating modified Postmark iterations with and without replication when the file system is mounted in read-only mode. Time per iteration (s) versus iteration number; series: ext2, replication, sequential read.]

We can compare the above performance with the best-case performance of reading the disk sequentially, with no seeks at all. Reading the same number of blocks as Postmark reads in one iteration with the program dd completes in 60 to 100 seconds. Sequential reads are faster at the edge of the disk (the beginning of the block address space) than near the spindle (the end of the address space) because of the Zoned Constant Angular Velocity technique [17], which packs more sectors onto tracks with a larger physical radius.

5.2.1 Multi-threaded workload

To test the performance of a multi-threaded workload, we duplicated the 4 GB Postmark-generated dataset to a second directory and ran iterations of two instances of Postmark simultaneously, each on its own directory. As expected, running two instances simultaneously takes slightly longer than running them one after the other, because there is more seeking between the two Postmark directories. Figure 5.3 shows the performance of one of the processes in this two-process experiment. As we can see, running the Postmark instances with replication enabled shows a more gradual improvement: the patterns are not fully replicated sequentially until closer to the 15th iteration. This is due to other constraints that sometimes prevent replicas from being written, such as temporary back-off during write queue congestion.

[Figure 5.3: Running two Postmark loops simultaneously. Time per iteration (s) versus iteration number; series: ext2, replication.]

5.2.2 Replica reclamation

To test replica reclamation, we run the same Postmark iterations on a disk that is 95.3% full and has only 989,000 free blocks. This forces the replicator to start reclaiming old replicas after the 30th iteration. Figure 5.4 shows the performance on the full disk for 100 iterations. We can see that the performance is about 1.4x slower than on the empty disk. This can be explained by the fact that the location of free blocks is now more constrained, and replication might have to be done farther away from the workload's dataset blocks. However, once reclamation begins (starting at iteration 31), the performance penalty is negligible, because only old replicas are reclaimed and the rate of reclamation (0.35% in this experiment) does not exceed the rate at which replicas fall into disuse.

[Figure 5.4: Repeating modified Postmark iterations with and without replication when the file system is 95% full. Time per iteration (s) versus iteration number; series: empty, full.]

5.3 Patterns with Noise

To validate our approach in a more realistic scenario, we have devised the following workload.
Given a set of files and a set of patterns (sequences of files), one process reads the patterns (entire files, in sequence) in a random order, intermixed with some "noise", during which it reads the same files in a random order and not as part of any pattern. Another process reads the same (entire) files in a random order all the time.

We took the same 4 GB dataset of 572,000 files generated by Postmark; the files are on average 7.3 KB in size. We then generated 1144 patterns of between 10 and 200 files each (105 files per pattern on average), such that the total size of all patterns is about 860 MB. We also left only 2 GB of free space on the disk. Figure 5.5 shows the performance of the two processes, one reading the patterns 80% of the time (and reading files in a random order 20% of the time), and the other reading files in a random order all the time. The circle points correspond to the performance without replication, while the square points show the performance with replication. We can see that the first process, reading patterns, performs faster even without replication, benefiting from caching, while the second process has fewer cache hits and thus runs slower. When we replicate, the speed of both processes is initially slightly lower than their corresponding speeds without replication, but after about 4000 seconds all patterns have been replicated and process 1 begins to enjoy the benefits of having sequential replicas: its rate of reading increases. Process 2 always performs slightly worse with replication because of the replication overhead, which never pays off. Replica reclamation begins by about the 10th iteration, and clearly only the unused, useless replica chains of randomly-read files are reclaimed, while the good replicas of patterns are kept, resulting in the speedup for process 1.

[Figure 5.5: Performance of a pattern-based synthetic workload with noise. X-axis: total time (s); series: ext2 patterns, ext2 random, replication patterns, replication random.]

5.4 Overhead

To assess the overhead of replication, we used the Bonnie++ microbenchmark. Bonnie++ is a file system and hard drive performance benchmark that tests sequential and random reads and writes, as well as file creations and deletions. We used 4 GB for the total dataset size parameter and ran it on the initially empty disk.

The sequential I/O part of the tests shows that replication introduces less than 0.6% overhead, as can be seen from Figure 5.6. The number of replicas created during the sequential tests is negligible, since sequential reads do not trigger replication.

[Figure 5.6: Results of the sequential block I/O part of the Bonnie++ microbenchmark with and without replication.]

When performing random seeks, on the other hand, we observe a 6.3% degradation when replicating compared to the no-replication performance (Figure 5.7). This is the expected outcome of replicating without reaping any benefits from the replication. The random I/O portion of the tests is too short for any benefits to be visible, even if there were patterns.

[Figure 5.7: Results of the random block I/O part of the Bonnie++ microbenchmark with and without replication.]

5.5 Git

Git [2] is a version control system used, among others, for Linux kernel development.
It is a distributed source management tool: every copy contains the entire repository, and no central repository is needed for operation. Git was designed specifically with the Linux project in mind and built for scalability and efficiency. Every commit to the repository is stored in a file named after the hash value of the contents of the commit, causing non-sequential disk accesses when going through the history.

To test the performance of git with replication, we obtained a repository containing the Linux development tree, versions 2.6.11 through 2.6.23.11 (as of December 2007), occupying 484 MB on disk, and ran the status command, which returns the list of files that have been modified since the last commit (or a "nothing to commit" message if nothing was modified, as was the case here). Running git status without replication takes on average 9.3 seconds to complete. We ran the command repeatedly, remounting the file system between repetitions to minimize buffer cache effects. With replication enabled, the command takes 12.4 seconds (35% slower) to complete the first time, but by the fifth iteration there is a 45% speedup, with the command completing in 5.1 seconds on average. Although the vast majority of the accesses are concentrated within 400,000 adjacent disk blocks (out of 20 million total), meaning that the seeks will never be too long, sequential replication also reduces the rotational delay, which helps to reduce the total execution time in this workload.

[Figure 5.8: Running the git status command on the Linux kernel development tree. Time per iteration (s) versus iteration number; series: ext2, replication.]

5.6 Summary

We can see that workloads that exhibit repeating access patterns can be greatly improved by replication; the more regularity there is in the access patterns, the more improvement is possible. The dataset must of course be large enough that the patterns are visible to the block layer and not distorted too much by cache hits. Replication results in overhead only when workloads have no locality and access the disk at random locations; fortunately, real-life applications rarely behave this way. When the access is sequential, there is no performance overhead associated with replication, because no replicas are created. In general, writes have no effect on performance unless they overwrite or invalidate replicas. If the application rapidly allocates and frees disk space, the benefit of replication will be diminished, as many replicas would be invalidated without having been used.

Chapter 6

Future Work

There are a number of improvements and directions for further investigation that could be applied to the design of our block-level replication.

Choosing replicas. Currently, we select the first replica that we find to be close to a previous request. When there are multiple replicas that could be used, the selection could be performed more thoughtfully. For example, the most recently used replica, or the replica that is part of the longest replica chain, could be favoured.

Restoring invalidated replicas.
In the current implementation, when a replicated block is overwritten by the file system, all of its replicas are invalidated and their blocks are added back to the free list. As we have seen, this can split replica chains that have metadata blocks intermixed with data blocks when the metadata blocks are updated, causing re-replication of patterns that already have almost-complete replica chains. If, instead of invalidating replicas, we propagated the updates to the replicas, we might be able to save some replication effort. Clearly, update propagation would introduce more complexity, and care would have to be taken to avoid introducing undue overhead. For example, updating a replica might be delayed until the next time we see a request for nearby blocks.

Controlling the replication rate. As the previous chapters have shown, controlling the replication rate is difficult. It is hard to predict how useful a particular replica will turn out to be, and whether it is worthwhile to replace one replica with another. When the active set of replicas exceeds the number of free blocks available to us, the reclamation rate will exceed the rate of replicas becoming old, and in the worst case we might end up overwriting replicas too soon, without having had time to benefit from them. Even though we have a throttle for this case that prevents unbounded growth in overhead, it is possible to do better. For example, simply stopping replication when running out of free space can result in better performance for a workload with regular accesses, at least for some time: the workload would benefit from the existing replicas all the time instead of throwing them out and recreating them again, and would also avoid the overhead of creating new replicas. Detecting such situations and employing this simple strategy would improve performance.

On the other hand, when the disk is idle, it would not hurt the foreground tasks if we replicated more aggressively. Idle periods could also be used for delayed replication or for restoring invalidated replicas.

When a process accesses the disk in a truly random fashion, ideally we want to recognize this behaviour and disable replication altogether. One way to throttle replication is to monitor the replica usage and creation rates of each process and disallow replication if a process has not been using enough replicas. Since one process can benefit from replicas created by another, the replication quota can be influenced by the "child" replica usage (adding "creator" information to each replica). To allow for the changing I/O behaviour of processes, the replication quota can also be increased once in a while.

Replica reclamation. We have seen the importance of not reclaiming active replicas. Currently, we examine the bottom 10% of the replicas in the LRU order; there is no guarantee, however, that this portion does not include actively used replicas. Instead of hard-coding the 10% value, we could start with a lower proportion and increase it over several iterations until we find a satisfactory block sequence to reclaim. Investigating different aging techniques (when to age, what age thresholds to use) and using the age information when selecting a reclamation sequence would also be worth exploring.

Chapter 7

Conclusion

We have described an approach to improving disk read performance by replicating file system blocks into free space at the block level.
The replication is implemented as a Linux kernel module that resides below the Virtual File System and relies on free block information supplied by a separate component in user space. By intercepting I/O requests, the kernel module improves non-sequential reads by using sequential replicas of blocks. Neither applications nor the file system is aware of the replication, and the free space on disk is not allocated explicitly to store replicas. When replicas fill the free space completely, new replicas begin to replace old ones.

The advantage of replicating blocks rather than rearranging them into a more optimal order is the ability to optimize multiple read patterns at the same time, e.g., to improve non-sequential read patterns without degrading sequential read performance. Replicating into free space removes the need for auxiliary storage and works well most of the time, since file systems are generally not very full. Working at the block level untangles the implementation from the file system code and makes it completely transparent to the block layer's users.

Optimizing disk performance is a well-explored topic, and much work has been done in the areas of disk caches, I/O scheduling, optimizing file system layouts, and adapting block layout dynamically. While caching improves workloads with high locality and I/O scheduling reduces unnecessary head movements, improving the layout or trading disk capacity for performance is sometimes the only remedy for long seeks. This work explores the point in the design space where the advantages of working at the block level are combined with the transparent use of the file system's free space. Read-intensive workloads that exhibit repeated access patterns can benefit at little extra cost.

Bibliography

[1] HFS Plus Volume Format. http://developer.apple.com/technotes/tn/tn1150.html, 2004.

[2] Git - Fast Version Control System. http://git.or.cz/, 2007.

[3] Sedat Akyurek and Kenneth Salem. Adaptive block rearrangement. ACM Trans. Comput. Syst., 13(2):89-121, 1995.

[4] S. D. Carson. A system for adaptive disk rearrangement. Softw. Pract. Exper., 20(3):225-242, 1990.

[5] James Cipar, Mark D. Corner, and Emery D. Berger. TFS: A Transparent File System for Contributory Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pages 215-229, San Jose, CA, 2007.

[6] Wiebren de Jonge, M. Frans Kaashoek, and Wilson C. Hsieh. The logical disk: a new approach to improving file systems. In SOSP '93: Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 15-28, New York, NY, USA, 1993. ACM.

[7] P. J. Denning. Effects of Scheduling on File Memory Operations. In AFIPS Spring Joint Computer Conference, pages 9-21, 1967.

[8] John R. Douceur and William J. Bolosky. A large-scale study of file-system contents. In SIGMETRICS '99: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 59-70, New York, NY, USA, 1999. ACM.

[9] Robert M. English and Alexander Stepanov. Loge: a self-organizing disk controller. In Proceedings of the Winter USENIX Conference. USENIX Association, 1992.

[10] Gregory R. Ganger and M. Frans Kaashoek. Embedded inodes and explicit grouping: exploiting disk bandwidth for small files.
In ATEC '97: Proceedings of the USENIX 1997 Annual Technical Conference, pages 1-4, Berkeley, CA, USA, 1997. USENIX Association.

[11] Hai Huang, Wanda Hung, and Kang G. Shin. FS2: dynamic data replication in free disk space for improving disk performance and energy consumption. SIGOPS Oper. Syst. Rev., 39(5):263-276, 2005.

[12] L. Huang and T. Chiueh. Implementation of a Rotation Latency Sensitive Disk Scheduler. Technical Report HPL-CSP-91-7, HP, 1991.

[13] L. Huang and T. Chiueh. Implementation of a Rotation Latency Sensitive Disk Scheduler. Technical Report ECSL-TR81, 2000.

[14] Sitaram Iyer and Peter Druschel. Anticipatory scheduling: a disk scheduling framework to overcome deceptive idleness in synchronous I/O. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 117-130, New York, NY, USA, 2001. ACM.

[15] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Trans. Comput. Syst., 2(3):181-197, 1984.

[16] A. G. Merten. Some Quantitative Techniques for File Organization. PhD thesis, 1970.

[17] Rodney Van Meter. Observing the effects of multi-zone disks. In ATEC '97: Proceedings of the USENIX 1997 Annual Technical Conference, pages 2-2, Berkeley, CA, USA, 1997. USENIX Association.

[18] P. H. Seaman, R. A. Lind, and T. L. Wilson. An Analysis of Auxiliary-Storage Activity. IBM Systems Journal, 5(3), pages 158-170, 1966.

[19] R. Card, T. Ts'o, and S. Tweedie. Design and Implementation of the Second Extended Filesystem. In First Dutch International Symposium on Linux, 1994.

[20] L. Reuther and M. Pohlack. Rotational-position-aware real-time disk scheduling using a dynamic active subset (DAS). In Real-Time Systems Symposium, pages 374-385, 2003.

[21] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26-52, 1992.

[22] C. Ruemmler and J. Wilkes. Disk Shuffling. Technical Report HPL-CSP-91-30, HP, 1991.

[23] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding when to forget in the Elephant file system. SIGOPS Oper. Syst. Rev., 34(2):18-19, 2000.

[24] M. Seltzer, P. Chen, and J. Ousterhout. Disk Scheduling Revisited. In Proceedings of the USENIX Winter 1990 Technical Conference, pages 313-324, Berkeley, CA, 1990. USENIX Association.

[25] Margo Seltzer, Keith A. Smith, Hari Balakrishnan, Jacqueline Chang, Sara McMains, and Venkata Padmanabhan. File system logging versus clustering: a performance comparison. In TCON '95: Proceedings of the USENIX 1995 Technical Conference, pages 21-21, Berkeley, CA, USA, 1995. USENIX Association.

[26] C. Staelin and H. Garcia-Molina. Smart Filesystems. In USENIX Winter, pages 45-52, 1991.

[27] Stephen Tweedie. Journaling the Linux ext2fs Filesystem. In LinuxExpo, 1998.

[28] Andrew Warfield, Russ Ross, Keir Fraser, Christian Limpach, and Steven Hand. Parallax: managing storage for a million machines. In HOTOS '05: Proceedings of the 10th Conference on Hot Topics in Operating Systems, pages 4-4, Berkeley, CA, USA, 2005. USENIX Association.

[29] Xiang Yu, Benjamin Gum, Yuqun Chen, Randolph Y.
Wang, Kai Li, Arvind Krishnamurthy, and Thomas E. Anderson. Trading capacity for performance in a disk array. In OSDI '00: Proceedings of the 4th Conference on Symposium on Operating System Design & Implementation, pages 17-17, Berkeley, CA, USA, 2000. USENIX Association.
