Using Idle Workstations to Implement Parallel Prefetch Prediction

by

Jasmine Yongqi Wang

B.Sc., Beijing Information Technology Institute, Beijing, P. R. China, 1997

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 1999
© Jasmine Yongqi Wang, 1999

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

The benefits of prefetching have been largely overshadowed by the overhead required to produce high-quality predictions. Although theoretical and simulation-based results for prediction algorithms such as Prediction by Partial Matching (PPM) appear promising, practical results have thus far been disappointing. This outcome can be attributed to the fact that practical implementations ultimately make compromises in order to reduce overhead. These compromises limit the level of complexity, variety, and granularity used in the policies and mechanisms the implementation supports.

This thesis examines the use of idle workstations to implement prediction-based prefetching. We propose a novel framework that leverages the resources in a system area network to reduce I/O stall time by having an idle node prefetch non-resident pages into a target node's memory. This configuration allows prediction to run in parallel with a target application. We have implemented a revised version of the GMS global memory system, called GMS-3P, that provides parallel prediction-based prefetching. We discuss the different algorithms we have chosen and the policies and mechanisms used to control the quality of predictions. We have also implemented a low-overhead mechanism to communicate the history fault trace between the active node and the prediction node.

This thesis also explores the needs of programs whose access patterns cannot be captured by a single configuration of PPM. The dilemma associated with conventional prediction mechanisms that attempt to accommodate this behaviour is that varying the configuration adds overhead to the prediction mechanism. By moving the prediction mechanism to an idle node, we were able to add this functionality without compromising performance on the application node. Our results show that for some applications in GMS-3P, the I/O stall time can be reduced by as much as 77%, while introducing an overhead of 4-8% on the node actively running the application.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Overview and Motivation
  1.2 Goals of Parallel Predictive Prefetching in GMS
  1.3 Issues for Prediction based Prefetching
    1.3.1 Overhead of Predictive Prefetching
    1.3.2 Support of Multiple Prediction Algorithms
    1.3.3 Decisions on What to Prefetch
  1.4 Contributions of this thesis
  1.5 Organization of this thesis

2 Background
  2.1 Global Memory System
  2.2 Prefetch in Literature
    2.2.1 Background
    2.2.2 Prefetching Through Hints
    2.2.3 Prefetching Through Predictions
  2.3 Prediction by Partial Matching
    2.3.1 The Algorithm
    2.3.2 Temporal and Spatial Locality
    2.3.3 Other PPM Variants
    2.3.4 Limitations of PPM in Practice
  2.4 Using Idle Workstations

3 System Design and Implementation
  3.1 The Big Picture
  3.2 Architecture of the Active Node
    3.2.1 Tracing Fault Stream
    3.2.2 Handling Prefetched Pages
  3.3 Architecture of the Prediction Node
    3.3.1 Prediction by Partial Matching
    3.3.2 Multiple Prediction Algorithms
    3.3.3 Tracking the Active Node's Memory Content
    3.3.4 Decisions on What to Prefetch
    3.3.5 Data Maintained on the Prediction Node
  3.4 Architecture of the Other Nodes
  3.5 Multiple Prediction Nodes
  3.6 Trace Communication and Collection
    3.6.1 Granularity of the Trace Information
    3.6.2 Alternate Trace Collection
    3.6.3 Implementation in GMS-3P
  3.7 Design Tradeoffs and Limitations
    3.7.1 Prefetching From Global Memory into Local Memory
    3.7.2 Having Prefetched Pages as Duplicated Pages
    3.7.3 Using the GCD Node to Track the Active Node's Memory
    3.7.4 Keeping the History of Prefetched Pages

4 Performance Evaluation
  4.1 Experimental Setup
  4.2 Microbenchmarks
    4.2.1 Access Latency
    4.2.2 Prediction Latency
    4.2.3 Comparison of gms_getpage in GMS and GMS-3P
  4.3 Performance of the Synthetic Applications
  4.4 Application Performance
    4.4.1 LU Decomposition
    4.4.2 OO7 Benchmark

5 Conclusions and Future Work
  5.1 Summary
  5.2 Future Work

Bibliography

List of Tables

4.1 Latency of page access
4.2 Prefetch latency
4.3 Comparison of different-order, different-depth PPM predictions
4.4 The statistics for LU decomposition with different configurations of PPM
4.5 The effect of the hysteresis accuracy control in OO7

List of Figures

2.1 The trie for ABC
2.2 A Prediction Example
3.1 The System Architecture of GMS-3P
3.2 Getpage control flow in GMS-3P
3.3 Handling prefetched pages on the active node
3.4 Tracking the active node's memory from the prediction node
3.5 Data maintained on the prediction node
3.6 Sender and Receiver Rings in GMS-3P
4.1 Detailed timeline of GMS-3P operations during a prefetch process
4.2 Latency of gms_getpage in GMS and GMS-3P
4.3 Comparison of sequential access between GMS and GMS-3P
4.4 Comparison of the prediction accuracy histograms of different PPM configurations
4.5 Comparison of I/O wait time in OO7 between GMS and GMS-3P

Acknowledgements

As I write this acknowledgment, I can hardly believe that I have made it through.
There were times I was very frustrated with myself, but I owe many thanks to my supervisor, Dr. Mike Feeley, who has stood by me through all this, and offered me countless hours of help and encouragement. His support and ideas have been invaluable to me. I also want to thank Dr. Norm Hutchinson, not only for reading my thesis while he is still on leave, but also for his support and trust over the last two years. Thanks to Dr. Mark Greenstreet for his initial suggestions about this work. I owe this thesis to my dear grandparents and parents. During my early years of living with my grandparents, they have planted a seed in my heart to pursue my studies. My parents gave me the opportunity to have an education and a desire to continue on. Without their encouragement, teaching, and love, I cannot imagine how I could have reached this stage. Many of my teachers have made my education. I am so grateful to have met so many great people to show me the way in different phases of my life. It is the teachers I admire that give me the inspiration and strength to go on. I also want to give thanks to many of my friends, who have shared this journey with me in many different ways. I am blessed to have such wonderful friends, who never fail to give me support when it is mostly needed. Many people in the CS department helped make the last two years an unforgettable experience. In particular, I want to express my deep thanks to everyone in the DSG lab: Yvonne Coady, for her cheers throughout rains and sunshine, and also for doing many of the figures in this thesis; Joon Suan Ong, for numerous discussions about GMS and non-GMS related issues, and for writing the light weight trace communication module; Ashley Wijeyeratnam, for sharing the lab with me for the longest time; Douglas Santry, for his jokes and keen knowledge of computers and non-computer related fields; Alex Brodsky, for his encouragement and for proof-reading my thesis draft; and Jan Pedersen, for being part of the lab.  JASMINE YONGQI W A N G  to My Grandparents and Parents  Chapter 1  Introduction 1.1  Overview and Motivation  Processor speed is increasing at a faster rate than storage systems; the running time of many applications can not improve with the processor speed if they need to access storage systems often. One way to reduce running time of data-intensive programs is to use parallel disks, such as RAID [PGK88], employing parallel disk I/O to increase the bandwidth of the underlying storage systems. Caching recently used data can reduce the I/O overhead if the application's workload is small or has high locality, in which case the working set of the application fits into main memory [Den68]. This is not effective, however, for many applications that access a large amount of data; prefetching is another way to reduce storage access time. Prefetching is a technique to improve the performance of data-intensive applications by reducing the number of memory misses. It addresses both the temporal locality, which is partially taken advantage by caching, and the spatial locality access characteristics of the program. If data is transferred on demand when it is accessed, the I/O system is unable to optimize disk transfers and thus cannot deliver anything approaching peak disk throughput; furthermore applications are unable to overlap computation with disk latency. 
If on the other hand, the system is able to fetch data before it is accessed, disk reads can be grouped together for better disk throughput and data  1  can be delivered to memory before it is needed, thus hiding latency and improving application performance. Key to the success of any prefetching technique is the ability to know or predict data accesses before they happen. Existing techniques fall into two categories. First, applications can be instrumented to provide hints  [VAK+98, PGG+95, TPG97, CFKL96] to the system about  future accesses. Unfortunately, adding hints to an application adds significant burden on the programmer, particularly if access patterns are complex or data-dependent. Second, the reference history of an application can be used to predict future references. This is the main technique used by most commercial file systems; for example, most UNIX file systems automatically detect sequential access and when detected, start prefetching data sequentially ahead of the referencing program. Detecting more complex access patterns is, of course, more difficult, but significant progress has been made from a theoretical perspective [CKV93, VK96]. The idea of using predictive methods to prefetch originates from pattern detection. The problem of detecting patterns from an access stream for the purpose of predicting a future access is similar to data compression [CKV93], so we could use compression algorithms to predict future accesses. To make use of prediction-based prefetching, the system must be able to predict future references accurately and efficiently. The key limiting factor in traditional approaches for predictionbased prefetching is prediction overhead, in terms of computation and memory usage. Increasing prediction accuracy slows the target application, because in these systems all of the overhead is borne by the same node that is running the application. This thesis describes our solution to this dilemma. We move the prediction process from the application node to an idle workstation in the network. Exploiting idle workstations for hosting guest computation  [ADV+95, AES97, AS98, CNSW97], or hosting memory for other nodes  [DWAP94, FPK+95} has drawn increasing interest in the recent past. We introduce a framework to employ both the computational power and memory facility of an idle workstation in the System Area Network (SAN) to perform prediction and prefetching for an active node which runs I/O intensive applications. 2  By using an idle workstation, we are able to increase prediction-algorithm complexity, execute multiple algorithms in parallel, or refine trace-data granularity without adding overhead to the target application. Currently we have used two configurations of the P P M (Prediction-by-PartialMatching) algorithm. P P M is a series of Markov models of the same and less orders. models are finite-context models, which are based on finite-state machines  Markov  [BCW90]. They have  a set of states 5,-, and a set of transition probabilities P{j that give the probability that when the model is in state Si it will next go to state Sj. We will discuss this in detail-in Section 2.3. We.have implemented our approach as part of the Global Memory System (GMS) [FPK+95]. G M S is a kernel-integrated facility that manages the memory of a workstation cluster as a global resource, paging data to and from remote memory and implementing a global replacement policy based on the Least Recently Used (LRU) policy. 
Our system, called G M S - 3 P ( G M S with Parallel Predictive Prefetching), extends G M S to include parallel predictive prefetching. G M S - 3 P captures access trace data on the application node and forwards them to a designated idle node that implements the prefetch policy. This node may forward the trace to other nodes if multiple policies are being computed in parallel. The prediction nodes then prefetch the data from global memory. If the page is. resident in global memory, it is transmitted to the application node. Subsequent access on that page will result in a prediction hit, which reduces the latency of servicing the page fault.  1.2  Goals of Parallel Predictive Prefetching in GMS  The goals of this thesis are: 1. Minimize the overhead involved in predictive prefetching by building up a framework to employ idle-workstation resources; 2. Identify the benefits of using higher order and deeper depth as the parameters to the P P M algorithm and the benefits of using different algorithms to capture the access pattern phase changing of an application;  3  .3. Verify'that certain mechanisms are needed to control the prefetch requests to be issued upon potentially good predictions; The first goal is to make the overhead of making predictions, issuing prefetch requests and getting prefetched pages as low as possible. By moving most of the prefetch process to another idle node in the cluster, we will show that this goal is achieved. The second goal relates to making accurate predictions. By using multiple algorithms it is possible to adapt to the access phase-changes of an application. Using higher order and deeper length for the P P M algorithm, we have a better chance of making the right predictions. The third goal is to discern good prefetch candidates from the valid predictions. Employing multiple prediction algorithms and using higher order deeper length P P M leads us to more predictions. We use hysteresis to monitor how well a prediction algorithm is doing to decide what to prefetch among the predictions made by the different algorithms. This allows the use of multiple algorithms in parallel where each performs well on only a subset of access patterns.  1.3  Issues for Prediction based Prefetching  This section introduces the key issues that must be addressed to achieve our goals. First, we discuss the overhead of prediction based prefetching. Second, we discuss the support for multiple prediction algorithms to capture the different access patterns. Third, we discuss the method of deciding what to prefetch on behalf of another node. The discussion includes references to the sections of this thesis where each issue is further addressed.  1.3.1  Overhead of P r e d i c t i v e Prefetching  Theoretical and simulation results have been very promising [CKV93, VK96] on using compression algorithms to prefetch, but practical results have thus far been disappointing. In [BKL+99], Batels et al. report that the order-one Markov predictor only improves the performance at most by 2% for some applications. We believe that prediction overhead is one of the key obstacles to improving  these practical results. First of all, prediction algorithms tend to be computationally and memory intensive, particularly as the algorithm's predictive power is increased. 
Improving a Markov-based predictor to detect strings of length three instead of length two, for example, may increase prediction time by a factor of two and increases memory consumption by a factor equal to the number of data pages in the file accessed by the target program.  Overhead thus limits the level of acceptable  complexity, and consequently the predictive power, of the algorithm. We propose to use an idle workstation in a cluster to make predictions on behalf of a target node which runs I / O intensive applications. Thus, the target node only needs to send its fault stream to the prediction node. We will show the quantitive difference between making predictions and communicating the fault trace in Section 4.2.2.  1.3.2  Support of Multiple Prediction Algorithms  One limitation of a single predictive method is that an algorithm will only be able to model one set of access patterns. Programs that exhibit multiple access patterns that are suited to different prediction algorithms may sacrifice many good predictions if there is only one algorithm. Two common techniques, for example, could be to base predictions on page addresses or on the differences between subsequent page addresses. The first technique is effective for detecting temporal locality that occurs when a program accesses the same set of pages repeatedly. The second technique is effective for detecting spatial locality that occurs when a program accesses data in a stylized way, for example when striding through an array. The key point is that many programs exhibit multiple behaviours over time and that different algorithms are specialized for each different behaviour. For the best prediction accuracy it would be useful to run multiple algorithms in parallel to catch the behaviour-shifts of the running application. By moving the prediction process to another node, we are able to run two algorithms modeling the different patterns an application may have and not introduce more overhead on the active node. We will discuss the two algorithms in Section 2.3.1 and-show that this scenario exists in Section 4.4. 5  :  1.3.3  Decisions on W h a t to Prefetch  Both Section 1.3.1 and Section 1.3.2 have introduced new challenges for deciding what to prefetch. The prediction node must know what is currently cached on the application node to avoid overhead on redundant prefetches. Traditionally, the quality of the predictions are controlled by a threshold parameter, only possibilities higher than the threshold would be considered prefetch candidates.; This is not sufficient when using more algorithms. Different algorithm may predict different candidates which are all above the threshold. Thus, the number of predictions will possibly go up linearly with the number of prediction algorithms employed. To avoid these thought-to-be-correct predictions, we use a hysteresis accuracy threshold to combine the predictions with the knowledge of current access patterns and thus reduce the risk of hurting the performance. We discuss the solutions in Section 3.3.4.  1.4  Contributions of this thesis  This thesis makes the following contributions: 1. It presents a design of a parallel prediction based prefetching system. 2. It describes a complete implementation of GMS-3P as part of the GMS global memory system. 3. It analyzes the performance of this system.  1.5  Organization of this thesis  The remainder of this thesis is organized as follows. Chapter 2 provides the necessary background and reviews related work. 
We then describe the design and implementation of our GMS-3P system in Chapter 3. Following that we present the performance of our prototype and analyze the system. in light of the performance. Finally we summarize and conclude in Chapter 5.  6  Chapter 2  Background This chapter provides background information for this thesis. First, we briefly introduce Global Memory System (GMS) [FPK 95, Fee96], which is the foundation of this work. Second, we review +  the current approaches utilizing the prefetch technique in various situations. Third, we discuss the algorithm employed and the specific parameters involved. Fourth, we introduce the idea of using an idle workstation in the system area network to preform the prediction and prefetching on behalf of the active node.  2.1  Global Memory System  The Global Memory System [FPK+95, Fee96] is a kernel integrated facility that manages the memory of a workstation cluster as a global resource, paging data to and from remote memory, and implementing a global replacement policy based on the Least-Recently-Used (LRU) policy. The goal of GMS is to use global memory to increase the average performance of all cluster applications. It adds a.new layer to the traditional memory management routines and affects all three key memory management sub-policies: fetch, placement, and replacement [MBKQ96]. GMS uses a single, unified, but distributed memory management algorithm at the lowest level of the operating system so that all system-and higher-level software, including virtual memory  7  system, file system, transaction systems, and user applications can benefit from available network memory. When the system replaces a local-memory page, it uses an efficient distributed algorithm to check the age of other pages in the network. If older pages exist elsewhere, the local page is forwarded to the node with one of the globally-oldest pages where it replaces that page; a reload of that page may therefore avoid a disk read and occur much faster. When the system handles a page fault, it uses the G M S global directory to determine whether the desired page is stored on a remote node. If the page is located, either because of a previous page out or because the page is shared, G M S transfers the page to the requesting node, thus avoiding a disk access to read the page. G M S identifies a page in the system with a 128-bit network wide unique ID (UID), which is determined by its location on disk. Specifically, the UID consists of the IP address of the storing node, device number of the storage system, inode number, and page offset within the inode. G M S locates a page via a "distributed directory database. The key data structures of the managing database are: • a per-node page-frame-directory ( P F D ) that describes every page resident on that node. A successful UID lookup in the P F D yields information about the physical page frame containing the data. • the global-cache-directory ( G C D ) , a distributed cluster-wide data structure that maps a UID to the IP address of a node P F D which caches that particular page identified by the UID. • a replicated page-ownership-directory (POD) that maps a page UID to a manager node (storing the G C D entry) responsible for maintaining location information about that page. The P O D is only changed when machines are added to or deleted from the cluster. The two most important operations of G M S are getpage and putpage. 
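To make the directory structures just described concrete, the sketch below shows one plausible way the 128-bit UID and the PFD/GCD/POD chain could be laid out in C. The struct layouts, field names, and the hash inside pod_lookup are our own illustration, not the actual GMS source; only the fields themselves (IP address, device number, inode number, page offset) come from the description above.

    /*
     * Illustrative sketch of the GMS naming and directory structures;
     * layouts, names, and the hash are assumptions, not the GMS source.
     */
    #include <stdint.h>
    #include <stddef.h>

    /* 128-bit network-wide unique page ID (UID). */
    struct gms_uid {
        uint32_t ip_addr;     /* IP address of the node storing the page on disk */
        uint32_t device;      /* device number of the storage system             */
        uint32_t inode;       /* inode number                                    */
        uint32_t page_offset; /* page offset within the inode                    */
    };

    /* Page-frame-directory (PFD) entry: UID -> local physical frame. */
    struct pfd_entry {
        struct gms_uid uid;
        void          *frame;  /* physical page frame caching the data    */
        unsigned long  age;    /* age used by the global LRU replacement  */
    };

    /* Global-cache-directory (GCD) entry: UID -> node whose PFD caches it. */
    struct gcd_entry {
        struct gms_uid uid;
        uint32_t       caching_node_ip;
    };

    /*
     * Page-ownership-directory (POD): maps a UID to the manager node that
     * stores the GCD entry.  A simple hash over the UID stands in for the
     * replicated table here.
     */
    static uint32_t pod_lookup(const struct gms_uid *uid,
                               const uint32_t *managers, size_t nmanagers)
    {
        uint32_t h = uid->ip_addr ^ uid->device ^ uid->inode ^ uid->page_offset;
        return managers[h % nmanagers];  /* IP of the GCD node for this UID */
    }

A faulting node would use something like pod_lookup() to pick the GCD node for a page and ask it which PFD, if any, currently caches the frame; the getpage and putpage flows described next follow exactly this chain.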
On a getpage request, G M S maps the UID using the P O D to find the G C D node for that page and sends a request to that node; the G C D node receives that request, checks whether the page is cached in  8  the network. If found, a message is sent to the cache side and the page is transfered to the original requester; if not, a message is sent back to the requester which then reads the page from disk. On a putpage operation, the node performing the putpage picks a storing PFD node using the global distributed age information to find the host for a page, older than this one, sends the page to that node and updates its own PFD status and the GCD node which stores the information for this page. If the nodes fails to locate any page older than the current one, it discards the page and sends a message to GCD node regarding this action. Adding prefetching to GMS is almost like adding another layer to the memory subsystem. Between the directly accessible local memory and global memory configuration, there is a layer that stores prefetched pages in local memory. At the time of a V M page fault or file system cache miss, the prefetch layer is first checked to see whether it has the desired page. If the page has already been fetched in, the fault is serviced; otherwise, it follows the original GMS get page routine to service the fault.  2.2 2.2.1  Prefetch in Literature Background  The memory management system is a central component of any operating system [MBKQ96]'. The memory resources available to a machine are typically layered in a hierarchical fashion, with access time inversely proportional to their proximity to C P U . Traditionally, main memory and secondary storage—local disk are the main components of the memory system; data are moved from disk to memory when needed and from memory to disk as the permanent store. Network memory comes between these two because new technology advances have made network transfers faster than disk reads. Remote disk (e.g., NFS) is the furthest from the CPU, since it includes both a disk read and a remote transfer. The goal of any memory management policy is to optimize the placement of data at the right level of this hierarchy to minimize access time by eliminating unnecessary disk and remote-memory accesses, thus achieving the best possible performance. 9  Normally, accesses on non-resident-pages from main memory are fetched-on-demand from the higher level of the memory system. The processor either stalls on the current faulting process or switches to another process while it waits for the page fetch. A demand-fetch system can not supply anything more than the disk throughput, and sometimes, less than the disk throughput when the I/O system can not saturate the bandwidth of the underlying disk, e.g., when parallel disks are employed. To alleviate the disk access latency, prefetching is introduced to fetch pages, other than the one currently causing the page fault. If we can successfully prefetch the desired non-resident pages, the subsequent accesses would be faster because the. I/O is. overlapped with computation. With the increasing discrepancy between the disk access time and processor speed, a considerable amount of research has been conducted on prefetching the desired future references into the lower memory hierarchy to reduce I/O stall time. 
Prefetch is a well understood technique and has been utilized in many areas including microprocessor architectural designs [CB92, MG91, TE93], databases I/O [CKV93, PZ91], and file systems [PGS94, PG94, KL96, LD97]. More recently, there have been active research on hiding Internet latency by prefetching in the Web context, including prefetching between Web servers and browser clients, prefetching between Web servers and caching proxies, and prefetching between proxies and browser clients [KLM97, PM96, FCLJ98, JK97]. Key to the success of any prefetching technique is the ability to know or predict future references. There are two main approaches targeting this problem. First, applications can be instrumented to provide hints to the system about future accesses. Second, the reference history of an application can be used to predict future references. There have been a few research projects focusing on prefetching either through hints of future references or speculations on such information. Below we detail systems that exploited either of these techniques.  2.2.2  Prefetching Through Hints  Patterson ef al.'s Transparent Informed Prefetching (TIP) [PGS94, PG94, PGG+95] uses hints from an application to prefetch file data. They point out that better performance can be achieved by hints 10  to parallel disk arrays because of greatly overlapped disk I/O [PG94}. For I/O bound applications, simply utilizing prefetch to overlap computation with I/O has no significant effects. They advocate the disclosure of application knowledge of future accesses to enable informed prefetching and caching [PGG 95]. They estimate the cost and benefit of prefetch decisions to handle the interactions +  between cache replacement and prefetching. Cao et al. [CFKL96] considered the integration of prefetching, caching and disk scheduling in the presence of hints for future references. The difference of the above two is that TIP also uses hints to infer good replacement candidates while Cao uses application controlled cache management policy. Cao et. al. considered their approach would be better when the hints for the reference stream are wrong but the replacement policy is correct. GMS-3P differs from the above systems in two ways. First, GMS-3P is built on top of GMS, thus it may possibly benefit all user level applications. Second, GMS-3P does not need to know about the future reference pattern or the replacement policy, the prefetching decisions are made automatically. Voelker et al. developed PGMS  [VAK 98], a system using cooperative prefetching and +  caching in the presence of hints of future reference demand on top of GMS. They have implemented prefetching from disk to global memory, from global memory to local memory and from disk to local memory. PGMS requires a knowledge of future reference stream hinted by the applications. Another approach of using prefetching by hints is proposed by Steere and Satyanarayanan [SS95]. They present a general abstraction called dynamic  sets  to allow parallel prefetching of  objects that will be required in the near future. It is possible, for instance, to load a web page then specify a dynamic set containing all links on the page. The server could then prefetch the links in parallel. Mowry et al.  [MDK96] present a compiler to produce executables. that, automatically  generate prefetch requests for data that will be accessed in the near future. 
The compiler is designed for scientific codes, and performs static analysis to uncover many common access patterns, including, complex strided accesses. Those informed approaches have an advantage over the speculative ones in that prefetching 11  is driven by certain knowledge provided in advance by higher levels, rather than by induction made after snooping. There is no danger that incorrect predictive prefetching decisions might trash the cache and introduce unbearable overhead to the whole system. On the other hand, it is not trivial for most of the applications to disclose their future references, reference patterns or the replacement policy. Recoding the existing applications and getting the reference patterns right place substantial workloads on the programmer. 2.2.3  Prefetching T h r o u g h P r e d i c t i o n s  The intuition of using predictive methods to prefetch future references is based on the observation that certain reference patterns appear in the execution of the program. The question is how to model these patterns and thus to predict future references efficiently and effectively. The original research began when Gindele and Smith introduced the sequential one-block lookahead strategy [Smi78, Gin77]. Many Unix file systems incorporate prefetching by using read ahead. Once the system detects that a file is being read sequentially either backward or forward, it starts prefetching in the same direction. This approach can substantially boost performance, but it is unfortunately limited to benefit applications that make sequential references to large files. Griffioen and Appleton's work, automatic prefetching [GA94, GA96], uses probability graphs to predict future file references based on past file requests. They use the open call to gather file access traces and construct a dynamic graph connecting opened files within a lookahead period: In their graphs the files are nodes and the edges between nodes represent sequential (or nearly sequential) accesses or invocations. They argue that other events such as read, write, seek are too frequent to be feasible of monitoring. They use likelihood for consecutive file accessing and a threshold value to filter predictions. Tait and Duchamp [TD91] propose an approach that prefetches all the files in an application's working set as soon as the specific instance of the application being executed can be uniquely identified. Their algorithm employs a working graph, recording file activity (working sets) with the graph. As soon as a unique pattern has been recognized within the trees, an attempt is made to 12  prefetch the entire working set. Later, Lei and Duchamp [LD97] use a similar approach based on access trees that include both the current working trees and the pattern trees (previous working trees). They base their predictions on history references of whole files and prefetch files that match the pattern, but they only prefetch the initial portion of the files. Among, the different approaches been examined, Vitter et al.  [VK96] proposed that the  algorithms used to compress data, such as Prediction-by-Partial-Matching (PPM), are suitable for predicting future references. P P M are a set of Markov models capturing the state transitions in a finite-state machine. Each transition is labeled with an identifiable state, and no two transitions out of one state have the same label. Thus any given sequence of states defines a path through the model, and this path is unique if it exists. 
Fitting into predictive prefetching, the states could be used to represent the identities in the prediction bases, e.g., accessed files or pages; the transitions capture the relationships between those identities. Kroeger and Long [KL96] have used P P M to predict file access patterns. They treat the file open operation as an event and compose tree-based data structures with these events. They evaluate their work with simulation based on Sprite file system traces [BHK 91] and conclude +  that this method is very beneficial. However, many questions were left unanswered. The key issue they did not address is the timing factor. Though the above listed systems all use prediction based approach to investigate prefetching, they have based their interests on the granularity of whole file while GMS-3P uses page information as the prediction history base. The prediction algorithms also vary, we will get into more details in Section 2.3.1. Palmer and Zdonik [PZ91] use predictive prefetching in databases. They propose Fido, a cache manager that employs an associative memory technique to predict future accesses in a particular isolated context. They show some simulation results suggesting that predictive prefetching . :  within the database domain is a promising approach. Joseph and Grunwald built a Markov prefetcher as an interface between on-chip.and off-chip cache-of the memory system. They use a one level P P M algorithm to predict multiple references' 13  and prioritize those predictions. They also add specific hardware to achieve in-time predictions. Alexander et al. predict and prefetch from D R A M to SRAM on the chip [AK96]. Chen et al. use Markov based predictors to predetermine control flow at branch sites [CCM96]. One difference with GMS-3P is that we prefetch from global memory into local memory. More importantly, the sources to make predictions are different and they use an order-one predictor. Batels et al.  [BKL+99] also investigate using Markov predictions to prefetch in GMS  based on past fault history. They conclude that for some scientific applications, Markov order-one .. predictions are more accurate than Markov order-two. Liberatore [Lib99] studied the validity of order-one Markov chains as a model for page reference patterns and showed that order-one Markov chains are not a statistically valid model for instruction page traces. For these traces, he shows that Markov order-two predictions achieve better prediction rates than Markov order-one predictors. The difference is that Batels et al. look at faulted data pages while Liberatore consider full instruction pages access streams. In the context of the Web, many approaches have used predictive methods to send the . possible requesting candidates to the clients side before clients actually request the pages. Fan et al.  [FCLJ98] proposed to prefetch between caching proxies and clients by proxy-initiating.  They observe patterns from past accesses from all the clients under the proxy to predict the future accesses of individual clients. Their algorithm is P P M with three parameters, the order, the look ahead distance, and the threshold, which is very close to the algorithm we used. It is not clear how they manage the connection between different orders. Bell et al. [BCW90] suggested using vine pointers to save search time and we find it essential to assure efficient predictions, which is very critical in GMS-3P. Padmanabhan et al.  
[PM96] use dependency graph which is very similar to  Griffioen et al.'s algorithm [GA96] to predict future client requests between the Web clients and the Web server. The server makes predictions for clients and checks with clients whether to initiate the prefetch or not.  14  2.3  Prediction by Partial Matching  The idea of using compression algorithms to predict future accesses originated with Vitter et al. [CKV93, VK96]. The intuition is that if data compressors can compress data successfully, then its probability distribution on the data must be realistic and can be used for effective predictions. This insight has led to several prediction techniques based on compression algorithms such as P P M , Huffman coding [BCW90] and Ziv-Lempel compression [ZL78]. In each case, the assumption is that program access traces can be modeled by a compression process. If true, then a suitably configured compression model should be able to anticipate the partial strings given the ones currently seen in the stream. Borrowing this idea to fit in predictive prefetching, we use the compression algorithms to detect the patterns existing in the reference stream and complete them by prefetching the remainder of the matched string. Previous work has shown that, in practice, P P M provides the best compression and thus the best prefetch predictions [VK96]-. Some of the other work employing P P M also show satisfactory results [FCLJ98, KL96, JK97]. For this reason, we have chosen P P M as our prediction algorithm in our GMS-3P prototype implementation. It is important to point out, however, that our parallel prefetching framework could be used with any algorithm that uses a program's access history to make prefetch predictions.  2.3.1  The Algorithm  The P P M predictor we have implemented is closely based on the  prediction-by-partial-matching  compressor described by Bell et al. [BCW90]. The algorithm processes the online page fault traces of a program to build a set of Markov predictors for that trace and then uses them to predict the next likely accesses. Each Markov predictor organizes the trace into substrings of a given size and associates probabilities with each string that corresponds to its prevalence in the access history. Given an input history of A B C A B D A B C , for example, an order-two Markov predictor that stores strings of length three, would record the fact that the string AB is followed by C with probability  15  2/3 and by D with a probability of 1/3. The order-two P P M predictor will also record the fact that for order-one Markov prediction, B has a 100% following A , C has a 2/3 probability following B and D with probability 1/3 to follow B. P P M is a set of Markov predictors with the same order or less, so that P P M order-two predictions will include both order-two and order-one Markov predictions. At each step of the algorithm, P P M receives information about the program's most recent access and it updates the Markov predictors accordingly. Then it attempts a partial match with the number of references from recent accesses equivalent to the prediction order against the Markov predictors. If a match is found, the predictors provide a list of accesses that have followed the history string in the past along with their probabilities. An access with sufficiently high probability is considered a prefetch candidate; we use a threshold to filter the low probability predictions. 
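The probability behind this threshold test is computed from occurrence counters kept in the predictor's data structure (described in the next subsection). Restated in our own notation rather than the thesis's, a continuation x_1 … x_d of a matched context s is kept as a prefetch candidate when

    P(x_1 … x_d | s) = C(s·x_1 … x_d) / (C(s) − 1) ≥ t,

where C(·) counts how often a sequence has occurred and the −1 discounts the occurrence created by the current access; deeper candidates are only examined beneath shallower ones that have already passed the threshold, as the prediction example later in this section illustrates.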
In the next step, the algorithm checks the current contents of the active node's memory for the prefetch candidates. Any candidates that are not currently in memory are valid prefetch requests. The last step applies a hysteresis accuracy rating to decide whether or not to issue the prefetch requests; this is discussed in detail in Section 3.3.4.

Parameters

The PPM algorithm has three parameters:

• o: order is the length of the history substring that the algorithm uses to find a match;

• d: depth is the number of accesses into the future the algorithm attempts to predict;

• t: threshold is the minimum probability an access must have in order to be considered a prefetch candidate.

A PPM of order o and depth d consists of o + d Markov predictors of order i, where 1 ≤ i ≤ o + d. Here we do not consider the order-zero Markov predictor, which is a variation of the most-frequently-used policy, because it simply counts the number of references and chooses the one with the highest appearance frequency as the prediction. This is the opposite of our intent of using the access contexts to make future reference predictions.

Figure 2.1: The trie for ABC.

A Markov predictor of order o, depth one, is a trie of height o + 1 (we will discuss tries in detail in the following section). Starting at the root, there is a path in the trie for every string in the input stream of length o + 1 or less. The height of the trie needs to be o + d if we are going to make order-o Markov predictions with depth d. For instance, to be able to make order-three, depth-three predictions, the height of the trie needs to be six. The first three levels are used to match for the third-order predictions, the other three to predict the next three accesses. We use the threshold to choose the higher-probability successors as the predictions. We calculate the probability by dividing a node's reference count by the number of occurrences of its prediction order minus 1.

The Data Structure

The PPM algorithm of order o and depth d (PPM(o, d)) maintains a data structure that keeps track of the number of times the d access traces follow one access trace, a sequence of two accesses, and up to a sequence of o access traces; thus the longest sequence it keeps is o + d. The data structure is a collection of trees, with a counter recording the number of accesses of each sequence, which is recognized as a trie. A common start node works as the root of the history access trace, under which are the accesses that have appeared at least once. We call this the first-level access trace, which is used to match for order-one predictions. Every node has a counter denoting how many times this access has occurred. Directly after the first level are the pages that follow each level-one page immediately. The third level encodes the third access of all strings in the trace. When making predictions, the probability of a certain sequence is determined by dividing the last node's appearance counter by the number of occurrences of the current order minus 1, since this 1 is increased because of the current appearance [FCLJ98]. The PPM predictors are represented and updated simultaneously using a single trie with vine pointers, as suggested by [BCW90]. For every string of length l in the trie, a vine pointer links the last node in the string to the same-labeled node of the string of length l − 1, formed when the last character of the string of length l is added.
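As a minimal sketch (our own field names, not the GMS-3P source), one node of this trie, with its occurrence counter and vine pointer, could be declared as:

    /*
     * Sketch of a PPM trie node as described above.  The field names and
     * the sibling-chain representation of children are illustrative only.
     */
    struct trie_node {
        unsigned long     page;     /* page identifier labelling this node         */
        unsigned long     count;    /* occurrences of the sequence ending here     */
        struct trie_node *children; /* pages seen immediately after this sequence  */
        struct trie_node *sibling;  /* next child of the same parent               */
        struct trie_node *vine;     /* same page, one order lower; NULL at level 1 */
    };

    /* Step from an order-k match to the corresponding order-(k-1) match. */
    static struct trie_node *next_lower_order(struct trie_node *match)
    {
        return match ? match->vine : NULL;
    }

The children hold the pages seen immediately after the sequence ending at a node, and repeatedly following vine walks the same access down through successively lower orders.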
Figure 2.1 shows a very simple trie: the nodes represent accesses and the straight arrows show the access sequence, while the curved arrows denote the vine pointers. For example, given the string ABC, C's vine pointer points to C in the string BC; and in the sequence BC, C's vine pointer points to C in the first-level access, which then points to null. Using vine pointers, we reduce the prediction time because we only need to search for the highest order's root, and then use the vine pointers to locate the lower orders' sequences. Locating an access by searching becomes a comparatively slow process as the history net grows. After we have located the accessed page, we can use its vine pointer for lower-order predictions until there is no access following that page, namely from order m down to order one. As in the example shown in Figure 2.1, if the most recent three accesses are ABC, to make predictions we first check order three by matching ABC, which is done by searching for A in the first level of the history net; once ABC has been found and predictions have been made, we are ready to make order-two predictions. We simply follow C's vine pointer to move up one level, which is equivalent to searching for B in the first level and then locating BC. It is much faster to do a pointer traversal than to search. Once we are done with order two, we continue to follow C's vine pointer to make the order-one predictions. The vine pointer of nodes in the first level points to null, so we end the traversal and the prediction process is finished.

Operations on the Data Structure

The trie is updated every time a new referenced page is received. For PPM(o, d) predictors, the update involves searching for the nodes within sequences of length o + d − 1, updating reference counters and a series of pointers, and possibly adding new nodes to the forest of tries. For instance, suppose the currently accessed page is A and the last five accesses are CDFGB (here o + d is 5). We first check whether A has appeared in the past. If so, it should have an entry in the first level of the net; if not, we add a new node to the first level. Then we search for B in the first level, and we are certain it should be in the history trace, since it was added just one step before, when the trace was CDFGB. After locating B, we check whether A is among the children of B; if A has appeared after B in the past, this should be the case and we only need to increase the counter of A; if A is not found, we need to add A as a new child of B and point its vine pointer to the A we just added to the first level. We continue on to search for G in the first level and locate the GB sequence, as these are the ancestors of A; then we do the same thing to the GBA sequence as we did to the BA sequence. This is done up to A's (o + d − 1)th ancestor, in this case until we update the DFGB sequence. Thus we keep the height of the trie at o + d. Currently we keep all the history references that have ever been seen for an application. The size of the history structure could be controlled by deleting the least recent accesses or by some other mechanism, as suggested in [GA96].

There are two reasons for our decision to locate the highest order for the most recent o accesses first. First, we favor higher-order predictions over lower-order ones; if the prefetch requests can be satisfied by the higher-order predictions, we do not need to make lower-order ones.
Second, to employ the vine pointers after we locate the highest order, we only need to traverse the last node's vine pointer to go up one level, as opposed to searching independently several times in a possibly huge access history trace. Once the sequence has been located, we take the accessed pages with probability higher than our predefined threshold t. As we shall see in the prediction example, there can be repeated predictions from different orders and depths. We add the probabilities of the same predictions together, so it is possible for a probability to be greater than 1. We then sort the predictions in descending order.

A PPM Prediction Example

Here we use an example to illustrate how the predictions are made using the parameters and the data structures discussed above. Suppose that we are making order-three, depth-two predictions with a threshold of 0.5. Figure 2.2 is a snapshot of the data structure maintained for the access stream ABCMN . . . BCDF . . . CDE . . . ABCMN . . . ABCDE . . . ABC, which is a set of Markov predictors of order one to order four. The number close to a node denotes the number of occurrences in the trace of the sequence ending with that node; a solid arrow denotes the access sequences within the trie; and the dashed arrows denote the vine pointers. The black node in the middle denotes the common start point, which represents the level-zero access. Adjacent to that node are the first-level access traces.

Figure 2.2: A Prediction Example. This figure is a snapshot of the data structure maintained for the access sequence ABCMN . . . BCDF . . . CDE . . . ABCMN . . . ABCDE . . . ABC. The nodes denote the pages used in the system. Solid lines with arrows show the access sequence, while the dashed lines are the vine pointers. The number close to a node is the number of occurrences of the particular sequence, e.g., ABC has appeared four times, the ABCD sequence has appeared once, while ABCM appeared twice. The probability is calculated as ref_i / (ref_{i-d} − 1), where i is the current level in the access trace and d is the current prediction depth.

The algorithm uses the three most recent accesses to predict the next two by consulting all Markov predictors in parallel. Matches are made based on all three accesses, the two most recent, and the most recent access. In this example, if the three most recent accesses are ABC, matches would be based on ABC, BC, and C.

We first make order-three predictions by locating ABC in the trie, and observe that the depth-one prediction is M: M's probability of following ABC is 2/3, which is above the threshold 0.5. We calculated M's probability by dividing M's 2 by C's 3 (we get this by using 4 − 1), since ABC has occurred 3 times before this time and ABCM has occurred twice. We got D's probability of following ABC as 1/3 by dividing ABCD's occurrence count of 1 by ABC's 3. The order-three, depth-two prediction would be N, since M's probability of following ABC is 2/3 and we only check accesses following M. N's probability is also 2/3, obtained by dividing N's 2 by ABC's 3, for the same reason as with M.

Then we move to make order-two predictions by following C's vine pointer in the ABC sequence, which, as seen in Figure 2.2, points to C in the sequence BC. Both M and D are depth-one candidates since each has a probability of 50% of following BC.
Only N will be an order two depth two prediction since its probability of following A B C is 50% while the possibility of both E and F is 25%. Finally we are ready to make order one predictions. Following C s vine pointer in BC, we reached the first level in the net . The depth one prediction will be D, because M follow C with a 40% possibility while D has a 60% probability to follow C; there are no depth two predictions. Note that Figure 2.2 is not a complete representation of the data structures. Due to size constraints, we did not show any nodes represented by " . . . " in the trie shown. 2.3.2  T e m p o r a l and S p a t i a l L o c a l i t y  Prediction algorithms such as P P M can be used to detect either temporal locality or spatial locality. In the description presented above, we described the input to the P P M to be an access trace. To be more precise, the input is the sequence of page numbers accessed by the program. A system configured in such a manner can use P P M to detect temporal locality in the reference stream. Reference sequences that appear often in the history will appear as heavily-weighted strings in P P M Markov predictors. As a result, when the prefix of one such string is seen, the predictor can predict that the accesses represented in the remainder of the string may come next. If the P P M predictor is configured differently, however, it can be used to detect spatial locality instead of temporal locality. In this alternate configuration, the P P M uses relative difference between an access and the access that precedes it, not the absolute page number. For example, if a program accesses pages 10, 20, and 30, the P P M algorithm would receive as input 10, 10, and 10. If it finds a match, for example to predict 10, then the matched value is added to the last actual page number referenced to formulate the prefetch candidate; in this example the candidate would be 40. In our GMS-3P implementation, we have used both configurations to capture both the temporal and spatial locality characteristics of a program.  22  2.3.3  Other P P M Variants  Other variants of PPM-like algorithms have been proposed as well. We can take the age of a string into account when selecting from a set of matched strings, with strings that occur more recently given priority. For example, given the history A B C A B C . . . ABDAB, the algorithm might predict the next access as D even though ABC occurs more frequently than ABD. Another variant is to change the way our algorithm looks into the future. When predicting / references into the future, we can treat the references coming within I distances as the immediate next references. Thus the prediction results would reflect the probability of this pair's occurrence, independent of what happens in between these two accesses, which alleviates the effect of intermediate references to this pair. In the current scheme, to be able to make a prediction into I ahead of the current references, the probabilities of all the references come between these two need to be above the threshold.  2.3.4  Limitations of P P M in Practice  While both theoretical and simulation work on P P M and similar algorithms have been promising, substantial limitations exist to achieving good prediction results in practice. Bartels et al. [BKL 99] used order-one Markov predictor to predict virtual memory pages based on past fault +  information. They accumulate page faults and make one prediction at a time. 
They measured a 2% speedup for wave5, which is the best among the tests they have conducted. They also tested with order-two depth-one, order-one depth-two Markov predictors respectively and concluded they are worse than order-one depth-one predictions because the prediction accuracy is lower. The reason for this result is that they used an order-two Markov predictor, which excludes order-one prediction, and thus predicts more conservatively. In GMS-3P we investigate multiple order and various depth predictions by using P P M predictors, and pick the more accurate ones. The main purpose of choosing to use a fixed single order model with fixed search depth is to avoid the prediction overhead borne by the active node. Getting good results may require  23  a high-order model with high search depth, both temporal and spatial locality based algorithms, and fine-grained trace data. Doing any of these, however, increases algorithm overhead beyond the point of feasibility. This overhead taxes the two most critical workstation resources: C P U and memory. In particular, high-order P P M models consume space exponential in order. Note that the amount of memory consumed by P P M is critical, because it increases with the size of the input file and because the programs that will benefit most from P P M typically access files that are much larger than the amount of local memory available on the workstation. If the P P M predictor must compete with the running application for C P U and memory resources, either overall performance will degrade or prediction accuracy will suffer, or both.  2.4  Using Idle Workstations  In a workstation cluster, there may be idle workstations that have excess C P U and memory resources. Exploiting idle workstations for hosting guest computation  [ADV 95, AES97, AS98, +  CNSW97], or hosting memory for other nodes [DWAP94, FPK 95] has drawn increasing interest +  in the recent past as a research subject. One important way to utilize these resources to improve the performance of an application is to increase its parallelism and run some application threads on the idle workstation. For I/O-bound applications, however, it may be more beneficial to use an idle workstation to run a more accurate P P M predictor than is possible without the idle node's resources. This is particularly true if the I/O-bound application has limited parallelism compared to available workstation resources. We propose a novel architecture for predictive prefetching that utilizes idle workstations to perform prediction and prefetch in parallel with an I/O-intensive program running on another workstation node. Doing so insulates the running program from prediction overhead, both C P U and memory, and thus yields three key benefits. First, it permits the use of more sophisticated prediction algorithms such as high-order or deeper P P M . Second, it permits running multiple prediction algorithms in parallel, either on the same idle node or on a collection of idle nodes.  24  Finally, it permits the running node to refine the granularity of the access trace it collects.  25  Chapter 3  System Design and Implementation This chapter describes the design and implementation of our prototype of GMS-3P (Global Memory System with Parallel Prediction-based Prefetching). GMS-3P tracks the fault traces of an application and sends them to a prediction node, which implements the prediction policy. 
The prediction node makes predictions using two configurations of PPM, one based on the absolute fault trace and the other based on the relative differences between subsequent faults. The pages corresponding to valid prefetch requests are forwarded to the active node, which runs the application, and are stored in a prefetch buffer. The GMS-3P prototype changes the original GMS in the following ways. First, we instrumented GMS to send out the faulted page information to a prediction node when a fault is to be serviced by GMS. Second, the prediction node makes predictions and issues prefetch requests on behalf of the active node if there are valid predictions. Third, once the prefetched pages are located in GMS, the desired pages are forwarded to the active node. Fourth, the active node takes in each prefetched page and adds it to the prefetch buffer, which holds a number of recently prefetched pages. Fifth, at the time of servicing a fault, the prefetch buffer is first checked to see whether the desired page has already been fetched in; if so, the access results in a prediction hit; otherwise the fault is serviced as in the original GMS. We have implemented GMS-3P as an extension to the GMS [FPK+95] global memory system ported to the FreeBSD operating system on a PC cluster.

This chapter presents the details of our prototype in several parts. We begin by presenting a complete picture of the GMS-3P system. Then we describe the overall architecture of nodes with different roles in our system. Following this is a discussion about using multiple prediction nodes and issues about trace communication in the system. Then we discuss some design decisions made for the system.

3.1 The Big Picture

The GMS-3P prototype is tightly integrated with the GMS system. A simplified representation of this system is depicted in Figure 3.1.

Figure 3.1: The System Architecture of GMS-3P

Each circle in the diagram represents a node of interest. Arrows indicate the flow of messages among the nodes. The target node runs the application and sends its fault trace to the prediction node. Using the normal GMS protocol, pageouts are handled by sending the page to a storing node or discarding the page and sending a message for that page to the GMS directory node (GCD) to reflect the page's new location, as introduced in Section 2.1. The GMS directory node was modified to forward this information about page discards to the prediction node so that it can keep track of the memory content of the active node. This approach is an optimization that offloads the target node from the task of sending discard information directly to the prediction node. We will further explain why this is needed in Section 3.3.3.

The prediction node uses the access traces it has received to build up the PPM predictors, which are configured in two different ways, and to make predictions upon a page fault if it has successfully matched the recent accesses against the fault history. It then checks whether the predicted pages are cached on the active node or not. If not, and the algorithm's hysteresis accuracy is above the threshold, a prefetch request for that page is issued to the directory node that has the information for the storing site of the page. This is done using slightly modified GMS protocols. The directory node locates the page in global memory and forwards the prefetch request to the page's storing node if such a page exists in GMS; if not, the prediction node is notified of this failure.
Finally, the storing node forwards the page to the target node. At the time of a VM page fault or file cache miss, the target node checks to see whether the desired page has been fetched in by the prefetch process. If so, the access results in a prediction hit and the fault service latency is reduced. Otherwise, it reverts to the original GMS to service the fault.

3.2 Architecture of the Active Node

The active node has two key functions: one is to send out the page fault trace as efficiently as possible and the other is to take in the prefetched pages and use them. To keep the overhead minimal, we have implemented fast communication for tracing information using a modified Trapeze [ACG+98] Myrinet Control Program (MCP), which is discussed separately in detail in Section 3.6.3. As for handling the prefetched pages, we used a hashed FIFO to quickly locate a desired page and to control the amount of memory devoted to holding prefetched pages.

Figure 3.2: Getpage control flow in GMS-3P

3.2.1 Tracing Fault Stream

As described in Section 2.1, GMS modifies the kernel to insert calls at each point where pages are either added to or removed from the file buffer cache or the VM. Inserting calls into the file buffer cache and VM modules allows GMS to track the collection of pages on the node and gives GMS the opportunity to turn disk reads into remote memory transfers. For GMS-3P, we modified these calls to send the faulted page to the prediction node, and to check whether we have prefetched the desired page so as to shortcut the fault service.

Figure 3.2 depicts the operations on the active node related to GMS getpage and prefetch operations and illustrates when we send out the fault trace. Rectangles with solid lines represent the operations to process the fault and to locate the page in GMS, while rectangles with dashed lines represent prefetch-related operations. Arrows indicate the control flow. During this process, as soon as getpage identifies the faulted gpn, we send out the trace to the prediction node, which might result in predicted pages being forwarded to the active node. Forwarding the faulted page information could have been done at any point during the period when the host is faulting on a page. It was added to gms_os_getpage because, at that time, the faulted frame is associated with a gpn, which is all that is needed to notify the prediction node. Thus, we avoid the overhead of redundantly identifying the gpn and are able to notify the prediction node as soon as possible, overlapping I/O as much as we can. Note that the getpage process may be shortcut if the desired page is in the prefetch buffer. A detailed discussion of the communication of the fault trace is given in Section 3.6.

Figure 3.3: Handling prefetched pages on the active node

3.2.2 Handling Prefetched Pages

The handling of prefetched pages on the active node can be summarized with two actions: taking in the prefetched pages and using them to service page faults. If a predicted page is located successfully in GMS-3P, the page will be forwarded to the active node. Upon receiving the prefetched page, the active node takes in the page and adds it to the prefetch buffer, which is implemented as a fixed-size FIFO. If the FIFO has not reached a predetermined size, the page is simply taken in; a sketch of such a hashed FIFO buffer is shown below.
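The following is a minimal sketch, under assumed names and sizes, of such a hashed FIFO: a hash table over global page numbers (gpn) lets the fault path probe the buffer quickly, while FIFO order determines which page to evict when the buffer is full. It is illustrative only and omits the locking and page-frame management a kernel implementation needs.

    #include <stdint.h>
    #include <string.h>

    #define PB_SLOTS   64               /* assumed prefetch-buffer capacity      */
    #define PB_BUCKETS 128

    struct pb_entry {
        uint64_t gpn;                   /* global page number                    */
        void    *frame;                 /* prefetched page contents              */
        int      next;                  /* next slot in the same hash bucket     */
        int      valid;
    };

    static struct pb_entry pb[PB_SLOTS];
    static int bucket[PB_BUCKETS];      /* head slot of each bucket, -1 = empty  */
    static int head, count;             /* oldest slot and number of live slots  */

    void pb_init(void)
    {
        memset(pb, 0, sizeof(pb));
        for (int b = 0; b < PB_BUCKETS; b++)
            bucket[b] = -1;
        head = count = 0;
    }

    static void hash_unlink(int slot)
    {
        int *p = &bucket[pb[slot].gpn % PB_BUCKETS];
        while (*p != slot)
            p = &pb[*p].next;
        *p = pb[slot].next;
    }

    /* Take in a prefetched page, evicting the oldest one if the buffer is full. */
    void pb_insert(uint64_t gpn, void *frame)
    {
        int slot;

        if (count == PB_SLOTS) {            /* full: discard the oldest page     */
            if (pb[head].valid)
                hash_unlink(head);
            pb[head].valid = 0;
            head = (head + 1) % PB_SLOTS;
            count--;
        }
        slot = (head + count) % PB_SLOTS;   /* next free slot in FIFO order      */
        pb[slot].gpn   = gpn;
        pb[slot].frame = frame;
        pb[slot].valid = 1;
        pb[slot].next  = bucket[gpn % PB_BUCKETS];
        bucket[gpn % PB_BUCKETS] = slot;
        count++;
    }

    /* Probe the buffer at fault time; on a hit the frame is handed back to the
     * caller (to be mapped into the VM or file cache) and the entry is retired.
     * The real system also shrinks the FIFO when a buffered page is consumed;
     * here the slot is simply reclaimed lazily when it reaches the head. */
    void *pb_lookup(uint64_t gpn)
    {
        for (int s = bucket[gpn % PB_BUCKETS]; s != -1; s = pb[s].next) {
            if (pb[s].valid && pb[s].gpn == gpn) {
                hash_unlink(s);
                pb[s].valid = 0;
                return pb[s].frame;
            }
        }
        return NULL;
    }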
If the FIFO is full, the oldest prefetched page is discarded. When a page in the prefetch buffer is accessed by the program on the target node, it is moved from the prefetch buffer into the virtual memory system or to the file buffer cache, depending on the nature of the access, and we reduce  30  the size of the FIFO. Note that we still send a message to the prediction node about this page fault to let it have an unchanged fault stream as if there was no prefetch. The role of the size of the prefetch buffer is to limit the amount of memory consumed by prefetched pages that have not been accessed. As prediction is a speculative process, we expect to predict pages that will not be used. Thus, some mechanism is needed to remove prefetch mistakes from the target node's memory. Figure 3.3 shows a simplified representation of the modified GMS with a prefetch buffer. The boxes represent functional components and the arrows show the control relationship. When GMS services a page fault, it first checks the prefetch buffer, if the desired page has been prefetched into the buffer the fault is serviced; otherwise GMS goes to the remote memory to get the page. GMS on the other nodes forwards predicted future references to the active node, which are added to the prefetch buffer.  3.3  Architecture of the Prediction Node  The role of the prediction node is to make predictions and to issue prefetch requests on behalf of the active node. Here, we discuss how the prefetch requests are made among the valid predictions. First, how to discern the valid predictions and then how to decide what to prefetch.  3.3.1  Prediction by Partial Matching  As explained in Section 2.3.1, we have chosen P P M as the primary prediction algorithm in our prototype GMS-3P implementation. We configure the prediction algorithm with the page fault traces and vary the order, length, and prediction accuracy threshold to test the differences those could bring to the performance of such a system.  3.3.2  Multiple Prediction Algorithms  As stated previously, one of our goals is to support the parallel, execution of multiple prediction algorithms. This strategy could be implemented on a single prediction node that multiplexes  31  execution of different prediction algorithms if sufficient memory and C P U resources are available on that node for all algorithms. Alternatively, multiple nodes could be used to run the algorithms in parallel. In GMS-3P, we chose to use two different algorithms on a single node to perform the prediction. As described in Section 2.3.2, we have configured one P P M predictor to use the page fault trace as the prediction base and the other predictor to use order one Markov predictors based on the relative differences between subsequent page faults, as described in Section 2.3.2. An alternative would be to implement multiple prediction algorithms on different nodes, we will discuss the design of that in Section 3.5. 3.3.3  T r a c k i n g the A c t i v e N o d e ' s M e m o r y Content  The prediction node accumulates access traces for the application running on the active node and uses the prediction algorithms to make predictions. Once predictions are made, the prediction node must be able to efficiently determine whether pages it predicts are already in the memory of the target node or not. This check is important because the algorithm will often predict pages that are already in memory due to locality in the application access trace and the prediction algorithms used. 
The overhead of unnecessary prefetching in these cases must be avoided. A valid prediction is a page not stored on the target node: neither in its VM, nor in its file buffer cache pool, nor in the prefetch buffer. Only valid predictions are candidates to be prefetched to the active node. In order to discern the valid predictions from the pages already in the active node's memory, the prediction node keeps track of the content of the active node's local cache and of its own predictions that are stored in the prefetch buffer on the active node.

In a traditional PPM implementation, the prediction algorithm runs on the same node as the target application. Each time it makes a prediction, it can easily check the node's current memory contents to determine whether the predicted page is already in memory, and it prefetches the page only if it is not in memory. However, separating the prediction node from the target node complicates this check.

Figure 3.4: Tracking active node's memory from prediction node

To solve this problem, the prediction node maintains a shadow list recording the current content of the target node's memory. To maintain this shadow list, the target node could send a message to the prediction node each time it acquires a new page or discards an old one. Note, however, that the target node must have already sent its fault stream to the prediction node and that the stream includes all new page accesses. It is thus sufficient if the prediction node can get a list of discarded pages. As discussed in Section 3.2.1, the fault trace collection from the active node makes the prediction node aware of all the pages coming into the active node. Noticing that GMS also handles putpage as described in Section 2.1, we intercept the handling of putpage and page-discard messages on the GCD nodes to let them inform the prediction node about these pageout operations. In this way we avoid the overhead of an extra message on the active node and still track the memory content of the active node remotely from the prediction node. These procedures are shown in Figure 3.4, where circles represent the nodes involved in the process. The solid arrow indicates that when there is a fault trace, the prediction node adds an entry as the active node gets the page. The dashed arrows show that when there is a pageout, the prediction node gets a message from the GCD node and deletes an entry.

3.3.4 Decisions on What to Prefetch

The key remaining issue is to decide which valid predictions should actually lead to prefetches. As discussed in Section 3.3.3, the algorithm checks the current contents of the active node's memory for each prediction. Any predicted page that is not currently in that memory is a candidate for prefetching. When a single algorithm is used, the prediction threshold parameter (described in Section 2.3.1) determines which predicted accesses are considered prefetch candidates. While this strategy could continue to be used when multiple algorithms are employed, doing so means that the number of pages prefetched on each access can grow linearly with the number of prediction algorithms. However, our goal is to increase prefetch quality, not its quantity.
While increasing the number of pages prefetched may achieve this goal, it does so at the cost of increasing system overhead to transfer more pages, increasing the overhead on the target node for receiving and storing pages that it might not access, and adding competition between the servicing of normal page faults and the prefetching activity. Using the traditional approach, the only way to reduce the number of pages prefetched is to increase the prediction threshold. Following this strategy, increasing the number of algorithms makes every algorithm more conservative in its predictions.

Our key motivation for using multiple algorithms is the observation that different algorithms target different types of application behaviour. For example, some applications exhibit temporal locality while others exhibit spatial locality; still others exhibit multiple behaviours in different phases of their execution. By using multiple algorithms, we thus hope that at least one of them will be a good predictor for application behaviour at a given point in time. Therefore, the best way to limit the number of pages prefetched is not to make every algorithm more conservative, but instead to identify the algorithm that currently has the most success in predicting application accesses and to allow that algorithm to control what pages are prefetched. We thus introduce a second prefetch threshold as a filtering mechanism based on prediction-accuracy hysteresis.

The hysteresis threshold works as follows. First, it is a simple matter for each prediction algorithm to track its success rate. As each traced access is received from the target node, the prediction algorithm checks to see if that page was previously prefetched by that algorithm. The algorithm performs this check using a sliding window of its prefetch history. If the accessed page is found in this window it is marked accurate. The algorithm then computes the average prediction accuracy of the window; pages not marked accurate are assumed to be inaccurate predictions. Finally, once the algorithm picks a prefetch candidate, it compares its average prediction accuracy to a pre-established threshold and issues a prefetch only if its accuracy is above the threshold. Note that we call this a prediction-accuracy hysteresis because we do not really know the timing of the prefetched page's arrival at the active node. Thus, a prefetch request is only issued for a prediction when:

• its probability is higher than the prediction algorithm's threshold;
• the page is not currently cached on the active node; and
• the algorithm's recent predictions achieve a higher accuracy than the hysteresis threshold.

3.3.5 Data Maintained on the Prediction Node

In this section we discuss the data maintained on the prediction node to achieve the functionality discussed above. Figure 3.5 depicts what resides on the prediction node. The prediction node in our GMS-3P prototype must maintain the following information:

1. A shadow list recording the current content of the target node's memory, as described in Section 3.3.3.

2. An accuracy threshold based on prediction-accuracy hysteresis, as described in Section 3.3.4.

Figure 3.5: Data maintained on the prediction node

Additionally, both the temporal and spatial prediction algorithms must maintain the following, respectively:

1.
The trie of P P M predictors, as described in Section 2.3.1. 2. A table of predictions recently made by the algorithm, which resulted in a prefetch being issued. 3. A table of predictions recently made by the algorithm, which did not result in a prefetch being issued. 4. A sliding window which contains only the most recent predictions made by the algorithm, issued and unissued, for the purpose of accuracy rating. As previously described, the accuracy threshold is one of the determining factors in deciding whether a prediction will result in a prefetch. If the accuracy rating for a prediction algorithm is below this threshold, no prefetch request will be issued. Regardless of the prefetch, all predictions are recorded in one of the two different prediction tables listed above in order to judge the current accuracy rating of a prediction algorithm. The only reason to separate them is to know which predictions are held on the active node for the purpose described in Section 3.3. The sliding window of only the most recent predictions is used to determine the current, accuracy rating of an algorithm. When a request for a page arrives in the trace data from the  36  target node, it is first checked to see if this page has an entry in the sliding window of one of the algorithms. If it does, the accuracy rating of the algorithm is increased. When a recently accessed entry in the sliding window is bumped, the accuracy of the algorithm is decreased. Before the prediction node requests any prefetching, the current accuracy rating must exceed the predetermined accuracy threshold. If this condition is not met, the prediction is added to the table of unissued predictions. Then the faulted page is used to make predictions. If the predicted page is a bonafide candidate according to the P P M threshold parameter and is not already in the memory of the target node, the accuracy threshold is checked. If the accuracy rating is high enough, the request to prefetch is issued. Otherwise, the prediction is added to a list of unissued predictions.  3.4  Architecture of the Other Nodes  The other nodes in the system function nearly the same way as nodes in GMS. These nodes host global pages for the active node and communicate both with the active node for the original GMS messages and with the prediction node for trace information on pageouts and page discards. They also handle G M S - 3 P messages to locate a prediction page and forward it to the active node if the lookup is successful; otherwise, they inform the prediction node that the desired page is not in the GMS and hence is not forwarded to the active node. In one sentence, those nodes keep playing the roles as GCD and/or PFD nodes with augmented functionalities to process G M S - 3 P related work. Once the prefetch requests have been identified, the prediction node issues them on behalf of the active node. We used slightly modified GMS protocols to locate the predicted page and forward it to the active node if such a page is found. Locating is very similar to the original getpage protocol. The prediction node uses the UID of the predicted page to hash into the POD, producing an IP address of the node implementing the appropriate region of the global-cachedirectory.  The prediction node sends a message containing the UID and the active node's ID to  that GCD node, requesting the page identified by the UID to be forwarded to the active node.  37  The GMS on the GCD node does a lookup in the global-cache-directory, which results in either a miss or a hit. 
In case of a miss, a message is sent back to the prediction node notifying it of the lookup failure. In case of a hit, the GCD node forwards the request to the node that contains the page-frame-directory for the UID, which then finishes the forwarding of the page to the active node.  3.5  Multiple Prediction Nodes  Other issues occur when prediction algorithms are executed on multiple nodes to further harness the resources in the cluster and the fast network messaging we have developed catering for small messages. Recall that the prediction node must maintain an accurate shadow table of the target node's memory. It needs three flows of information to do this: the access trace sent by the target node, the page discard information sent by the GCD node, and the list of pages it prefetches for the target node. Of course, all prediction nodes must receive the trace and discard flows about the running application. In addition, each prediction node must inform the other prediction nodes of the pages it prefetches. Thus a prediction node could have a clear picture of what is cached on the target node. However, the bookkeeping overhead is overwhelming because of the number of messages involved. : One way to solve this problem might be to introduce a master node that directly communicates with the target node that takes care of administration work for other prediction nodes. Most of the functionalities described in the above session stay on the master node. On the other prediction node, they have all the information needed to make predictions and facilities to track its own accuracy. As in a one prediction node system, the target node will only know a master prediction node, which it forwards faulted streams to. The master prediction node will forward the reference traces to other prediction nodes to let them make their own predictions using different algorithms. Each node makes its own best predictions and does it own accuracy rating as described in Section 3.3.5. The only difference here is that there will only be one'prediction table to keep, all the recent  38  predictions it has made in order to do the accuracy accounting. They do not need to know whether the predicted pages have been forwarded to the active node because the master node takes care of it. After predictions are made and only if the accuracy is above the threshold, they are forwarded to the master node. Otherwise, they will only be added to the prediction table. The master node will decide whether to issue prefetches for the predictions based upon whether the predictions have been made by other nodes, or ultimately, whether the pages are in target node's memory. The master node keeps a shadowed list of the target node's local memory and all the prefetch requests issued. In this scheme, instead of running everything on a single node, we can distribute the workload to other idle workstations, but suffer some cost for sending the traces out in return. If there are predictions coming from the other prediction nodes, the master node will process the messages and check whether to issue prefetch requests. In the case of the accuracy of the algorithm running low, the master node only needs to send the trace out.  3.6  Trace Communication and Collection  Offloading prediction from the faulting node substantially lowers its prediction overhead. The remaining overhead on the target node is the cost of communicating the fault trace to the prediction node. 
It is thus important to provide an efficient means for the target node to send its access trace to the prediction node. With proper network support, this communication overhead can be quite small. For example, in our prototype the overhead for trace communication is on average around 3 μs, as shown in Section 4.2. It is thus feasible to substantially refine the collection granularity without increasing faulting-node overhead if doing so will yield more accurate predictions. Note that we must also be concerned with prediction latency, which is somewhat increased by our distributed scheme due to communication overhead. The key constraint is that the prefetch latency needs to be smaller than the average inter-access time. If this is not the case, the system will have insufficient time to predict and prefetch the immediate next page based on access i before it is referenced on access i + 1. Below we discuss some issues related to trace collection and communication for prediction and present our implementation.

3.6.1 Granularity of the Trace Information

Identifying trace collection and communication overhead is important for deciding the best trace granularity to use for making predictions, which has been one of the concerns for prefetching systems. Griffioen and Appleton chose to use the whole file as the prediction source simply because of the higher overhead of tracking more detailed information [GA96]. They considered using read, write, and seek events to make intrafile predictions and prefetches, but were wary of the overhead this might introduce. Instead, they suggest that for applications where whole-file prefetching is not appropriate (e.g., database systems), user-hinted approaches like Cao's and Patterson's [CFKL95, PGG+95] should be used. Bartels et al. [BKL+99] collected traces at page-fault granularity and did not report a prohibitive overhead in doing so, but they only employed an order-one Markov predictor, which is faster than multi-order PPM predictors.

Considering current approaches, we believe that page fault information should be chosen for four reasons. First, page-fault latency is large compared to prediction overhead, so it is feasible to perform prediction-based prefetching without significantly slowing the faulting application. Second, the OS gains control on every page fault, so trace collection can easily be added to the system's page-fault handler. Third, as the goal is to predict pages before they are faulted, it makes sense for many applications to do this by detecting patterns in page faults. Fourth, a page is the OS access granularity and hence correct predictions will not waste any system resources. A finer granularity may actually make prediction more difficult by mixing accesses to data structures that are touched infrequently with those to the key I/O-bound structures. Of these four reasons, only the third is related to the prediction algorithm itself. In contrast, the first two reasons are pragmatic choices to ease implementation and lower the overhead.

3.6.2 Alternate Trace Collection

Here we discuss some approaches to collecting information about finer-grain accesses than page faults. In systems that use a software-loaded TLB, it is possible to instrument the TLB-miss handler to provide fine-grain access information at the cost of increasing TLB-miss overhead. However, this approach will not work on hardware such as Intel x86 systems that have hardware-implemented TLBs.
An alternative approach is to change the access permission of a subset of pages to induce an access-violation trap to the OS when those pages are accessed. A similar approach is used by most BSD operating systems to implement the Clock page-replacement algorithm, an approximation of LRU. In this approach, pages are maintained on two FIFOs: the active list, which is directly accessible to applications, and the inactive list, which is protected. The memory-management hardware traps accesses to inactive pages, transferring control to the operating system, which re-enables the page's access permissions and moves it to the front of the active FIFO. The inactive list is periodically replenished by moving pages from the bottom of the active list to the top of the inactive list. Pages are replaced from the bottom of the inactive list. We can modify this scheme to collect trace information anytime an inactive page is accessed. Trace granularity can be further reduced by reducing the number of pages on the active list.

3.6.3 Implementation in GMS-3P

As stated previously, a key goal of our system is to minimize the overhead that prediction imposes on the target node, which in GMS-3P means communicating the fault trace efficiently. We implemented our prototype system in a cluster connected by Myrinet, a gigabit network [BCF+95]. Myrinet network interfaces are implemented with a host-programmable network processor. We modified the firmware program Trapeze [ACG+98] running on this processor to provide a lightweight communication mechanism for trace data. The mechanism is connection-based and consists of two buffer rings.

Figure 3.6: Sender and Receiver Rings in GMS-3P

As depicted in Figure 3.6, the sender ring is stored in memory on board the network processor of the sending node (i.e., the target node) and the receiver ring is stored in the host memory of the receiving node (i.e., the prediction node). Each ring consists of a set of 32-bit entries.

Sending from the Target Node

The sending host maintains a pointer to the next available entry in the sender ring. To send a message, it uses programmed I/O to read the entry from the network processor's memory to determine if the entry is actually free. We use a unique tag value to indicate that the entry is available. If so, the host completes the send by writing the value to be sent into the entry. If not, the host skips sending the message. The network processor on the sender also maintains a pointer to the next available slot in the send ring. It periodically checks the value stored in this slot against the available tag, detecting a new value written by the host when the value it reads does not match this tag. When it receives a new value, it formulates a message, sends the message, writes the available tag to the ring entry, and advances its ring pointer.

Receiving on the Prediction Node

The process followed on the receiving node is similar. When the message arrives, the network processor uses host-memory DMA to copy the received value into the next available slot in the receive ring and advances its ring pointer. A thread on the receiving host periodically polls the next available ring slot waiting for a new value. When it receives the value, the prediction node copies it into a data structure accessible to the prediction algorithm, writes the available tag into the ring entry, and advances its own ring pointer. A simplified sketch of this ring protocol is shown below.
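The following host-side sketch illustrates the two rings in C. The ring size, the tag value, and the function names are assumptions, and the real Trapeze MCP firmware and its host interface are more involved; in particular, a real implementation must choose a tag value that cannot collide with a legal page number.

    #include <stdint.h>

    #define RING_ENTRIES 256
    #define AVAILABLE    0xffffffffu     /* assumed tag: entry is free          */

    /* ---- target (sending) node: ring lives in NIC memory ----------------- */
    static volatile uint32_t *send_ring; /* mapped LANai SRAM (setup not shown) */
    static unsigned send_next;

    /* Called from the page-fault path with the faulted global page number.
     * If the next slot is still busy, the trace record is simply dropped. */
    void trace_send(uint32_t gpn)
    {
        if (send_ring[send_next] != AVAILABLE)   /* programmed-I/O read         */
            return;                              /* skip rather than stall      */
        send_ring[send_next] = gpn;              /* programmed-I/O write        */
        send_next = (send_next + 1) % RING_ENTRIES;
    }

    /* ---- prediction (receiving) node: ring lives in host memory ---------- */
    static volatile uint32_t recv_ring[RING_ENTRIES];
    static unsigned recv_next;

    /* Poll the next receive slot; returns 1 and the value if one has arrived. */
    int trace_poll(uint32_t *gpn)
    {
        if (recv_ring[recv_next] == AVAILABLE)
            return 0;
        *gpn = recv_ring[recv_next];
        recv_ring[recv_next] = AVAILABLE;        /* hand the slot back to NIC   */
        recv_next = (recv_next + 1) % RING_ENTRIES;
        return 1;
    }

The policy that decides when trace_poll is called on the prediction node, and how the host avoids spinning indefinitely, is described next.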
Receive-ring polling on the receiving host is handled by cooperation between a special predictor thread and the system's idle loop. The main job of the predictor thread is to consume the trace data received from the target node, compute the prefetch candidates, and issue prefetches as necessary. At the end of each step (i.e., after issuing the prefetches associated with the latest-received access), the algorithm checks the receive ring to determine if more trace data is available. If not, it spin-waits for a bounded period of time and, if no data is received within this interval, it blocks. We modified the OS's idle loop, which runs when no other threads are runnable, to poll the receive ring and to unblock the predictor thread when data becomes available. This blocking and unblocking is necessary to ensure that a spinning predictor thread does not hold the processor for too long. The predictor thread is a standard FreeBSD kernel thread, which is non-preemptable by other threads.

3.7 Design Tradeoffs and Limitations

3.7.1 Prefetching From Global Memory into Local Memory

Predictive prefetching is a speculative process; without perfect knowledge of future reference streams, we must be careful about the side effects that prefetching might introduce into the system. One of those concerns is the currently large latency of disk transfers. We did not consider prefetching from disk to global memory or to local memory for two reasons. First, the penalty of a wrong prediction that results in a prefetch from disk is high, and it might actually degrade performance by wasting critical system resources. Second, given the large latency, it requires us to predict much further into the future to hide the latency. On the other hand, successful predictions and prefetching from disk would give us more benefit, since the difference between the latency of a prediction hit and that of a disk transfer is much larger.

For our prototype GMS-3P, we only considered prefetching from global memory to local memory. In the environment of a high-speed cluster network such as the Myrinet [BCF+95] and a cooperative network memory system such as GMS [FPK+95], which we have used, the cost of prefetching a page from another node's memory over a gigabit-per-second network is low, around 200 μs as we have measured. This prefetch latency is about 50 times faster than prefetching from disk [ACG+98]. As we have discussed, prefetching from global memory into local memory cuts both ways: (1) it reduces the penalty of a wrong prediction, which encourages speculative and even aggressive prefetching, and it also loosens the constraints on prediction distance since the transfer can happen much faster; (2) the benefit of a successful prefetch is smaller, since the time to service a fault from global memory is already much lower.

3.7.2 Having Prefetched Pages as Duplicated Pages

As discussed before, the goal of GMS is to globally coordinate memory management. It attempts to allow any user application to benefit from the available network-wide resources and to use the global resources efficiently and effectively. Pages on a node P are classified as either local pages, which have been recently accessed on P, or global pages, which are stored in P's memory on behalf of other nodes. Pages may also be replicated or unique [Fee96]. A replicated page is found in the local memory of multiple nodes and is not kept in global memory, for better utilization of the global memory.
A unique page may reside in local or global memory and is found in exactly one node's memory, in the system. GMS-3P treats the prefetched pages as replicas of the original local or global pages for the following reasons. First, the prefetched page may not get into the target node's memory before 44  it is wanted by other node again; and even if it gets there, the latency will be longer because we manage the prefetched buffer separately from the original GMS. Second, if we choose to free the original page as soon as we forward the page to the target node, we may need to forward that page out at a later time to the global memory. This will happen once the target node reaches its limit on the number of prefetched pages and starts evicting aged prefetched pages. We have chosen to make the prefetched page a duplicate of the original page for two reasons. First, at most we will have the number of replicated global pages equal to the size of the prefetch buffer. Second, we can avoid a page transfer when the prefetch buffer is full and we need to discard a prefetched page. 3.7.3  U s i n g the G C D N o d e to Track the A c t i v e N o d e ' s M e m o r y  As previously discussed in Section 3.3.3, we track the content of the active node's memory on the prediction node to discern valid predictions. There are a few ways of targeting this problem. Here we briefly discuss them. First, following a prediction, the prediction node could exchange network messages with the target node to ask it if the predicted page is stored there. This communication, however, adds overhead to the target node and increases prediction latency. Second, the prediction node could maintain a shadow list recording the current content of the target node's memory as we explained in Section 3.3.3. But instead we let the target node augment its faulted stream with a list of discarded pages. The addition of discard information at most doubles the overhead of trace collection, assuming, page-grain access information. If access information is finer grained, the overhead impact is relatively smaller. Compared with these two approaches, the one we used is better because system-wide we have the same amount of messages, but avoid the overhead on the active node.  45  3.7.4  K e e p i n g the H i s t o r y of Prefetched Pages  The prediction node also keeps track of the prefetched pages currently residing in the prefetch buffer on the target node. One way to do that is to let the target node notify it at the time of taking in a prefetched page. This is clean but opposite to our goal — alleviate the overhead on the target node as much as possible. Another way is to make the storing node inform the prediction node when it is about to forward a found-prefetched page to the target node. The target node would notify the prediction node on an unsuccessful receiving of the page, such as it is a late prediction. There should not be much overhead on the target node, but many messages will be needed between the storing node and the prediction node. We chose to add an entry to the prefetched page history when a request is issued, no matter whether it will be a miss or not. If it turns out to be missing from the global memory, either the directory node (GCD) or the storing node (PFD) will send a message back about the locating failure. Finally, if the target node fails to take it in, a message is sent back to the prediction node too. 
Since the prediction heuristic is based on historical reference streams, if the predicted page is not in the target node's memory, then it most likely resides in global memory. Thus, we can keep an accurate history but do not incur much overhead.

Chapter 4

Performance Evaluation

In this chapter we present the performance results we have measured for our prototype GMS-3P implementation. First, we provide microbenchmark measurements of the prefetch latency and overheads. Second, we use synthetic workloads to measure the I/O stall time reduction for different configurations of the PPM algorithm. Third, we show that programs exhibit different memory access patterns at different phases of execution, which cannot be captured by a single algorithm. Finally, we look at the necessity of controlling prediction quality with the prediction-accuracy hysteresis.

4.1 Experimental Setup

Our experiments were conducted on a cluster of 266-MHz Pentium II PCs with 128 MB of memory running FreeBSD 2.2.2. Unless otherwise noted, the tests were run on an 8-node cluster. The PCs are connected by a Myrinet network that uses 33-MHz LANai 4.1 network processors [BCF+95]. Our prototype system GMS-3P uses a modified Trapeze Myrinet control program [ACG+98] that runs on the LANai for efficient fault trace communication, as described in Section 3.6.3. The page size and the unit of transfer are 4 Kbytes.

    Location                  Latency (μs)
    Local Memory (unmapped)       6
    Prefetch Buffer              44
    Global Memory               212
    Disk                       9561

Table 4.1: Latency of page access

4.2 Microbenchmarks

We report the fundamental measurements to evaluate the potential performance of our prototype implementation of GMS-3P. All of these measurements were taken on otherwise idle machines and network.

4.2.1 Access Latency

Table 4.1 gives the memory access latency from various memory locations in our system. These times were determined by reading a large file non-sequentially and taking the median value of 2000 trials. The entries describe each of the cases: accessing a local unmapped page requires 6 μs, accessing a prefetched page needs 44 μs, and accessing a page from global memory requires 212 μs. Disk access time varies for reasons such as the distance of the disk arm movement involved; the number in the table is the median, while the range is from 8111 μs to 31070 μs. The access latencies are consistent with the amount of work that must be done to map the page into the associated physical map. For an unmapped page in local memory, an entry is simply inserted into the physical map. A page in the prefetch buffer requires inserting an entry into the object page table, the object list, and the physical map. Accessing a page in global memory includes generating a getpage request on the faulting node, looking up and generating a request on the GCD node, and looking up and transferring the page from the PFD node; finally, we insert it into the physical map upon arrival of the page. At the application level, accessing remote memory is 50 times faster than accessing disk, and accessing a prefetched page is approximately 5 times faster than accessing remote memory.
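As an illustration of how latencies of this kind can be measured, the following user-level sketch times random page touches in a memory-mapped file with the Pentium cycle counter and reports the median. The file name, trial count, and clock rate are assumptions, and this is not the exact benchmark used to produce Table 4.1.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096
    #define TRIALS    2000
    #define CPU_MHZ   266.0                         /* 266-MHz Pentium II */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static int cmp(const void *a, const void *b)
    {
        return (*(const uint64_t *)a > *(const uint64_t *)b) -
               (*(const uint64_t *)a < *(const uint64_t *)b);
    }

    int main(void)
    {
        int fd = open("/tmp/bigfile", O_RDONLY);    /* assumed large test file */
        off_t npages = lseek(fd, 0, SEEK_END) / PAGE_SIZE;
        char *map = mmap(NULL, (size_t)npages * PAGE_SIZE,
                         PROT_READ, MAP_SHARED, fd, 0);
        static uint64_t sample[TRIALS];
        volatile char sink;

        srandom(42);
        for (int i = 0; i < TRIALS; i++) {
            off_t p = random() % npages;            /* non-sequential access   */
            uint64_t t0 = rdtsc();
            sink = map[p * PAGE_SIZE];              /* touch one byte of the page */
            sample[i] = rdtsc() - t0;
        }
        qsort(sample, TRIALS, sizeof(sample[0]), cmp);
        printf("median access latency: %.1f us\n", sample[TRIALS / 2] / CPU_MHZ);
        (void)sink;
        return 0;
    }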
Figure 4.1: Detailed timeline of GMS-3P operations during a prefetch process

4.2.2 Prediction Latency

Figure 4.1 shows a timeline of the operations on the different nodes in the process of prefetching. It starts with the trace collection and ends with receiving the predicted page. From left to right, the top line is the trace collection and communication overhead on the target node (2.2 μs); the second line is the prediction processing on the prediction node (26 μs); the next line is the lookup and request generation on the GCD node (6 μs); the fourth line is the lookup and page forwarding on the PFD node (75.7 μs); and the last line is the target node accepting the prefetched page (2.6 μs). It is important to note that the time we measured is the host CPU time excluding the interrupt handling. With the exception of the time to trace the faults on the target node, all the other timelines are subject to change because they are initiated from other nodes. The time to trace the fault on the target node includes the time to verify that the faulted gpn belongs to the target file and to create a message sent to the prediction node, as discussed in Section 3.6.3. The prediction node uses the temporal and spatial algorithms to make predictions and discerns whether to issue prefetch requests, as described in Section 3.3. The handling time starts from receiving the faulted page number and ends when prefetch requests are issued. On the GCD node, the processing includes looking up the PFD node of the predicted gpn and creating a request for that page to be sent to the target node. Note that for the timeline on the GCD node, we do not consider the situation in which it is also the PFD node. The PFD node looks up the page and forwards it to the target node. The current implementation of forwarding the page needs to make a copy from the host memory to the network interface card memory because of the underlying message system. At last, the target node takes in the prefetched page and adds it to the prefetch buffer. For the measurements shown in Figure 4.1, we measured the operations using the Pentium cycle counter, took 250 samples, and reported the median time. This process was repeated three times and the median is reported.

    Number of          Sequence   Latency (μs)    Latency (μs)    Latency (μs)
    Prefetched Pages              (3-Node Sys.)   (4-Node Sys.)   (8-Node Sys.)
    1                  1             231             235             242
    2                  1             264             259             257
    2                  2             335             328             292
    3                  1             268             271             259
    3                  2             386             353             301
    3                  3             471             408             364
    4                  1             281             282             270
    4                  2             400             350             322
    4                  3             524             422             375
    4                  4             602             500             425

Table 4.2: Prefetch latency — multiple prefetches following one page fault with different number of nodes in the system

Next we detail the prefetch latency seen by the target node, which is the total elapsed time from seeing the faulted page to receiving the predicted pages. Table 4.2 shows the differences in prefetch latency when there are different numbers of prefetch requests generated from one fault; it also shows the differences in prefetch latency when there are different numbers of nodes in the system. The latency is measured by starting the Pentium cycle counter on the target node when a page fault occurs and stopping the counter when receiving the first prefetched page, the second, and up to the fourth.
We performed the tests three times, taking 250 samples each time, and report the median value. This latency is near the best case because we put a 1000 μs delay between the faults. Thus, there is almost no contention on the nodes.

The first row of the table gives the latency for one prediction with different numbers of nodes in the system. In this situation, a system with 3 nodes has the lowest latency. This is because in approximately 50% of the distribution, the GCD node and PFD node would be the same node, which saves one message and the interrupt handling time. This will not be the case if we do not guarantee that the node has sufficient time to process each fault, because it is then not possible to overlap I/O and request handling time. The last row of the table gives the latency for the fourth prediction. It is clear that in this situation the eight-node system outperforms the three-node system, since the handling of GCD messages and PFD messages can be distributed among different nodes, thus overlapping the handling time. The columns of the table show the differences in latency when different numbers of predictions are made. The increase in latency within the same sequence number mainly reflects the longer processing time on the prediction node; the increase in latency between different sequence numbers includes the gap between requests issued from the prediction node and the processing time on the target node, assuming all the GCD node and PFD node handling is fully overlapped. This is almost the case for a system with eight nodes, since there is a steady difference between the latencies of the different sequences when different numbers of predictions are being made.

4.2.3 Comparison of gms_getpage in GMS and GMS-3P

Figure 4.2 depicts the elapsed time on the active node for the gms_getpage operation in the original GMS system and in GMS-3P. In GMS, upon a page fault, the GMS handler first identifies whether the gpn associated with the page belongs to the device it is monitoring; if so, GMS generates a getpage request. We call this the pre-fault time in the figure, which is 15.3 μs. Then, the underlying messaging system sends the message and the host CPU halts. We measured the time during which the CPU halts waiting for the page to arrive and call it the demand fetch time, which is 164 μs. After the page arrives, GMS needs to process the message and identify some attributes of this page. We call this period the post-fault time.

Figure 4.2: Latency of gms_getpage in GMS and GMS-3P

In GMS-3P, there are two possibilities when handling a page fault. One possibility is that the faulted page has been prefetched into the prefetch buffer and thus the access is a hit on the page, as shown by the second bar. In this case, GMS-3P sends the faulted page number to the prediction node, which we also call the pre-fault time (6 μs). Then, it finds the page in the prefetch buffer, and we call this time the post-fault because the page is found and the processing needs CPU cycles (18 μs). The other possibility is that the page is not in the prefetch buffer, shown as GMS-3P Miss. This is more like the getpage operation in the original GMS, where in the pre-fault period the faulted page number is also sent to the prediction node. Thus we see in the third bar of Figure 4.2 that the pre-fault time is 18 μs because of tracing the faulted page number to the prediction node.
The demand fetch time is 170 μs, which also increases because there is more processing for the prefetch-related work in the system. Thus GMS-3P introduces a 1.4% overhead and a 3.2% delay compared with the original GMS when the faulted page is not in the prefetch buffer. If the faulted page is in the prefetch buffer, the fault service time is brought down to 12.8% of the time needed in GMS. The timing in Figure 4.2 is closely related to Figure 3.2, which illustrates the processing and associated operations on the target node during the execution of getpage. In the original GMS, the pre-fault time is 15.3 μs.

The test program reads 250 pages each time, with prefetch enabled or disabled, and either hits on the pages in the prefetch buffer or needs to read from global memory. We used the median as the measured time. This was done three times for each case and we took the median, as shown in Figure 4.2.

4.3 Performance of the Synthetic Applications

In this section, we report the performance of two synthetic applications in GMS-3P. The first application is designed to have a high temporal locality access characteristic, with different access patterns to be modeled by our temporal PPM algorithm configured with different orders and depths. The second application reads a file sequentially and is suitable for the spatial algorithm.

Performance of the Temporal Algorithm

Table 4.3 shows the performance improvements of a synthetic program with different configurations of the PPM algorithm. The first column shows the order of the PPM and the second one shows the depth. Our test program creates different access patterns of temporal locality that can be modeled by PPM configured with different orders and different depths. The application computes for 600 μs between each page access. In this case, we used a 160-MB file and the total number of page faults is 78,000.

    Order  Depth  Hit-on-   Prefetched  Accuracy  Space     Prediction     I/O
                  Prefetch  Pages                 (Mbytes)  Overhead (μs)  Reduction
    1      1      13192     23386       56.4%      5.19     11             10.4%
    1      2      29013     50277       57.7%      7.68     15             20.8%
    1      3      37732     62297       60.6%     10.2      17             29.3%
    2      1      22836     35022       65.2%      7.68     13             18.1%
    2      2      47596     83563       57.0%     10.2      27             35.3%
    2      3      54413     91349       59.6%     12.7      31             43.1%
    3      1      26682     38870       68.6%     10.2      14             20.4%
    3      2      50633     87292       58.0%     12.7      31             36.8%
    3      3      56170     93263       60.2%     15.1      34             43.2%

Table 4.3: Comparison of different order, different depth PPM predictions; accessed file size is 160 MB, total number of page faults is 78,000

Except for the node running the application, the nodes in the cluster are idle and act as memory servers; one of them acts as the prediction node if prefetch is enabled.

Figure 4.3: Comparison of sequential access between GMS and GMS-3P

We ran the program three times with the same setting in the cluster of 8 nodes. The accuracy hysteresis for the spatial algorithm is set above the highest possible value to eliminate its effect. We see that the number of hits on the prefetched pages increases with higher order and greater depth; e.g., the accuracy of PPM(1,1) is 56.4% and the I/O reduction is 10.4%, while the accuracy of PPM(1,3) is 60.6% with an I/O reduction of 29.3%. However, the prediction time and memory consumed also increase under these circumstances. The prediction handling time of PPM(1,1) is 11 μs and that of PPM(1,3) is 17 μs, a 54.5% increase; the memory used also increases from 5.19 MB to 10.2 MB. We measured the time needed to service I/O in GMS and use it as a base to compute the I/O reduction when introducing prefetch; a simple model of this reduction is sketched below.
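As a rough illustration (an assumption-laden model, not a measurement), the average fault-service time for a given prefetch-buffer hit ratio h can be estimated from the latencies in Table 4.1; the measured I/O reductions in Table 4.3 are generally lower than such an estimate suggests because they also include prediction-node processing, late prefetches, and other GMS-3P bookkeeping.

    /* Back-of-the-envelope model of the average fault-service time (in
     * microseconds) given a prefetch-buffer hit ratio h, using the
     * latencies of Table 4.1.  Purely illustrative. */
    double expected_fault_time_us(double h)
    {
        const double t_prefetch_buffer = 44.0;    /* Table 4.1 */
        const double t_global_memory   = 212.0;   /* Table 4.1 */
        return h * t_prefetch_buffer + (1.0 - h) * t_global_memory;
    }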
We ignored the initial fault phase, since a first-time fault will not generate any predictions using the temporal algorithm, while the spatial algorithm may predict well. In Table 4.3, the accuracy and I/O improvement for PPM(2,3) are very similar to those of the PPM(3,3) predictions. PPM(3,3) is better because the extra prefetches it adds on top of the PPM(2,2) predictions are almost all hits. As shown in the table, we are able to reduce the I/O service time by as much as 43%.

Performance of the Spatial Algorithm

Figure 4.3 shows the performance difference between GMS and GMS-3P for a synthetic application which reads a 160-MB file sequentially. We ran the program in a 6-node cluster either enabling prefetch or only using the original GMS. This is a case where our first configuration of the PPM algorithm will not be able to make any predictions because, as discussed in Section 2.3, it works when the reference pattern appears repeatedly, i.e., it captures the temporal locality characteristics of the applications. The second algorithm, on the other hand, sees sequential access as a spatial pattern with a stride equal to 1. In this setup, even though GMS-3P uses one node as the prediction node, it still saves up to 58% of the I/O service time. The application computes for 200 μs between each page access. Except for the node running the application, the nodes in the cluster are idle and act as memory servers; one node acts as the prediction node if prefetch is enabled.

4.4 Application Performance

4.4.1 LU Decomposition

We have used LU decomposition as described in [PTVF92] to test our GMS-3P system. Though a large portion of LU decomposition is suitable for the spatial prediction algorithm, we did find that PPM(2,2) has a better prediction hit rate than PPM(1,1). Figure 4.4 shows the prediction accuracy calculated on the prediction node for the different PPM configurations. We have only shown the portion in which the temporal algorithms have made predictions. We sampled the prediction accuracy every 100 page faults received on the prediction node. Together with Table 4.4, this shows that for LU decomposition the PPM(2,2) configuration is better than PPM(1,1), while PPM(3,3) is almost the same as PPM(2,2). For this application, PPM(1,1) has a good prediction accuracy while PPM(2,2) has a better hit rate. There is no need to use PPM(3,3), since most of the access patterns can be captured by PPM(2,2), which also consumes less memory and takes less prediction time.

    Parameter   Faulted  Number   Hit         Prefetched  Accuracy  Space
                Pages    of Hits  Percentage  Pages                 (Mbytes)
    PPM(1,1)    15454    3155     20.4%       3560        88.6%     0.83
    PPM(2,2)    15656    5120     32.7%       8086        63.3%     1.48
    PPM(3,3)    15710    5252     33.4%       8257        63.6%     2.15

Table 4.4: The statistics for LU decomposition with different configurations of PPM

4.4.2 007 Benchmark

007 is an object-oriented database benchmark [CDN93] that builds and traverses a parts-assembly database. It uses a mapped file as the database source and performs several traversals on this database. In our experiment, we have used a 61-MB file as the database and there is approximately 42 MB of free memory on the active node for 007. The total number of page faults is 23,206. Table 4.5 shows the necessity of using the hysteresis accuracy control for the PPM algorithm.

    Hysteresis  Number   Prefetched  Accuracy
    Accuracy    of Hits  Pages
    0           3874     14499       26.7%
    20          3945     10029       39.3%
    40          3968      9955       39.8%

Table 4.5: The effect of the hysteresis accuracy control in 007
As described in Section 3.3.4, we use this accuracy to monitor how well an algorithm is predicting. In this test, we used PPM(2,2) with different accuracy setups to control which predictions will be issued as prefetch requests. The first column is the hysteresis accuracy required for any prediction to result in a prefetch. As described in Section 3.3.5, we have used 128 as the size of the sliding window, which, as a result, is the highest possible value for this accuracy. The second column is the number of page faults serviced from the prefetch buffer, and the third column is the total number of prefetched pages accepted on the target node. We get the fourth column by dividing the second column by the third one. In this test, we see that for 007 the prediction accuracy increases directly with the hysteresis accuracy threshold.

Figure 4.5 shows the I/O wait time comparison between GMS and GMS-3P for 007. We ran 007 with both the temporal algorithm (PPM(1,1)) and the spatial algorithm. The I/O time includes all the gms_getpage operations in both systems. In GMS, all the pages are read from global memory; in GMS-3P, some are read from the prefetch buffer and others from global memory. For GMS-3P, we also show the overhead on the host to process GMS-3P-related work, such as taking in the prefetched pages, GCD lookups for the prefetch candidates, and GCD updates for the accessed prefetched pages. We do not include the interrupt handling time in either system. The timer is started when GMS captures the fault, and is stopped when GMS finds the page.

Figure 4.5: Comparison of I/O wait time in 007 between GMS and GMS-3P

The I/O service time in GMS-3P has been reduced to 23% of that in GMS. But there is a 4.8% overhead in GMS-3P to process the extra work it involves, thus making a total service time of 27.8% of the original. In this test, the hit rate in the prefetch buffer is 83% and the prediction accuracy is 80.4% over a total of 23,206 page faults.

Chapter 5

Conclusions and Future Work

5.1 Summary

Processor speed has been increasing at an ever faster rate, but storage systems are not speeding up proportionately with processors. To reduce the I/O overhead, prefetching has been employed, with various techniques, to fetch to-be-accessed data into a higher level of the memory hierarchy of a computer. Key to the success of any prefetching technique is the ability to know or predict data accesses before they happen. There are two approaches to address this problem, one through hints of future references and the other using predictive methods to speculate about future references. The idea of using predictive methods to prefetch originates from pattern detection. The problem of detecting patterns in an access stream for the purpose of predicting a future access is similar to that of data compression. To make use of prediction-based prefetching, the system must be able to predict future references accurately and efficiently. The key limiting factor in traditional approaches to prediction-based prefetching is prediction overhead, in terms of computation and memory usage.

We have proposed a novel framework to prefetch through an idle node in the SAN on behalf of an active node which runs an I/O-intensive application. This configuration allows predictions to run in parallel with a target application.
Chapter 5

Conclusions and Future Work

5.1 Summary

Processor speed has been increasing at an ever faster rate, but storage systems are not speeding up proportionately. To reduce I/O overhead, prefetching has been employed in various forms to fetch soon-to-be-accessed data into a higher level of a computer's memory hierarchy. Key to the success of any prefetching technique is the ability to know or predict data accesses before they happen. There are two approaches to this problem: one relies on hints about future references, and the other uses predictive methods to speculate about them. The idea of using predictive methods to prefetch originates in pattern detection, and the problem of detecting patterns in an access stream in order to predict a future access is similar to data compression. To make use of prediction-based prefetching, the system must be able to predict future references accurately and efficiently. The key limiting factor in traditional approaches to prediction-based prefetching is prediction overhead, in terms of both computation and memory usage. We have proposed a novel framework in which an idle node in the SAN prefetches on behalf of an active node running an I/O-intensive application. This configuration allows prediction to run in parallel with the target application.

We have implemented this parallel predictive prefetching system based on the Global Memory System, GMS; we call our new system GMS-3P. In our system, we minimized the prefetching overhead on the target node, which only needs to send the application's fault stream to an idle node in the cluster. By modifying an existing Myrinet control program, Trapeze, we reduced this overhead to 3 μs on the host preparing a message. We employed two different configurations of PPM, one based on the absolute fault trace of an application and the other on the relative differences between successive faults. Each algorithm captures a particular kind of pattern in the fault trace and predicts the most likely future references: the first addresses temporal locality, and the second targets the spatial locality an application may exhibit. We used a threshold within the prediction algorithm and a hysteresis accuracy threshold to control the quality of our predictions. We have implemented GMS-3P on top of GMS in the FreeBSD operating system and reduced the page fault service time at the application level from 212 μs in GMS to 44 μs in GMS-3P. We have reduced the overhead on the active node, both in memory usage and in CPU cycles, by exploiting another idle workstation in the cluster. Our execution-driven measurements show the benefit of using multiple algorithms to capture the different behaviours of an application at different times, thereby increasing the prediction power of the system. The spatial algorithm predicts when there are repeated patterns in the relative differences of the fault trace, while the temporal algorithm predicts when there are patterns in the fault trace itself. For some applications, we can reduce the I/O stall time by as much as 77%, while the overhead of processing the required work in GMS-3P is 4.8% compared to GMS.

5.2 Future Work

This section enumerates several aspects of the current design and implementation that could be changed.

Locate an Idle Node and Handle the Idle-to-Active Transition

The current system chooses a node at random and assumes it is idle. There are many existing schemes to test whether a node is idle, and it would be easy to extend the system to dynamically locate an idle node before prefetching begins. A more interesting question is how to handle the transition of a node from idle to active.

Handle the On-going Prefetching

We have not handled the situation in which the currently faulted page is already on its way to the active node. This wastes both fault service time and network resources. Since prefetching on the active node is passive, an easy approach would be to send a small message to the active node at the time the prefetch is issued. It is not obvious how much saving would result, given the cost of the new message, the additional message handling on the active node, and the frequency of this situation. On the active node, since fault service and prediction use two different threads, it is possible to wake up the fault service thread if the awaited page is the one just prefetched, which could further reduce fault service time (a rough sketch of this check appears below, after the discussion of multiple prediction nodes).

Implementing Multiple Prediction Nodes

As described in Section 3.5, it is possible to introduce more nodes to make predictions for an active application. The tradeoff is improved prediction latency at the cost of additional system messages.
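As referenced under "Handle the On-going Prefetching" above, the check on the active node could look roughly as follows. This is a hypothetical user-level sketch, not GMS-3P code: the pending table, the function names, and the use of a condition variable in place of the kernel's own synchronization primitives are all assumptions made for illustration.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical user-level sketch of the "on-going prefetch" check; the
     * real fault path lives in the kernel and would use its own
     * synchronization primitives.  The prediction node is assumed to announce
     * each issued prefetch with a small message; the active node records the
     * page as pending, and a faulting thread that finds its page pending
     * waits for the arrival instead of issuing a duplicate getpage.
     */
    #define PENDING_SLOTS 64

    static uint64_t        pending[PENDING_SLOTS];   /* 0 marks an empty slot */
    static pthread_mutex_t pending_lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  pending_arrived = PTHREAD_COND_INITIALIZER;

    /* Called when the announcement for an issued prefetch arrives. */
    void prefetch_announced(uint64_t page)
    {
        pthread_mutex_lock(&pending_lock);
        for (int i = 0; i < PENDING_SLOTS; i++) {
            if (pending[i] == 0) {      /* first empty slot; if none, the announcement is dropped */
                pending[i] = page;
                break;
            }
        }
        pthread_mutex_unlock(&pending_lock);
    }

    /* Called when the prefetched page itself arrives in the prefetch buffer. */
    void prefetch_arrived(uint64_t page)
    {
        pthread_mutex_lock(&pending_lock);
        for (int i = 0; i < PENDING_SLOTS; i++)
            if (pending[i] == page)
                pending[i] = 0;
        pthread_cond_broadcast(&pending_arrived);    /* wake any waiting fault thread */
        pthread_mutex_unlock(&pending_lock);
    }

    /*
     * Fault path: returns true if the page was in flight and has now arrived,
     * so the fault can be served from the prefetch buffer; returns false if
     * the page was never pending and a normal getpage should be issued.
     */
    bool wait_if_in_flight(uint64_t page)
    {
        bool was_pending = false;
        pthread_mutex_lock(&pending_lock);
        for (;;) {
            bool still_pending = false;
            for (int i = 0; i < PENDING_SLOTS; i++)
                if (pending[i] == page)
                    still_pending = true;
            if (!still_pending)
                break;
            was_pending = true;
            pthread_cond_wait(&pending_arrived, &pending_lock);
        }
        pthread_mutex_unlock(&pending_lock);
        return was_pending;
    }

Whether such a check pays off depends, as noted above, on the cost of the extra announcement message, the additional handling on the active node, and how often a faulted page is actually found to be in flight.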
Prefetching From Disk into Global Memory and/or to Local Memory

As we have demonstrated with GMS-3P, prefetching from global memory into local memory can reduce the page fault service time from 212 μs to 44 μs, an improvement of almost a factor of five. This benefits an I/O-intensive application in which most of the predicted pages have already been accessed and subsequently evicted to global memory by the time they are predicted. Prefetching from global memory also lowers the penalty of wrong predictions and loosens the constraints of high prediction accuracy and short inter-fault times. If we further extended the system to prefetch from disk, the performance improvement would be greater, provided the accurate prefetches arrive in memory before they are accessed. Prefetching from disk would, however, require much more restrictive constraints on prediction accuracy, as mistakes are more costly in this context.

Prioritize the Original GMS getpage Messages and the Predicted Page Forwarding Messages

In our GMS-3P prototype, we do not prioritize between the original GMS getpage messages and our new predicted page forwarding messages. This can hurt performance because the two share the same system resources, e.g., network bandwidth, the active node's CPU, and the network interface. If page forwarding takes place concurrently with a getpage operation, the getpage operation is prolonged by the longer message queue and the competition from the page forwarding messages.

Dynamically Adjust the Parameters of the Prediction Algorithm

Assuming we can control the quality of predictions well, most of the remaining penalty of prefetching comes from late predictions. One cause of late prefetched pages is that the inter-fault time is too small for prefetched pages to be interleaved. Successful prefetching reduces the inter-fault time as well, which puts further pressure on accurate and efficient prefetching. One way to address this is to monitor the inter-fault time on the prediction node: if consecutive faults occur within a certain time threshold, we keep making predictions but do not issue the prefetch requests. In this way we explicitly reduce the risk of wasting resources on predictions that have a high probability of being late.

Bibliography

[ACG+98] D. Anderson, J. S. Chase, S. Gadde, A. Gallatin, A. Lebeck, K. Yocum, and M. Feeley. Cheating the I/O bottleneck: Network storage with Trapeze/Myrinet. In Proceedings of the USENIX Technical Conference, June 1998.

[ADV+95] R. Arpaci, A. Dusseau, A. Vahdat, L. Liu, T. Anderson, and D. Patterson. The interaction of parallel and sequential workloads on a network of workstations. In Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 167-178, May 1995.

[AES97] A. Acharya, G. Edjlali, and J. Saltz. The utility of exploiting idle workstations for parallel computation. In Proceedings of SIGMETRICS '97, May 1997.

[AS98] A. Acharya and J. Saltz. Using idle memory for data-intensive computation. In Proceedings of SIGMETRICS '98, May 1998. Poster session.

[BCF+95] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W.-K. Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, February 1995.

[BCW90] T. C. Bell, J. C. Cleary, and I. H. Witten. Text Compression. Prentice-Hall Advanced Reference Series, 1990.
[BHK+91] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a distributed file system. In Proceedings of the ACM 13th Symposium on Operating Systems Principles, pages 198-212, October 1991.

[BKL+99] Gretta Bartels, Anna Karlin, Henry Levy, Darrell Anderson, and Jeffrey Chase. Potentials and limitations of fault-based Markov prefetching for virtual memory pages. In SIGMETRICS '99, May 1999. Poster session.

[CB92] Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51-61, October 1992.

[CDN93] M. J. Carey, D. J. DeWitt, and J. F. Naughton. The OO7 benchmark. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 257-266, May 1993.

[CFKL95] Pei Cao, Edward W. Felten, Anna R. Karlin, and Kai Li. A study of integrated prefetching and caching strategies. In Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 188-197, May 1995.

[CFKL96] Pei Cao, Edward W. Felten, Anna R. Karlin, and Kai Li. Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling. ACM Transactions on Computer Systems, 14(4), November 1996.

[CKV93] K. M. Curewitz, P. Krishnan, and J. S. Vitter. Practical prefetching via data compression. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 257-266, May 1993.

[CNSW97] A. Chowdhury, L. Nicklas, S. Setia, and E. White. Supporting dynamic space-sharing on clusters of non-dedicated workstations. In Proceedings of the 17th International Conference on Distributed Computing Systems, 1997.

[Den68] Peter J. Denning. The working set model for program behavior. Communications of the ACM, 11(5):323-334, May 1968.

[DWAP94] M. Dahlin, R. Wang, T. Anderson, and D. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proceedings of the Conference on Operating Systems Design and Implementation, November 1994. USENIX.

[FCLJ98] Li Fan, Pei Cao, Wei Lin, and Quinn Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. In Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, May 1998.

[Fee96] Michael J. Feeley. Global Memory Management for Workstation Networks. PhD thesis, Department of Computer Science, University of Washington, 1996.

[FPK+95] M. Feeley, W. Morgan, F. Pighin, A. Karlin, H. Levy, and C. Thekkath. Implementing global memory management in a workstation cluster. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995.

[GA94] James Griffioen and Randy Appleton. Reducing file system latency using a predictive approach. In Conference Proceedings of the USENIX Summer 1994 Technical Conference, pages 197-208, June 1994. USENIX.

[GA96] James Griffioen and Randy Appleton. The design, implementation, and evaluation of a predictive caching file system. Technical Report CR-264-96, Department of Computer Science, University of Kentucky, June 1996.

[Gin77] J. D. Gindele. Buffer block prefetching method, July 1977.

[JK97] Zhimei Jiang and Leonard Kleinrock. Prefetching links on the WWW. In IEEE International Conference on Communications, pages 483-489, June 1997.
[KL96] Thomas M. Kroeger and Darrell D. E. Long. Predicting file-system actions from prior events. In Proceedings of the USENIX 1996 Annual Technical Conference, January 1996.

[KLM97] Thomas M. Kroeger, Darrell D. E. Long, and Jeffrey C. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, December 1997.

[LD97] H. Lei and D. Duchamp. An analytical approach to file prefetching. In Proceedings of the USENIX 1997 Annual Technical Conference, January 1997. USENIX.

[Lib99] V. Liberatore. Empirical investigation of the Markov reference model. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.

[MBKQ96] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, 1996.

[MDK96] T. Mowry, A. Demke, and O. Krieger. Automatic compiler-inserted I/O prefetching for out-of-core applications. In Proceedings of the 2nd Symposium on Operating Systems Design and Implementation, October 1996.

[MG91] Todd Mowry and Anoop Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, June 1991.

[PG94] R. Hugo Patterson and Garth A. Gibson. Exposing I/O concurrency with informed prefetching. In Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems, September 1994.

[PGG+95] R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. Informed prefetching and caching. In Proceedings of the 15th Symposium on Operating Systems Principles, pages 79-95, December 1995.

[PGK88] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM Conference on the Management of Data (SIGMOD), pages 109-116, June 1988.

[PGS94] R. Hugo Patterson, Garth A. Gibson, and M. Satyanarayanan. Transparent informed prefetching. ACM Operating Systems Review, pages 21-34, April 1994.

[PM96] V. N. Padmanabhan and J. C. Mogul. Using predictive prefetching to improve World Wide Web latency, July 1996.

[PTVF92] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1992.

[PZ91] Mark Palmer and Stanley B. Zdonik. Fido: A cache that learns to fetch. In Proceedings of the 17th International Conference on Very Large Data Bases, pages 255-264, September 1991.

[Smi78] A. J. Smith. Sequential program prefetching in memory hierarchies, December 1978.

[SS95] David Steere and M. Satyanarayanan. Using dynamic sets to overcome high I/O latencies during search. In IEEE HOTOS-V, 1995.

[TD91] Carl Tait and Dan Duchamp. Detection and exploitation of file working sets. In Proceedings of the 1991 IEEE 11th International Conference on Distributed Computing Systems, pages 2-9, May 1991.

[TE93] Dean M. Tullsen and Susan J. Eggers. Limitations of cache prefetching on a bus-based multiprocessor. In Proceedings of the 1993 International Symposium on Computer Architecture, pages 278-288, May 1993.

[TPG97] A. Tomkins, R. Hugo Patterson, and Garth A. Gibson. Informed multi-process prefetching and caching. In Proceedings of the ACM International SIGMETRICS Conference on Measurement and Modeling of Computer Systems, June 1997.
[VAK+98] G. Voelker, E. Anderson, T. Kimbrel, M. Feeley, J. Chase, A. Karlin, and H. Levy. Implementing cooperative prefetching and caching in a global memory system. In Proceedings of the ACM SIGMETRICS Conference on Performance Measurement, Modeling, and Evaluation, June 1998.

[VK96] J. S. Vitter and P. Krishnan. Optimal prefetching via data compression. Journal of the ACM, 43(5):771-793, September 1996. An earlier version of this work appeared in IEEE FOCS (1991).

[ZL78] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24:530-536, September 1978.
