Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

On managing visibility of resources in social networking sites Su, Ying 2008

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2008_fall_su_ying.pdf [ 964.14kB ]
JSON: 24-1.0051240.json
JSON-LD: 24-1.0051240-ld.json
RDF/XML (Pretty): 24-1.0051240-rdf.xml
RDF/JSON: 24-1.0051240-rdf.json
Turtle: 24-1.0051240-turtle.txt
N-Triples: 24-1.0051240-rdf-ntriples.txt
Original Record: 24-1.0051240-source.json
Full Text

Full Text

On Managing Visibility of Resources in Social Networking Sites by Ying Su B. Engineering, Tsinghua University, Beijing, China, 1994 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) October 2008 © Ying Su Abstract The social network sites has been very popular these days. In these systems, re sources are associated with each user (i.e., network node) in the network, and these resources include both soft and physical resources, such as pictures, videos, books, etc. Users would like to control who can see their resources, e.g., any one, only friends, friends and families, friends of friends, etc. The rules defining who can see the resources are called visibility views, and in our system they are identified by the owner nodes and regular path expressions. The users can issue queries for discovering resources, e.g., “Find all lawnmowers in my three degree network that are visible to me”. The evaluation process needs to check all candidate resources to see if they are visible to the query issuer, and this requires computing the visibility views associated with every candidate resources. The visibility views are expen sive to evaluate. In order to facilitate the visibility query answering, it is necessary to pre-compute and materialize the views in the so called “cache”. But because of the tremendous volume of the whole view data, we have to select only a portion of them tomaterialize, based on metrics such as the views’ sizes, the costs to evaluate them, the popularity of references, the temporal locality of requested views, and the correlation relationships among them. The problem of trying to select the views in a static way is called the static view selection problem, and it has been studied ex tensively by the database community, mostly for answering OLAP queries in data warehouses. We call it “static” because it pre-computes and materializes the views in one lump-sum, based on statistics accumulated before the query sequence starts. This problem has a deep root from the caching problem studied in the the ory and system communities. The researches done on caching policies solve the problem in an ad-hoc dynamic way by making decisions at each request, which II make use of the temporal locality. Many ad-hoc policies have been proposed in the literature. Specifically, there is a branch of research done based on the so called Independent Reference Model, in which view requests are assumed to form a fixed time-independent sequence. It was shown that under this model optimal policies can be found. Coffman and Denning [13] presented a policy called A0 that is optimal with respect the long-run average metric for paging, where the cached objects (i.e., memory pages) have uniform sizes and retrieval costs. Bahat [7] later presented a stationary Markov replacement policy C, which is optimal for cases in which the retrieval costs for the cached objects are arbitrary but their sizes are uniform. In this thesis we address the problem one step further: to assume both the sizes and costs are arbitrary. We reveal the intrinsic relationship between the static view selection problem and the caching problem, and show that, still under the indepen dent reference model, any dynamic replacement policy (random or deterministic) are no better than the optimal static selection algorithm in terms of the long-run average metric. We then present a branch of optimal history dependent replace ment policies for the arbitrary sizes and arbitrary costs case, and show that policy A0 and C0 are special cases of our policy. Furthermore, we prove that a policy is optimal if and only if the induced stochastic process is a unichain. We also study the approximate static algorithms and policies: we bring up polynomial static algo rithms that are both K-approximate with regard to the fractional optimum solution and HKF-approximate to the integral optimum solution, where K is the size of the cache, K’ is the total size of all view objects minus K, and HKI is the harmonic number of K’. In addition, we present a K-competitive dynamic policy and show that K is the best approximation ratio for both the static algorithms and dynamic policies. 111 Table of Contents Abstract ii Table of Contents iv List of Tables vi List of Figures vii Glossary viii xl’ 12451314171722252837 Acknowledgments 1 Introduction . .-. . - 1.1 The Social Networking System . 1.1.1 LogicalModel 1.1.2 Visibility Views 1.2 Problem Definition and Related Work 2 Optimal Algorithms Under IRM 2.1 The Static Selection Problem 2.2 Finding The Optimal Policies For The Caching Problem 2.2.1 The MDP Framework 2.2.2 The Objective Value of a Policy 2.2.3 Comparing the Objective Values 2.2.4 The Way to Calculate The Average Reward . 2.2.5 Finding The Optimal Policies iv 3 Approximation Algorithms 42 3.1 Approximate Static Selection Algorithms 43 3.1.1 A Greedy Algorithm With Bound K 43 3.1.2 The Lower Bound of The Approximate Ratios 48 3.1.3 The KnapSack Revisited 49 3.1.4 An Extended Greedy Algorithm 50 3.1.5 A Mediated Greedy Algorithm 53 3.2 Competitive Policies 57 3.2.1 The K-Competitive Policy 58 3.2.2 Comment To Related Works 61 4 Conclusion and Future Work 62 4.1 The Problem Remaining 62 4.1.1 Overtaking Optimal Policies 62 4.1.2 The Problem For the Naive Version Problem, Online Case 63 4.1.3 The Problem For The Dependence Case 63 Bibliography 65 v List of Tables 1.1 An instance where Algorithm A is not optimal 11 3.1 An instance where the integrality gap is K 48 3.2 An instance where Extended Greedy is the best 54 3.3 The result of the three algorithms 54 vi List of Figures 2.1 The Value Iteration Algorithm 37 2.2 The D0 Algorithm 40 2.3 The D Algorithm 40 3.1 Greedy Algorithm 47 3.2 Knapsack’s Polynomial 2-Approximation Algorithm 50 3.3 The Extended Greedy Algorithm 51 3.4 The Mediated Greedy Algorithm 57 3.5 The K-competitive Policy 58 vii Glossary G The social network graph model. G = (,) Y’ The set of all nodes in the social network graph. 6 The set of all edges in the social network graph. The visibility view definition The set of all tags in the system V The set of all views N Total number of distinct views in V e1 The evaluation Cost of view v r The retrieval cost of view v b The benefit by caching view v1. b = e — r1 J3 The probability that view v1 will occur under the Independent Reference Model c (e — rj)pj for view v lvii Sizeofviewv1 K Capacity of the cache viii K’ EL1IviI—K S The subset of V that is selected for materialization by the static selection. C Random variable representing the cost of answering a view request E(C) The expected cost of answering a view request E* (C) The expected cost of answering a view request for the optimal integral so lution E(C) The expected cost of answering a view request for the fractional integral solution a A constant equal to p1e, 6 A constant equal to pjrj B The expected benefit for cache content S of the static selection solution. Bmvc The maximum expected benefit of the static selection solutions. Cmin The minimum expected cost of the static selection solutions. Cmin a Bmax t The decision epoch time t T The set of decision epochs T A constant used for time Q, The query at time t S The cache content at time t G The set of all states 0 with cache content s S The cache content space 9, The random variable of MDP state at time t ix 0 The state space 0 variable representing an MDP state A0 The action space for state 0 A The action space for the MDP system Pt(Oj 01,a) The transition probability from state 0 to state O for action a at time w(0,a) One step reward when taking action a at state 0 c1(0,a) One step cost when taking action a at state 0 d1(0) The decision at state 0 at time t h1 The history up to time t for some policy H The space of the history up to time t for some policy tw’ (s) The expected total reward of policy ic e 11HR tc(s) The expected total cost of policy E rIKR aw (s) The average reward of policy JJHR ac(s) The average cost of policy flHR to Sample path, a sequence of states and actions for a policy 2 The space of all sample paths for a policy W A query sequence = {Qi , Q2,. P The one-step transition matrix of the induced Markov chain by some stationary policy r p(0jI0j) Pa’s element at row i and column j representing the probability of one step state transfer from 6 to 8 by the decision of policy r on state 0 x P Cesaro limit of P,r D1 The 1’ closed communicating class L A constant representing the number of closed communicating classes P1 The transitions between states in closed communicating class D1 under a policy The limiting transitions between states in closed communicating class D1 under a policy Vç The split view p The cost density: (e1 — rj)Pj/IvjJ xi Acknowledgments First and foremost, I would like to express my gratitude to my supervisor, Professor Laks V.S. Lakshmanan, for more than two years of guidance, inspiration, and sup port. He always discusses the research topics with me in great details and give me valuable directions. Without his support, this work would not have been possible. Meanwhile, I must add my appreciation to Dr. Son,Vuong, my co-supervisor, who supported me in the first phase of my research work and gave me the maximum freedom in my research. I am also grateful to Dr. Norm Hutchinson, the second reader of my thesis, for his technical and editorial support. Dr. Hutchinson agreed to read my thesis in his sabbatical days, his kindness and valuable suggestions have helped me a lot. I would also like to thank Ms. Hermie Lam, Ms. Holly Kwan, and Ms. Joyce Poon, for all the help they have provided. Moreover, I would like to extend my thanks to all members of the DB lab and the MC lab for the useful feedback and fun times. Most of all, I would like to thank my family for their unconditional love and support. xii Chapter 1 Introduction In this thesis we will be working on the view management problem which emerges in building an effective and efficient social networking system. The system we are building supports not only facilities for social networking and community build ing, but also supports transactions including listing items (e.g., books, DVD play ers, and lawnmowers) that you wish to trade with, and creating a list of items you wish to buy/borrow (i.e., your wish list), In particular, it allows the users to set up flexible control over access to their resources or even links, and permits efficient querying of social neighborhoods for potential contacts, items of interest, etc., tak ing access control into account. We call a system featuring these functionalities a social network database system. In the first section we will briefly introduce the system model, then the visibility view in particular. In the next section we for mally introduce the view selection problem and its connection with the file caching problem, and some related work done on them. 1.1 The Social Networking System Recently, there has been a spurt of activities in the development of popular web sites that enable the sharing/exchanging of ideas, opinions, bookmarks, and blogs, as well as personal multimedia data such as pictures, music, and video, between like minded individuals. In addition, some systems support online communication such as instant messaging. Examples include MySpace [4], Flickr [2], YouTube [6], [1], Linkedln [3], and Yahoo!360 [5]. All of these exploit an un derlying general framework of a social network. The Wikipedia defines a social network as “a social structure made of nodes which are generally individuals or organizations. The nodes are connected through various social formalities ranging from casual acquaintance to close familial bonds.” The term was first coined by Barnes in 1954 in his classic work [81. Some of the systems above emphasize the social aspect more than others, but in the end, they all use a social network of some kind as the backbone. A generic social network system should allow the users in the network to own and share resources like images, videos, etc., and define their neighbors with tags created by themselves. In addition, the system should support item transactions (rent, borrow, etc.) and efficient matching of the users wish items and the items listed by other users. Specifically, the users should be able to control who can access their resources so that when other users are querying the resources, only those visible ones will be returned. For example, the following query should only return the items within Mary’s 2-degree network that she can see: Qi : “find all items that are listed by Mary’s 2-degree network, are visible to Mary, and satisfy the constraint itemName = DVD Player & year> 2000.” The system should be able to efficiently process ad hoc queries using con straints on the item properties as well as constraints on users/nodes based on their social neighborhood, taking visibility into account. Thus we require a model which permits effective representation of information and a query language which is easy to use, is powerful and captures use cases such as the above naturally, and admits efficient evaluation. So we will briefly introduce our model and language in the following subsections. 1.1.1 Logical Model Our logical model for a social network database system consists of a directed graph G = (, S), whose nodes t’ correspond to individuals that we refer to as users. There is an edge (u, v) E f whenever u and v are involved in some kind of social 2 relationship. We henceforth refer to edges as links. We assume the graph is non symmetric: the link (u,v) is in doesn’t mean that (v,u) is also in . This is to allow one-way relationship, e.g., one user may use another user’s articles as her/his web feed without building a two-way relationship. We assume that there is a “user profile” associated with each node. In addition, each node contains a collection of resources, which could be electronic belongings such as images, music,etc., or the items the user listed for trade, or wishlists regard ing the items the user wants to get. A node may have zero or more resources. The user profile, resources and links are all modeled by a set of attribute-value pairs of the form A = v, together with some keywords. An example of an item is “DVD, Oscar nominee, Denzel Washington, thriller, Year = 2006, Title = Deja Vu, Rating = five star, Director = Tony Scott.” There is no assumption about the “types” of items nor their schema. E.g., different instances of DVD in the same or different nodes may have different sets of attributes defined for them. The set of keywords or the set of attribute-value pairs describing an item may be empty. Specifically, we use a kind of link attribute called link labels or tags to model the relationships. E.g., a link (u, v) with label friend indicates that u considers v its friend. We permit any number (no less than one) of tags on a link. E.g., a link (u, v) with labels {friend, doctor} means v is u’s friend as well as doctor. The one-degree network of a user u is the set of all users adjacent to it, i.e., onedegree(u) = {v I (u,v) e Sz}. We regard a user as the owner of its one-degree network, with the user at the center and the members at its spokes. For any k 1, the k-degree network (kdegree(u)) and any degree network (anydegree(u)) are defined in the obvious way: kdegree(u) consists of nodes reachable from u by paths of length at most k, while anydegree(u) consists of nodes reachable from u. By convention, we exclude u from these networks as a technical convenience. Notice that the k degree network of u includes nodes reachable by paths of length <k. Visibility in this model is achieved by permitting a user to dictate who can possibly access its items, wishes or other resources. The specification can take the form of a list of users or predefined groups (like Yahoo! groups). Additionally, we permit the notion of visibility views, which are queries expressed in a simple intuitive query language, to be discussed shortly. 3 1.1.2 Visibility Views Visibility of a user’s items and wishes is controlled by the user by means of spec ifying a visibility view. In addition, a user may want to dictate who can see its outgoing links. As an example of the latter, John has an outgoing link to Mary labeled closefriend which he wants to be visible to only his close friends and their close friends. We achieve both of these by means of visibility views. A visibility view for a specific user is a restricted regular path expression defined as follows: = = = = ‘private’ I ‘public’ In the above definition . means concatenation, + means alteration, * is the Kleene star, and is the alphabet of all tags in the system. Basically, represents queries which are either the constant visibility specs ‘public’ and ‘private’ or a a union of conjunctive queries, and the conjunctive term could be a specific link tag, a wildcard tag, and the transitive closure a specific link tag or the wildcard tag. This definition allows a limited form of recursion, e.g., a recursive view for user Mary might be: Mary: friend* doctor Informally, the semantics of a visibility view at a node u is the set of nodes related to u in a manner that satisfies . More precisely, II(u)II, the set of nodes satisfying the expression at node u, is defined as follows. • When is ‘public’, II(u)II = {v e G}, i.e., all nodes in the network. • When @ is ‘private’, I.@(u)II = {u}. • When is E, IL(u)II ={vE GI v is the one degree network of u} 4 • When is , II(u)II = {v E G I v is reachable from u}. • When is t, where t is a tag, I(u)lI = {v G I a link (u,v) G, with tag t}. • Whenist.C,wheretisatag,thenJ(u)II={vE GIwe G: Gcontains alink (u,w) with label t & v E IIC(w)II}. • WhenisCl+Cz++Ck, II(u)Il=IICl(u)IIU”UWCk(u)Il. The most expensive part of evaluating queries with visibility constraints is the evaluation of visibility views, or more precisely, finding these graph patterns de— scribed by the regular expressions. This is because a regular expression can de scribe arbitrary long paths in the graph which means in turn an arbitrary number of physical access to the database. The evaluation of the views defined here shares many similarities with the XML regular path query evaluation. The ability to an swer a query using the answers to views can significantly decrease query evaluation time. For example, in query Qi if all the visibility rules in Mary’s 2-degree net work are evaluated and materialized, then the evaluation process just needs to check whether Mary is in the resulting node sets of the resources within Mary’s 2-degree network, and does not need to evaluate the rules from scratch. If the views mate rialized can not completely answer a query, they can, nevertheless, speed up query processing, because some views needed for the query processing may be obtained through other views using the query rewriting technique. On the other hand, view maintaining can be computationally expensive, and the amount of space that is available to store view results could be bounded. These constraints lead to the nat ural optimization problem, known as the view selection problem, which is, roughly speaking, asking for a set of views that can answer given queries and satisfy spec ified constraints while minimizing the average expected cost of query evaluation. In the next section we will formally introduce the problem and its related work. 1.2 Problem Definition and Related Work As mentioned in last section, query evaluation can be thought of as checking whether the user that issued the query belongs to the visibility view associated with 5 each item satisfying the query. Thus the view selection problem aims at reducing the query answering cost (in terms of the query execution time). As a convenience for theoretical analysis, we can regard that the queries are made of a very long view request sequences. The objective is to minimize the expected cost of answering any view request. The view selection problem is defined formally as follows: Definition (The Static View Selection Problem) Given a set V of N views v , v2,... , where each view vg has the following specified functions • size ivl; • evaluation cost e; • retrieval cost if materialized r,; • probability estimation f3 (subject to L f3 = 1) and possibly a partial order relation on V representing the dependence relation among views, find a subset S of V to materialize, such that the average time taken to evaluate a view is minimized, with the constraint that the total size of S is no more than an integer constant K. This problem emerged in answering OLAP queries in the data warehouse en vironment, whereby the views form a data cube lattice. Plenty of work has been done on it, such as [21],[20],[19],[231,[l 8]. Among them, the pioneering work that established the problem was [21], most of others were follow-up works based on it. In that paper the authors argued that the problem, with the dependence relation existing among the views, is NP-complete even when the view sizes and probabili ties are uniform. In our case, however, a visibility view for a specified resource can be treated as a unit though it can also be decomposed to subviews among which dependence relation exists. Since solving the problem of no dependency can be seen as a basis for solving the problem of the dependent case, we will start from the simpler case and only talk about the no dependence case in this thesis. The static view selection problem is closely related to the (file) caching prob lem defined below. 6 Definition (The Caching Problem) Given a set F of N files fi ,f2,.. . ,fii, where each file fj has the following specified values • size vI; • retrieval cost from remote server e; • retrieval cost from the cache r1; • possibly a probability estimation 13 (subject to 2=i 13. = 1) and a sequence of file requests, maintain files in the cache of size K to satisfy the requests while minimizing the total (or average) retrieval cost. This is a well studied problem in both the systems and databases area since it plays a key role in many applications from the memory replacement policies in operating systems to the web proxy file caching. To allow the term to be general enough to catch all these applications, we simply call it the “caching problem”. The caching problem can have objectives other than the cost, such as the byte hit rate, network traffic, etc. However in this thesis we mainly consider the total/average cost criteria for the queries since it is a most natural and commonly used criteria. Unlike the static view selection problem, the caching problem aims at finding a replacement algorithm/policy that can dynamically adjust the cache content so as to minimize the total/average cost. So we can find out that the two problems are indeed seeking the same goal but with different ways which are dedicated by the application requirements. We now take a look at several variants of the caching problem and introduce some related work. In order to understand how good a replacement algorithm/pol icy is, [35] introduced the notion of competitiveness here: Definition (Competitiveness) A replacement policy g is said to be k-competitive if the following holds: cost(g) k. cost(OPT) + c where OPT is the optimal policy and c is a constant. 7 The notion of competitiveness is sometimes referred to as approximation. Follow ing the conventions, we will use the word “competitive” for dynamic replacement policies and “approximate” for the static selection algorithms. Uniform sizes, uniform costs With the restrictions that all file sizes and costs are the same, the problem is just paging. Paging requires mandatory admission, i.e., every requested page must be admitted into the cache. Sleater and Tarjan showedthat Least Recently Used (LRU) and First In First Out (F I FO) are K-competitive, assuming the optimal algorithm uses the same size of cache. Here K is the size of cache. They also showed that K is the best competitive ratio for deterministic on-line algorithms, The algorithm Least Frequently Used (LFU) evicts the page in the cache that was requested the fewest number of times while not considering the recency information. LFU seems to perform well on Web request sequences, but it is not a competitive policy if the query sequence does not presume some distribution pattern. Lee et al [261 designed a spectrum of hybrid policies between LRU and LFU. They named these policies LRFU with the parameter A varying between A = 0 (LRU) and A = 1 (LFU). Cohen [15] proved that the competitive ratio of LRFU is K+ — 1, where 0.5 A < 1 is the decay parameter used by the LRFU algorithm, and K is the size of the cache. O’Neil presented an algorithm called LRU-K in [27] which keeps track of the times of the last K references to popular pages. Uniform sizes, arbitrary costs This special case is called weighted caching. There are two well known algorithms for this problem: Balance[ll] by Chrobak, Karloff, Payne, and Vishwanathan, and GreedyDual[34] by Young. Both of these two al gorithms were shown to be K-competitive. arbitrary sizes, cost=1 or cost=size This is motivated by the importance of file size in caching for world-wide-web applications. For this problem, Irani gave O(log2k)-competitive algorithms for both the cost=1 and cost=size cases [22]. 8 arbitrary sizes, arbitrary costs This problem is calledfile caching or web caching. Interestingly enough, this problem was also shown to have solution with the com— petitive ratio K. In [10] Cao and Irani presented the algorithm Gree dyDua 1—Size, in another independent work, Young brought up a similar algorithm LANDLORD, and proved it is K-competitive using potential functions. The research results introduced above were mostly competitiveness results when the future is unknown, i.e., they were focusing on finding the competitive algorithms/policies. For most of the problems, it was shown that the best compet itive ratio for deterministic algorithms is K where K is the cache size. We need to mention that when the sizes are arbitrary, the competitive ratios of these algorithms are with respect to the integral optimal solution in which no partial views can be se lected/admitted. Some policies such as the Least Frequently Used would perform arbitrarily bad for some specific query sequence, because the query se quences are not confined to any rules or probabilistic models and it’s always pos sible to make up a sequence that is bad for it while good for the optimal solution. But what if we knew the future sequence (offline case), or even some partial infor mation about it? Can we find out the optimal policies for it? Belady showed an optimal offline algorithm B0 for the paging problem in his seminal work [9]. The algorithm assumes the future is known and evicts the page with the longest forward distance, i.e., will be accessed furthest into the future. Bo is optimal for every query sequence. But as far as the author is aware, for the arbitrary sizes, arbitrary costs case, there is no simple algorithm known to be optimal except the brute force dynamic programming. To study the performance of paging algorithms in terms of expected cost, Coffman and Denning introduced the Independent Reference Model [13]. It was a simple probability model under which optimal policies (in terms of expected cost) can be found. The model assumes the query sequence Qi Q2 is a se quence of independent random variables with the common, stationary distribution = f3i j3,. . .,fl, such that P[Qj = i] = f3j for all t 1. Though not a realis tic representation of query patterns, it provides insights for analyzing algorithmic behavior and designing algorithms under non-stationary models. For the uniform sizes and uniform costs case, Coffman and Denning presented 9 a policy called A0 which is optimal in terms of fault rate [13] A0 prescribes: When the cache is full, Evict v1 if i = arg1min(f3, : v in cache) The idea behind policy A0 comes in conforming with policy B0. Let the ran dom variable d (x) denote the forward distance to page x just after Qt. Under the independent reference model, d (x) follows the stationary geometric distribution P[d1(x)=k]=f3(l_P)k_l, k=1,2,... The mean value of d(x) is 1/Ps, so evicting the file with smallest 13 is equiva lent to evicting the file with the greatest expected forward distance. A0 is optimal only for the paging problem. But for either the view selection problem or the file caching problem, both the sizes and costs are important. Fiat and Rosen [17] showed that caching policies adapted from memory management applications that don’t take size into consideration do not work well in practice. Costs are important as well. In many cases, the costs are neither uniform nor proportional to the size. An example might be the views containing the Kleene closures, e.g. the view Mary : closefriend* might have a much higher cost but smaller size than view Mary: . Bahat and Makowski showed a policy C for the uniform sizes and arbitrary costs case: Evict v if i = argmin(/3c1: v in cache or request) Unlike the paging and weighted caching problem, C assumes optional evic tion. Policies with mandatory eviction are subsets of the policies with optional eviction, therefore C is no worse than the policy C0 described below: Evict v if i = argmin(f31c: v in cache) Bahat also discussed the problem of finding optimal policies with multi-criteria aspects, specifically, they studied the problem of minimizing an average cost under 10 a single constraint in terms of another long-run average metric such as the miss rate. However, the arbitrary sizes, arbitrary costs case requires the total size of the chosen views cannot exceed a certain bound at any time, which is a stronger constraint than the long run constraint. Thus using the average cache size as the long-run average metric doesn’t solve the problem. Seeing C, it is tempting to think that the following offline algorithm is optimal for deterministic weighted caching: Algorithm A At time t, evict the view with smallest (e — r1) /distj(t), where dist1(t) is the forward distance of view v at time t. However we can prove that it is not the case. Consider the following instance: Suppose there are two algorithms A and A’, and they behave the same from time I to time t — 1, so the cache content for A and A’ at the beginning of time t are the same, denoted as S. At time t a fault occurs by request q = v,. Suppose in S, vl is the view with smallest (ei — r1 ) /disti (t) in St U q1. Algorithm A evicts vl and A’ evicts some other view 1)2, and from now on the two algorithms behave the same and there’s no fault except at some time 1)2 or v occurs again. Then at time t the information regarding v and v2 is in Table 1.1 view size vj e — r, dist(t) 1 30 30 1)2 1 20 15 Table 1.1: An instance where Algorithm A is not optimal In this query sequence instance, v1 and v2 are requested just once from time t on at distances 30 and 15 respectively. So A and A’ will both fault once but the cost for A is 30 and for A’ is just 20, thus showing that the algorithm evicting the view with smallest e —r1)/dist(t) is not offline optimal. In this thesis we address the problem one step further: to assume both the sizes and costs are arbitrary. We make the following contributions: 1. We will reveal the intrinsic relationship between the static view selection problem and the caching problem. For any replacement policy (random or deterministic) under the independent reference model, we show that the opti mal long-run average metric cannot be better than that of the static selection 11 algorithm. We systematically study the problems and present the essence of it, which is not found in any previous works to the best of the author’s knowledge. 2, We then present a history dependent replacement policy that is optimal in terms of the long-run average metric for objects with arbitrary sizes and costs and show that policy Ao and Co are special cases of this policy. We also show that a policy is optimal if and only if the induced stochastic process chain is a unichain. 3. Because the intrinsic hardness of the problem when we introduce the arbi trary sizes, this optimal policy runs in non-polynomial time. Then we bring up K-approximate static algorithms and replacement policies with regard to the fractional optimum solution and show that K is the best approximation ratio for both any replacement policies and static selection algorithms. Here K is the size of the cache. The algorithm is also HK’-approximate to the inte gral optimum solution, where K’ is the total size of all objects minus K and HK’ is the harmonic number of K’. The thesis is organized as follows: In Chapter 2, we will discuss the static se lection problem and the relation between cost and benefit. Then we will present an Markov Decision Process (MDP) framework for the caching problem and discuss the relationships between the two problems. We show that the optimal long-run average metric for any replacement policy (random or deterministic) under the in dependent reference model cannot be better than that of the static selection algo rithm. This reveals the essence of the problem, which was not found in any previ ous works to the best of the author’s knowledge. Then we present optimal policies for the arbitrary sizes, arbitrary costs case in Chapter 2 and show that policy A0 and C are special cases of it. In the next chapter, we will present our approximate static selection algorithms and policies with an optimal approximation ratio K with regard to the fractional optimal value and HKI with regard to the integral optimal value where K’ is E1 lvii —K. 12 Chapter 2 Optimal Algorithms Under IRM In Chapter 1 we have briefly introduced the static selection problem and the caching problem and browsed their similarities and differences. In this chapter we will fur ther discuss the relationships of the two problems under the IRM model. In Section 2.1 we firstly look at the static selection problem and show that min imizing the expected cost is equal to maximizing the benefit of the cached objects no matter whether there exists dependence relation among the views or not. In this section we will examine the problemwithout considering the dependence relation first since it provides a basis for the more complex case where dependence relations do exist. Let’s call this restricted form of problem the naive view selection problem. We show that the naive static selection problem under IRM is a restricted Integer Covering problem and also the complement of the Knapsack Problem, then we review the well-known dynamic programming solution that can achieve optimum. In Section 2.2 we will present optimal replacement policies for the caching problem. A solution can be seen as a sequential decision making process and hence can be modeled by a Markov Decision Process (MDP). However, the MDP de scribed in [71 assumed uniform sizes, so any actions at any time evicts at most one view. We use a slightly different MDP to incorporate the arbitrary sizes, namely, to allow the system to admit or evict multiple objects. We will describe the MDP framework first, then show that the optimal long-run average reward for any re placement policy (random or deterministic) under IRM cannot be greater than the optimal benefit Bm the dynamic programming algorithm produces for the naive 13 view static selection problem. Later on we will show that this optimal value can indeed be achieved by some policies inducing a unichain with one closed set and some transient states. We also show that only such policies are average optimal, policies inducing irreducible chains or multichains are both non-optimal. Hence we will focus on policies inducing unichains and present one such policy, D0. In addition, we show that policy A0 and Co are both special cases of D0. However D0 is not a demand policy thus it’s not suitable for web proxies, so we present another version of the policy Db that is a demand policy. D is no worse than D0. 2.1 The Static Selection Problem In Chapter 1 we introduced the Static View Selection or Eviction Problem. For convenience, we state this problem here again: Definition (The Static View Selection or Eviction Problem) Given a set V of N independent views v , v2,... , v,, where each view v has the following specified functions • size vjl; • evaluation cost e; • retrieval cost if materialized r; • probability estimation J3( subject to EI P1 = I ) find a subset S of V to materialize, such that the average time taken to evaluate a view is minimized, with the constraint that the total size of S is no greater than an integer constant K. The size v evaluation cost e and probability estimation J3 of a view v1 are usually obtained from the query frequencies observed in a running system. This is essentially assuming that future queries will follow some fixed and static distri bution, which is analogous to the independent reference model. For some applica tions this model is reasonable, for example, the OLAP queries in data warehouses usually exhibit some stable patterns. Let the random variable C be the cost of 14 answering a view request (or equivalently, a query) with cache content S. Our ob jective is to minimize the expected query answering time E(C) for a query. To do this, we have the following important observation: E(Cs) = J31r+ f3e (2.1) V1ES v1S Our objective is to minimize E(Cs) in the above Equation subject to the condi tion that sumVES vu <K. For most of the cases, the evaluation cost e is larger than the retrieval cost if r is materialized. This is a restricted integer covering problem. An integer covering problem is defined as follows: Definition Integer cove ring problem Given an n by m matrix A, n dimensional vector b, and m dimensional vector c, the problem seeks to minimize cTx subject to Ax> b over the non-negative integer vectors of length m. We say that our problem is a restricted integer covering problem because the matrix A in our case is just a vector. Hence we would expect to get simpler and better solutions than the full version of the problem when A is a matrix. We will ex amine the optimal solution in this chapter and approximate solutions in Chapter 3. The equation in Equation 2.1 can be written as: N E(Cs) = J3e— (ej—rj)I3j (2.2) 1=1 viES Because N and e1 are finite, the first half of Equation 2.2 e1 can be deemed as a constant a which we call the base evaluation cost. It represents the evaluation cost for a query when nothing is materialized. We use b1 to denote the benefit by caching v, b1 = e — r. Then the expected benefit of S, denoted as E (Bs), represents the expected cost saved by caching the views in 5, where B is the ran dom variable representing the benefit of answering a query using cache content S. E(B)= (e1—r)J3u V1ES v1ES Then Equation 2.2 can be written as: 15 E(Cs) = a — E(Bs) (2.3) Thus the problem of choosing a subset of views to minimize the expected cost can be transformed to the problem to maximize the expected benefit, while the total size constraint is satisfied. If we use c to denote (e — rj)J3 the problem can be mathematically described as below: maximize c1 (2.4) v ES subjectto lvii K v1ES As we can see, this problem is exactly the Knapsack problem. It can be solved in pseudo-polynomial time using dynamic programming. Define a recursive func tion, B(i,j) to be the maximum benefit that can be attained with the total size less than or equal toj using items up to i. B(i,j) is defined as follows: B(O,j) = 0 B(i,0) = 0 B(i,j) = B(i— l,j) if lvii > J B(i,j) max{B(i— l,j),c+B(i— l,j—ivt)} if lvii j The solution of optimal benefit can then be found by calculating B(N, K). To do this efficiently we maintain a vector of size K to store previous computations. This solution will therefore run in O(NK) time. The different combinations of views whose total size doesn’t exceed K is at most 2”, therefore the total different value of j is 2”. If N is small enough to make 2”’ <K, it would require 2” space instead of K. So the space complexity is min{2”,K}. The value of K, to an extreme, is defined in terms of bytes so that it is an integer. But if using byte as the unit, it would be too slow to conduct the computation since the time complexity is in proportion to K. The unit cannot be too big either. For example, If we use 10MB as the unit but most of the view sizes are not multiples 16 of 10MB, then it would be inaccurate to use the above algorithm. Choosing the unit of the size should be done carefully, but in this thesis we will not put too much effort in it. Now suppose we have got this maximum expected benefit Bm under the cache capacity constraint, B, = sup E(Bs) = sup (e1 — (2.5) S S VIES Then the minimum expected cost Cmjn is just the complement of Bm< to the base evaluation cost a: Cmjn = a — We call the cache content whose expected benefit is Bm the optimal cache content SOPT. 2.2 Finding The Optimal Policies For The Caching Problem We design replacement policies so that the system knows how to make good deci sions in the run. It can be seen as a sequential decision making and hence can be modeled by Markov Decision Process (MDP). However, the MDP described in [7] assumed uniform sizes, so the decision at any time evicts at most one view. We use a slightly different MDP to incorporate the arbitrary sizes. 2.2.1 The MDP Framework The query answering system maintains a dedicated piece of disk space which we call the cache to hold the materialized views. We assume the Independent Refer ence Model, where the queries for views form an infinite sequence with a static distribution. When a query arrives, the system would first check the cache. If the view is materialized, then its content is retrieved from the cache, and this is called a hit; otherwise it will be evaluated from scratch, which we call a fault. When a fault occurs, we must decide if this view is worth caching and some views in the cache 17 need to be evicted, and if yes, which ones. Note that we assume non-mandatory admissions, i.e., when a fault occurs, the requested view may not necessarily be admitted into the cache. We will use the same notations as in the definition of The Caching Problem. For convenience we restate it below, but use the terms for views instead of files: Definition (The View Caching Problem) Given a set V of N views v1 , v2,. . . , VN, where each view v has the following specified values • size IvI; • evaluation cost from scratch e; • retrieval cost from the cache r; • probability of occurrence f3 (subject to f3 = 1) and an infinite sequence of view requests, maintain the cache to satisfy the requests while minimizing the total (or average) query answering cost. We assume the number of views N, the evaluation cost from scratch e, and the retrieval cost from the cache r1 are all finite. e, is usually larger than r. Decision Epochs Decision epochs are defined as the instants when requests are posed. We use T to denote the set of decision epochs. In our problem we assume infinite requests as T={1,2,...} Elements of T will be denoted by t and usually referred to as “time t”. State Sets At each decision epoch, the system occupies a state. In our work we define the state to be composed of the cache content information and the request. We use S to denote the cache content at time t. St is a subset of V with the constraint that its total size cannot exceed K: vu K. All the possible cache 18 contents form a set S, which is the state space of all cache contents satisfying the capacity constraint. We use Qt to denote the request at time t. Request space Q is made of all views V, so Q = V. The MDP’s state O at time t is defined by the pair (Se, Q). We call the set of possible system states the state space and denote it e. The state space e is then given by: O={SxV} The number of elements in S, denoted as IS I is at most 2”, and the number of elements in V is N, so the total number of states iei is at most 2NN. Though it is exponential in N, it is still finite. So our MDP is a finite state MDP. The MDP framework we proposed consists of SIN states, and these states are grouped into S sets. We call such a set a group defined as follows: Definition (Group) A group of states is a set of states sharing the same cache content. From the definition of the MDP we know that each group contains N states corresponding to the N different view requests. Let G denote the set of all states 0 with cache content s: G = {0 : 0 = (s,v),i = 1,2,... ,N} Action Sets When the request Qt (Q, is just one view, not a set) is made, the content of the cache is S. An action has to be made to decide the system behavior: whether to admit this view and evict some other views. We classify the situation into several cases: 1. Qt is materialized (which implies e1 r1); 2. Qt causes a fault and there is enough space to admit Qt 19 3. Q causes a fault and there is not enough space to admit it. When case 1 happens, the request would be answered by retrieving the materi alized view from the disk cache. When case 2 happens, all caching policies would just admit Q(= v) if e, — r, > 0. This strategy is always better than the one not to admit it, assuming the cost to evict the views is negligible. It is the different deci sions made on case 3 that differentiate the policies. So when mentioning “actions”, we are mainly talking about the actions taken on case 3. For case 3, the system may decide to admit and evict some views. We denote the admitted view set at time t X and the evicted view set Y. Unlike [7], both X and 1 may contain multiple items including Q in our work. If at any time, X1 has only one view Qt, the policy is called a demanding policy [13], meaning that a view is only cached when requested. When X is , it means that the request is not admitted, and the Y should be as well since we don’t evict views when no view is to be admitted. X and Y cannot contain the same views at the same time. Thus given the state € = (St,Qt) the resulting state can be written as = (S+i,Q+i) (2.6) if Q E St = (St+Qt,Qt+i), if QtSt,IQtIK—IS1 I if QtS,,IQ,I>K—IStI We denote the actions available at time t A which is composed of X and Y1. The action space for state 0 e G is denoted as A0 and represents all possible actions for state 0. The action space for all states A is the superset of all A8 for all states 0 in the system. In our MDP system X or Y can be any combinations of the elements in V with the constraint that their sizes cannot exceed K, so the number of elements in any of them cannot exceed 2N Thus the number of elements in A8 for any state 0 is at most 2’’2” = 2N which is also finite. 20 Rewards and Transition Probabilities For any MDP system, as a result of choosing action a A0 in state 0 e 0 at decision epoch t, • The system state at the next epoch is determined by a probability distribution • The decision maker receives a reward w(0,a) or a costc1(O,a) We have defined the notion of group in Section 2.2.1. It is easily seen that for any state, when an action is made, the system would transferto the N states in some (not necessarily a different one) group with the probability distribution () = 13i ,132,.. ., 13N defined in the Independent Reference Model respectively. So the transition probability from state 0 to state O for action a at time t is Pt(Oj I 0,a) = f3 where the state 0 = (S,+1 , v) is composed of cache content Si and request v. Suppose at time t the system is in state 0 and action a is taken, and 0 = (S,, QI), we define the one step rewardw1(G,a) and one step costc1(0,a) to be: I e1—r if Q1 ES,,Q1 =v1 w1(0 a) = (2.7) 1 0 ifQ,C1 r1 if Q1 S1, Q, = c1(0,a) = ç (2.8) e ifQ1.C, We can see that the reward or cost is only relevant to the state but not the action, then we can omit the parameter a and write them as w1(0) andc1(0) The request Q, either causes a hit or a fault. If it is a hit, we say that the system earns a reward amounting to e’ — r, or 0 if it is a fault. Alternatively we could say that a hit causes the system a cost r, and a fault causes a cost e1. Notice that our definition of the one step reward here is just the benefit b, of view v1 we defined in Section 2.1. 21 Decision Rules Decision rules range in generality from deterministic Markovian to randomized history dependent, depending on how they incorporate past information and how they select actions. A deterministic Markovian rule is a function d1 : 0 —* A0, which specifies the action choice when the system occupies state 0 at decision epoch t. For each 0 e,d1(o) E A0. It is said to be Markovian (memoryless) because the decision only depends on the current state 8, and deterministic because it chooses an action with certainty. Furthermore, if the decision does not depend on time t, then the rule is called stationary. We call a deterministic decision rule history dependent if it depends on the past history of the system. That is, d, is a function of the history h1 = (Oi ,a1,. . . , 0_ i,a1_, 0,), where 0, and a denote the state and action at time u. Since we only consider deterministic algorithms in this thesis, we will only introduce the definitions of deterministic policies. We use the following notations to denote the variant set of policies: • JJHD history dependent and deterministic • 11MD Markovian and deterministic • flSD Stationary Markovian and deterministic The relationship among them are: 11SD ç HMD ç flHD 2.2.2 The Objective Value of a Policy The MDP we propose is discrete with infinite-horizon. Let policy r = {d1 ,d2,. . } denote a randomized history-dependent policy. For each t, ci : H, —+ A, where H, = 0 x A x 0 x x 0 is the set of all histories of states and actions prior to decision epoch t, For each history realized up to time T, there corresponds a sequence of rewards { r1(0l,al),r2(82,a2),.. . ,rT_1(8T.1,aT_I),rT(9T,aT)} 22 Thus the random variables of states and actions induces a bivariate discrete- time reward process defined as {O,,w,(9,,A,) : t = 1,2,.. .}. 9, represents the state of the system at time t, and r(0,,A,) represents the one-step reward received when using action A, in state 9,, and its value follows the result we got in Section “Rewards and Transition Probabilities”. When t 0, r(So,Ao) is denoted as r(So), representing the benefit obtained from the null action from a null state to the first state, and its value is the benefit made by Ro. In the view cache management problem, we are mostly interested in the fol lowing values objectives: 1. The expected total reward twa(s) of policy n E rIHR, or equivalently, the expected total cost tc’(s) twa(s) = limE{ w,(ø,,A,)} T tc(O) = urn E{c,(0,,A,)} T-.’.’ 1=1 In our MDP, the one step reward w,(@,,A,) and one step cost c,(@,,A1)is irrelevant to whatever the decisions made, so we can denote the one step reward as w,(0,) and one step cost as c,(0,). So the above formulas can be written as tw(O) = limE{w,(0,)} tc(O) = lim E{c,(0,)} t=1 2. The average reward of policy ir fluI?: ar2(O), or equivalently, the average cost ac(O) aw(G) = limE{w,(O)} (2.9) ac(O) = lim E{c,(O,)} (2.10) t=1 23 In the above equations, the parameter 0 is the starting state at time I. In many applications the expected total cost diverges for some or all policies, but for our problem, the limit exists and is equal to cc for all policies. Theorem 2.2.1 tc(8) = ccfor any policy r and any starting state 0 Proof Define tc(0) =E{r(@,At)} t= 1 and tc(0) =E{r(@1,A} t1 wherer(91,A,)=max{__wt(Ot,At),O}andr+(@t,At)__max{wt(Ot,Aj),O}. The theory of MDP [29] prescribes that if for any policy n, for each state 0 E €, either tc(0) or tc(0) is finite, then tc(0) = limM E{E1w(e1,A)} exists and satisfies tc(0) =tc(G) —tc(0). In our case, the cost at each step is either e1 or r1 which are positive values, so tc(0)=O and tc(0) = t=1 > r,where r1 is the request at time t t=1 = cc thus the limit converges and the value is tc(0)—tc(0)=cc—O=cc I Since the expected total reward is cc for all policies, it cannot distinguish the 24 policies. Therefore we will mainly focus on the average reward or the average cost. The average cost is in fact the expected cost to answer a query. For either the total criteria or the average criteria over infinite horizons, maxi mizing the benefit is equivalent to minimizing the cost. For simplicity we will use the average reward criterion in this section. 2.2.3 Comparing the Objective Values Both the problem of seeking optimal policies under the MDP framework and the problem of static selection assumes the same probability model — the Independent Reference Model. For either the total criteria or the average criteria over infinite horizons, we have known that maximizing the benefit is equivalent to minimizing the cost. As stated before we will examine only the expected average reward. Now let’s look at the objectives of the two problems: 1. The average reward objective for the dynamic policies Recall the definition of average reward aw (0) as follows 1 M aw(O)=lim E{w(ø)} t=O where 0 is the stating state, and € is the random variable representing the system state at time t. We want to seek a method for computing the best reward awO)T(O) = sup awC(O) 7CEI1HR and find a policy rcT such that = awOPT (0) 2. The expected benefit objective for the static selection solutions Recall the definition of the benefit a cache content S gets: E(B)=b1=(ej—r1)I3j vies vies 25 We want to find the best subset SOPT that produces the maximum benefit Bm such that Bm = SUp E(B) scV,sK E (Bsopr) = Bm where ISI is the total size of the views in S. The average reward aw’(6) of an MDP policy is the expected benefit of the gradually adapted states over the horizon for a “dynamic” policy 2t, while the ex pected benefit B that the static selection algorithm seeks to maximize is for a static cache content S to answer one query. The relation between the optimal average re ward tw(O) and Bmax is given in the following theorem Theorem 2.2.2 awOPT (0) Bfor any starting state 0. Proof From the definition of the average reward defined in Equation 2.9 we know M awO’T(0) = sup urn 7CEIIHR ° t=O = sup lirn JEHHRM M0 From (Equation 2.5) in Section 2.1 we know that B, is the greatest expected reward to answer one query over all possible cache content S under IRM, so we can assert that Bm > E{w1 (®) } for any starting state 0 and any policy r no matter what the distribution for the states @ at time t is. Thus we have aw(0) < lirn—Bmax t=o < B I 26 This theorem shows that under the Independent Reference Model, the average reward or even the total expected reward any policy can achieve cannot exceed that of the best view selection algorithm. The theorem also holds for the case where dependence relationship exist among the views. Our result is in contrast with the result in Dynamat [25) in which the authors argued that the dynamic view man agement system using Smallest Penalty First (SPF) policy is better than the static selection even when the queries form a uniform distribution. The SPF algorithm they used is similar to our Greedy algorithm introduced in Chapter 3 in which we will show it is K-competitive. The authors of Dynamat examined the results by experiments. They used the parameter Detailed Cost Savings Ratio (DCSR) as the criterion: DCSR = where b1 and e are similar to ours: 0, if q1 cannot be answered byS b1 e, if there is an exact match forq (2.11) e1 — e, if f from S is used to answer qj From its literal meaning the DCSR is the benefit in our case. It’s worth asking why their experiment results do not agree with our theoretical results. Some reasons might include: 1. Though they also tested the uniform distribution, they omitted the first 10% of queries. 2. They used the DCSR per view metric which is good for any dynamic algo rithm but bad for the static algorithm and it doesn’t guarantee the overall query answering cost. Static selection requires that all the information about the probability distribu tion, the view sizes, the evaluation costs and retrieval costs is known. It may look like a strong requirement, but the idea is applicable since while collecting the infor mation of probability distribution in the probation period, other information such as sizes, evaluation costs can also be collected. In fact, both doing a static selection 27 or seeking a “dynamic” policy would require such information. In seeking a “dy namic” policy, we would have to know all the states, which essentially requires the same information. This result suggests that if we assume the independent reference model and if the applications can actively pick the objects, which is just the case for view selection problem, we can just use the static selection algorithms to get the optimal average cost. Nonetheless, it is still valuable to study the dynamic policies although they are no better than the static selection under the Independent Reference Model. We believe the probability distribution is not fixed all time in practice and the mate rialized views should be managed as time goes by. In that case our knowledge in designing optimal policies under the Independent Reference Model will help us find good online policies. So in the next section we will be seeking optimal policies under IRM. 2.2.4 The Way to Calculate The Average Reward Recall that the average reward aw’ (0) for starting state 0 was defined as follows aw(0) =iE{Ewt(9t)} (2.12) In some MDP systems, aw(0) need not exist for some policies, i.e., it may diverge. Fortunately researchers have found that when the state space and action space are finite or countable, the limit always exists under some stationary ran domized policy r [30]. The question now is: How large will this limit be? In Theorem 2.2.2 we have seen that the average reward for any policy and any start ing state is no better than the optimal expected benefit Bm to answer one query using the static selection. Can the optimal average reward of the MDP policies be equal to Bm? If yes, what kind of policies can achieve the optimum? To know that we will have to be able to calculate the average reward for a specified policy in the first place. Only after that can we compare the policies and find out the optimal ones. However it is not easy to calculate the average reward for a specific policy r starting from the definition in Equation 2.12. A policy r induces an infinite 28 sequence co of states and actions when the horizon is infinite: co=(Ol,al,O2,a2,...) We call o the sample path. All possible sample paths form the space of sample paths c2: = S x Ax S x Ax Then the expectation of the reward over the horizon is: Al M E{Lwt(0t) = (w1G))Pr{co} (2.13) t=O O)E2 t=O where co = (0 ,A, ,@2A,...) is one sample path resulting in the states {e ,02, . . by the action taken at time t, A, and Pr{co} is the probability of the sample path Equation 2.13 does not provide an efficient procedure for computing such ex pectations and ignores the dynamic aspects of the problem. However if {0 : t T} is a Markov chain’, the expectation can be calculated by standard matrix methods in the MDP theory which we’re going to introduce. Suppose P is the one-step transition matrix of the induced Markov chain by this policy r, with pl) (O O) as its element at row i and column j representing the probability of one step state transfer from O to Q by the decision of policy ir on state O, O, O E 0. Let P be the Cesaro limit: P = lim (2.14) M—* M t=O where is the t step transition matrix. Doob [16] and Chung [12] proved that the limit in Equation 2.14 always exists though the limit limM p(M) sometimes doesn’t. In addition, if the states in the chain are finite, P must be stochastic, i.e., the value of the elements in each row sum to 1. Then by the MDP theory [29], the ‘Meaning the decision rules are Markovian, i.e., the action taken depends only on the current state 29 average reward aw(0) can be calculated using the following formula: awr(0) = limjEe{Ewr(Ot)} =Pw(0) (2.15) where w(0) is the one step reward for state 8 under the stationary policy r. The proof of this formula can be found in Proposition (8.1.1) in [29]. Since w,(0) is known according to the definition in Equation 2.7, the only work left to get the average reward for a policy r is to obtain P. P has different looks when the induced chain has different structures, namely, irreducible, multi- chain, and unichain, We will be examining them one by one shortly. This will not only let us understand the MDP system better, but also help us design the optimal policies. According to the Markov chain theory, a Markov chain can be classified into three categories: Irreducible,Unichain and Multichain. To understand these def initions, we need to introduce the notation of communicating class as defined in Markov chain theories: Definition Accessibilily A state O is said to be accessible from a different state 0 if, given that we are in state 01, there is a non-zero probability that at some time in the future, we will be in state 0. Formally, state O is accessible from state 0 if there exists an integer n 0 such that = OJI@O 9} > 0 Allowing n to be zero means that every state is defined to be accessible from itself. The expression can also be written as p(Oj0j) > 0 Definition Communicating Class A state 0 is said to communicate with state if it is true that both 0 is accessible from O and that 0 is accessible from 8. A set of states & is a communicating class if every pair of states in @ communicates with each other, and no state in 0 communicates with any state not in 9. It can be shown that communication in this sense is an equivalence relation. 30 Definition Closed Communicating Class/Closed Irreducible Set A closed class 0 is a communicating class with the probability of leaving the class being zero, namely that if 8 is in 0 but O is not, then 8 is not accessible from 01. Definition Transient State A state 0 is said to be transient if, given that we start in state 0, there is a non-zero probability that we will never return to 0. So we can classify the Markov chains as follows: Irreducible A Markov chain is said to be irreducible if its state space is a commu nicating class; this means that, in an irreducible Markov chain, it is possible to get to any state from any state. Unichain A Markov chain is said to be unichain if its state space is made up by one closed communicating class and some transient states. Multichain A Markov chain is said to be multichain if its state space is made up by multiple closed communicating class and (possibly) some transient states. One thing we need to mention is that not all MDP systems have policies induc ing all three kinds of chains. It depends on the MDP’s classification concerning the ergodic structure. But our. MDP, being a general (or multichain) one, has policies that can induce all the chains defined above. Before examining the limiting matrix for all three kinds of chains, we need to introduce some knowledge in Markov chains and linear algebra that will be used later. We can partition a chain into disjoint closed communicating classes Di, 1 = 1,2,. .. , L, L iei, and possibly a set of transient states T. After relabeling the states if necessary, we can express any transition matrix P as below if the policy r induces disjoint closed irreducible ets D1,l = 1,2,... ,L and possibly a set of transient states T. P10 O 0P2. (2.16) 0 FL 0 Qi Q2 QL QL+1 31 where P1 correspond to transitions between states in D1 under policy ir, Qi to tran sitions from states in T to D1, and QL+1 to transitions between states within T. We call this form the canonicalform of the transition matrix. The limiting matrix was proved to be in the following form [29]: P 00.. 0 0 P 0 . 0 (2.17) * * * * 1 2 L L-I-1 where P,* is the limiting matrix of Pj. Because every two states in a closed commu nicating class are commutable, the elements in F must be positive. And since P is stochastic, every Pj must be stochastic. Another fact we’d like to introduce for our MDP system is that the states within one group cannot belong to different disjoint closed communicating classes under any policy. This is because at any state 0 = (s,v1) when an action is taken, the system would transfer to the N states of a group (not necessarily different from the current one) with probability flu to fiN. It is impossible for state 0 to transfer to only part of a group. Hence if one state within a group belongs to a closed communicating class, then all the states in this group must also belong to the same closed communicating class. Now we look at the three kinds of Markov chains one by one: Irreducible Suppose a stationary policy r induces an irreducible Markov chain. Because the chain is just one communicating class, the Cesero limit matrix P for such a chain is stochastic and has the following property: fl(s1s) >0 The chain consists of ISI groups. Now order the states in the transition matrix first by groups, then in each group by the requests of the states in that group from 32 vl to VN. For each group, the limiting distribution for the N states in it for any starting state must be linearly dependent with the vector (13i ,132, • 13 since no matter what decisions are made at time t, the system would always transfer to some group (not necessarily different from the current one) with probability J3 to 13N. So the row of the limiting matrix P which representing the transition probability from state 0 to all the states is in the following form: [(ai/3i,a1132, . . . ,afl/3N, ), (aJ3i ,a2I32, . . . ,aj2I3N,), . . . , (ai13i ,a115132, . . . )j (2.18) subject to IS’ a = l,alk >0 k=1 where alk is the coefficient for row i and the group. The one-step reward w (0, a) is independent of to time t and action a in our system so we simply denote it to be w(0). Let’s also order the states 0 e € in the vector w(0) first by the groups then by the requests from vl to vN, just like what we did in the limiting matrix P’, so that the resulting vector is in the form: [(w(01),w(02),.. . ,w(6,)), (w(ON+1),w(ON+2),... ,w(ON+N))),..., (w(0(J5l_1)XN+1), w(0(5_1)N+2),.. . ,w(0)] (2.19) where w(0(k_1) xN+j) means the reward for the JIh state in the kth group. We will call this form of the one step rewards vector the canonicalform in the future. So the average reward for starting state O can be calculated as aw(O) = P2w(0) k=N,j= IS I *1 — P (k—1)xN+j 1W (kl)XN+j k=1,j=1 151 N = (a1kPJw(0(k1)XN+f)) k=1 j=1 33 ‘SI = EaikBsk k= 1 where Sk is the cache content of group k and Bsk is the expected benefit for cache content Sk. Because ak > 0 and there must be some Sk such that Bsk <B, so 5 aw(Oe) = a1/B <B, k= 1 So the policy r inducing an irreducible Markov chain must have an average reward smaller than Bm. Unichain If a policy induces a unichain P, i.e., the chain consists of one closed communi cating class D and a (possibly empty) set of transient states T, then P has equal rows and p(sIsi) = { qj Si E D (2.20)0 SJET where q satisfies qP = P subject to LSJED qj = 1 This result was known in MDP theory and can be found in [29]. Note that the closed communicating class may contain more than one group, but certainly not all the groups, otherwise it would become the Irreducible case. Now the problem remaining is to ascertain qj’s val ues. We first write the one step transition matrix into the canonical form as in Equation 2.16: 0 \\Q1 Q2 where PD corresponds to transitions between states in the closed irreducible set D under policy r, Qi to transitions from states in the transitive set T to D, and Q2 to transitions between states within T. We know that the limiting matrix P must be the following form as shown in the Equation 2.17: 0 7r ‘\Q Q* 34 Since we know that P must also be in the form of Equation 2.20, we shall get the result that • P has same rows and the j’ element in the row is qj • The rows in Q are identical to the rows in P • Q=0 So in order to determine qj,sj D, it’s enough to find out what P1 is. Suppose the closed communicating class D contains b groups. Then the rows in P must take a similar form as in Equation 2.18 for the same reason: [(alfll,a1132,...,alflN,),(a2131,a2P ,...,a2fN,),.. (abIl b12 ... abP )] subject to b a=l,a)’O k= I where a is the coefficient for the kth group in D. As in the last section, we write the one step reward vector in its canonical form: [(w(Ol),w(62),. ..,w(ON)),(w(ON+1),w(ON+2),..., ,(w(O(b_1)xN+1),w(O(b_UxN+2),.. . So the average reward for any starting state 0 in the system has the same value and can be calculated as aw(0) = P,w(0) k=b,j=N = p(O(kl)XN+JI0i)w(0(k_1)XN+f) k=1,j=1 b N = E(ak k=1 j=1 b = akBsk k= I 35 Specifically, if D only contains one group G where S is a specific cache con tent, then al = 1 and the rows in the limiting matrix would be: [131,132,... and the canonical one step reward vector would be: [W(O1),w(02),. . . so that awO)—_rBs for every 0 e e. This shows that the resulting average reward is only relevant to the groups in the closed communicating class D. If D only contains one group Gsopr where5OPT is the cache content that has the best benefit Bmax, then the average reward for every starting state 0 can achieve optimum. Multichain Suppose policy r partitions the induced chain into disjoint closed communicating classes D1,l = 1,2,... ,L,L and possibly a set of transient states T. Write the one step transition matrix and one step reward vector in their canonical forms as in Equation 2.16 and Equation 2.19, then the average rewards for the starting state 0 inD1,le [l,L} must be aBs, Gsk ED1 and smaller than Bm. Only in a closed communicating class D1 = GSOPT can the average reward be Bmax. We don’t even need to look at the limiting distribution for the transient states to conclude that a multichain cannot guarantee that every starting state obtains the optimal average reward Bm. So far we have discussed all three kinds of chains induced by a stationary policy r. The discussion lead to the following theorem: 36 Theorem 2.2.3 The optimal average reward awOPT (0) a stationary policy can achieve is Bmax for any starting state 0, and it is obtained only by a policy inducing a unichain whose only closed communicating class is made of the optimal groups. A group is optimal if its N states share the cache content SOPT. The proof follows naturally from the discussion in Section 2.2.4 Note that we only discussed the average reward. The same result applies to the average cost. 2.2.5 Finding The Optimal Policies Like any other finite state MDPs, the optimal stationary replacement policy can be attained by value iteration (dynamic programming) as in Figure 2.1. 1. Define ar(°)(0j) = w(01), set the iteration index n = 0 2. For each 01 E €, computear1)(0j) by ar’ (Oi) = maxaEA {w(01,a) + EoEe p(6j0j,a)ar(1) (0j)} 3. If ar(’’)(01)= ar()(0j), then go to step 4. Otherwise increment n by 1 and return to step 2. 4. For each O c3, choose the decision d(01) = argmaxaEA{w(Oi,a)+ojEe} Figure 2.1: The Value Iteration Algorithm The value iteration process starts from n = 1 for all states, then n = 2,3,... until the values don’t change any more. The policy that satisfies this equation is the optimal stationary policy, and from Section 2.2.4 we know it must induce a unichain Markov chain. When the execution of the value iteration algorithm is done, it would find out the optimal actions in each state, so that the system knows which action it should take when seeing various states in processing the query sequence. However, the value iteration algorithm requires the information about all states 37 before it can answer the query sequence. This is in contrast to policy A0,B0 and Co in which the optimal decisions can be made solely on the information of the views in cache and the requested one. This is because that when the sizes are uniform, stack algorithms like Ao, Bo and Co can progressively converge to the optimum cache content by sorting the views in terms of the benefit density P(e — ri)/Ivjl and evicting the smallest one when lvii is uniform. Coffman and Denning [13] gave the definition of stack algorithms as follows: Definition Stack Algorithms An replacement policy A is called a stack algorithm if its cache content satisfies the inclusion property: For any query sequence W = {Qi, Q2, . . .} and any cache size k S(k,) cS1(k+ 1,w) where S, (k, w) is the cache content for a cache with capacity k and query sequence W at time t. In the case that I v is uniform, the stack for the replacement policies are the sorted order of views (in descending order) on the benefit density f3(e — rj)/ivii where li’jI 1 for all i. By evicting the views with smallest benefit density, the cache would gradually converge to the first K views, which is exactly the optimal cache content when the sizes are uniform. The action to evict the view with small est f3(e — rj)/ivji when vj = I is the optimal decision for every state, since the process of convergence is monotone, i.e., if a view is among the first K on the stack, then once it is requested, it would be admitted and never be evicted again. However when the sizes lvii are arbitrary the case is different. As we have seen in Section 2.2.4 the optimal policies would converge to group Gsor, but the optimal algorithms for this case doesn’t have the inclusion property. As a simple example, the cache content {v , V2} is optimal for cache capacity k, but the optimal cache content for cache capacity k+ 1 might be a totally different set {v3, v4}. So it is impossible to get the optimal policies using the principles of stacks, i.e., greedily evict the view with smallest density. To obtain the optimal policies, one has to use dynamic programming. Now let’s analyze the complexities for the value iteration algorithm. In each 38 iteration, the algorithm needs to calculate the objective value for all states and all actions. The number of states in the system is IS IN, where SI is 0(2”). The number of possible actions for a state 0 is A0, which is also SI = o(2N) since a state can transfer to any other states in the system by our MDP definition. So the total time complexity is: IxISINxISI=0(I.22” ) where I is the number of iterations. The space complexity of value iteration is ISIN and 0(2”)N. The value iteration algorithm, though it can obtain the optimal stationary pol icy that has maximum average reward, it doesn’t help us thoroughly understand what the decisions are, and it is excessively expensive. Note that all these compu tations have to be done before answering the queries and this would make the value iteration algorithm not very meaningful. Suppose that all the information is known prior to the real queries, then why wouldn’t we just calculate what the optimal cache content is and make the system jump into the states with that cache content? If this is true then we’re essentially dealing with the static selection problem. However we know that designing dynamic polices are still worth studying. To design a dynamic policy, we can integrate the calculation in the process of query answering. Now we will introduce a policy called D0 in Figure 2.2. This policy, like the dynamic programming algorithm for the Knapsack problem, keeps a vector B( (j) of size K at any time to hold the best selection so far in iteration n. The algorithm reaches the optimal cache content when all the N views are met, and from that point on the system remains with the optimal static cache content. The time taken to reach that point is 1/f3 and is finite. Though this policy needs to remember the views seen in the past and hence is a history dependent policy, it doesn’t incur more time or space complexity. The space consumption is K and with a little modification it can become min{2”,K}. For each iteration of a new view, the algorithm incurs a cost of 0(K), so the total time complexity is 0(Nmin{2”’,K}), much smaller than the value iteration. The average reward of policy Do is also Bm. In fact the stochastic process induced by this policy has the same limiting matrix P as in Equation 2,20, since 39 1. Set B(°)(j)=Oforeveryj [l,K1 2. If a new view is requested, set n = n + 1, update for each j B()(j) = B(’)(j) if lvii > J B()(j) = max{B(’1)(j),cj +B(”U(j — IviD} if vd j 3. If B(’)(K) #B’1)(K) Update the cache content to be B) (K 4. Loop back to step 2 when seeing the next request. Figure 2.2: The D0 Algorithm the system would remain in the optimal groups whenever it reaches them, making the limiting probability of the transient states 0. This policy, however, has a drawback that it is not a demand policy and may load views that are not the view request. These views might be evicted without being referenced when another new view is requested. Therefore we may want to use the demand version of the policy. 1. SetB(°)(j)=Oforeveryje [l,K] 2. If a new view is requested, set n = n + 1, update for each j B()(j) = B(1)(j) if lvii > J B()(j) = max{B(’’)(j),cj +B’(j — ivi)} if lvii j 3. If BO)(K) $ The current cache content St If the current request Qt e B( (K) — While the space is not enough Evict the view with smallestf3e(e — rj)/ivil from S —B(K) 4. Loop back to step 2 when seeing the next request. Figure 2.3: The D Algorithm The D policy in Figure 2.3 remembers the state of Do but defers the admis 40 sions and evictions of views until they are requested. As proved in [13], the demand version of a policy is no worse than itself in the total reward criteria. Furthermore, it has the same optimal average reward as in Do. However the convergence rate is slower than D0: the time to reach the optimal cache content is bounded to be 1/fly, which is at most two times of that of D0. 41 Chapter 3 Approximation Algorithms In Chapter 2 we discussed the optimal static selection algorithm and “dynamic” policies. We have seen that when both the view sizes and costs are arbitrary, the problem is a restricted Integer Covering Problem and is NP-complete, so the time complexity to get the optimal solutions cannot be polynomial. Therefore it is nec essary to study the polynomial time approximate or competitive algorithms, and these become the main topic in this chapter. Knowing that the dynamic policies are dominated by the static selection un der IRM from the last chapter, we will start from the static selection problem in Section 3.1. We will first show an polynomial algorithm called “Greedy” and prove it to be an K-approximate algorithm with respect with the optimal fractional solution. Then we show that it is in fact the “best” approximate algorithm be cause K is the best approximation ratio for all polynomial algorithms. We also look back at the 2-approximate algorithm for knapsack and reveal that it is in fact also K-approximate for the view selection. After that we present a Hi-approximate algorithm with respect to the optimal integral solution called “Extended Greedy”, where K’ = Ei lvii — K. Sometimes the Extended Greedy algorithm can outper form both Greedy and knapsack’s 2-approximate algorithm, which will be shown in an example follows. At the end of the section we present a mediated algorithm that is superior to the above three algorithms. In Section 3.2 we will discuss polynomial competitive policies under IRM. We will present a polynomial time policy that is K-competitive to the fractional opti 42 mal average reward. The policy is analogous to the one used in Watchman [31], but in that paper the authors didn’t give the analysis of the competitive bound. The authors argued that their policy was optimal with some additional constraints, however the constraints in fact removed the hardness of the problem and degrades it into the uniform sizes case. In contrast, we prove the policy to be K-competitive and K is the lower bound of all polynomial policies. 3.1 Approximate Static Selection Algorithms In Section 2.1 we have seen that the static selection problem is actually a knapsack problem. The knapsack problem has a polynomial 2-approximation algorithm, and it was shown in [24] that a polynomial algorithm for the knapsack problem cannot achieve an approximation ratio smaller than 2. But here comes the distinction of maximizing the benefit and minimizing the cost. Does the approximation ratio on the benefit imply the same ratio on the cost? If it doesn’t give the same ratio, does it guarantee any ratio at all? In the view selection problem for OLAP queries where the dependence relationships among views form data cubes, it was proved by Karloff and Mihail [23] that the problem is basically inapproximable for general partial order relationships under IRM, even though the approximate algorithms to get competitive benefit ratios do exist [21J[32J. So we may be wondering if the same situation happens when there is no dependence. Fortunately we have found that this is not the case. In the following sections we will give a K approximate selection algorithm. 3.1.1 A Greedy Algorithm With Bound K We introduced the static selection problem in Section 2.1 and defined the objective value, the expected cost to answer one query, in a mathematical way in Expression (Equation 2.1). For convenience of finding the approximate algorithm, we now restate the problem using the integer programming formulas here: Definition (The Static View Selection: An integer Programming (JP) Formula tion) Given a set V of N independent views v v2,... , v,, where each view v has the following specified functions 43 • size • evaluation cost e1; • retrieval cost if materialized r1; • probability estimation p, (subject to E=i Pi = 1) The problem prescribes: Minimize (p1exj+ pjrj (1 — xi)) Subject to iVlX K’, 1=1 where K’ = lvii — K and K is the cache size x = 0,1; x1 = 0: v is cached; x, = 1: v is evicted The objective value1(px+p1r(l—xi)) in the above formulation can be simplified as: N N N (pex+pr11—x1)) =p1(e—r)x+p, 1l j=1 i=1 = where N is a constant (3.1) Thus the problem is simplified as the following if we use c to represent pj(ej — r1) for each view: N Minimize c1x, 1=1 Subject to N IviIxi K’, (3.2) 44 = 0,1 Note that our formulation here is different from that in Equation 2.3. In Equation 2.3 we represented the objective value as the benefit of the cache content taken off a constant a, while here the objective value is some cost added to some other con stant 6. Let F denote the objective value cx1 where S is the selected views by algorithm CJ, if it achieves approximation ratio a, which means Fq <aF* where F* is the optimal F, then the expected cost Ee(C) = 6 + F for algorithm cI must have the same ratio a because: E(C) = 6+F (3.3) a(6+T) < aE(C) where E* (C) is the optimal expected cost to answer one query. In order for the readers to understand the analysis of the approximate ratio, we will first introduce the linear programming relaxation and the fractional optimal solution of the problem. Linear Programming Relaxation and The Fractional Optimal Solution As we know, a standard approach for any integer program to get any insight into the structure of the problem and also to get a first estimate of the solution value is to relax the integrality constraint to a linear one. The linear programming relaxation (LP relaxation) of this problem can be defined as: Definition (The Static View Selection / Eviction Problem: An Linear Program ming (LP) Formulation) Minimize cx Subject to v1Ixj K’, (3.4) 45 0 I In this formulation x = 0 or x, = 1 has the same meaning as in the IP formula tion, but when 0 <x1 < 1 it means that only part of the view is chosen. The solution of the LP problem can be computed in a very simple way as the global optimum can be obtained by making a series of greedy (local optimal) choices. Sort the views in terms of in non-decreasing order. Without loss of generality, assume lvii — 1)21 — IVNI The greedy algorithm assumes all views are materialized initially, i.e., x1 = 0 for all views, then greedily evict the view with smallest in each iteration, by evicting we mean that these evicted items’ x were set to be 1. If evicting a view would cause the capacity constraint to be satisfied, let us say in iteration ç, then the execution is stopped and the residual space (K’ — vj I) that must be “filled” by the evicted views is then filled by an appropriate fractional part of view vç. vç is referred to as the split view. The optimal solution vector for the LP problem is defined in the following theorem: Theorem 3.1.1 An optimal solution vector =(4P,,... ,P) is defined as: = — (K’—EZIvI) — lVci 4” = 0,j=ç+1,...,N The corresponding solution value is Fpc1+(KI_3IVJI)cc The optimal solution for finding smallest r is just the complement of the op timal solution of finding the largest benefit caused by views not evicted, which is 46 the solution to a knapsack problem. The optimality proof of the knapsack problem was shown in [24], so we don’t repeat it here. The Greedy K-Approximate Algorithm Inspired by the optimal solution for the LP relaxation of the view selection/evic tion problem, we now present an algorithm for the IP version of the problem in Figure 3.1. 1. Sort views in non-increasing order of Assume the resulting order is marked by j, j = 1,2,. .. 2. Assume all views are cached, and evict the views ranked from 1 to ç Figure 3.1: Greedy Algorithm Theorem 3.1.2 The greedy algorithm in Figure 3.1, denoted as G, is K-approximate, i.e., EG(C) S KE(C) where E0(C) is the expected cost to answer a query by algorithm G and (C) is the fractional optimal expected cost. Proof c—i FQ = Cj+Cç (K’ - lvii) < c1+K cc 1=1 IVcI c-i (K’—Z v1) < K(c,+ cc) Vç = KFP Together with the result in Equation 3.3, we can conclude that EG(C) <KE(C) 47 I3.1.2 The Lower Bound of The Approximate Ratios We have showed that the greedy algorithm in Figure 3.1 has a competitive ratio K. K might be a very large number and we wonder if this bound can be improved. In order to do that, we need to first introduce the integrality gap defined in [331. Given a LP relaxation of any minimization problem, let zp (I) be the objec tive function value of an optimal solution to the LP-relaxation, and let z (I) be the objective function value of an optimal solution to the original IP problem, the integrality gap of the relaxation is defined to be z*(I) sup * I ZLP If the objective function value of the solution found by the algorithm is com pared directly with that of an optimal fractional solution, the best approximation factor we can hope to prove is the integrality gap of the relaxation. In our case the task is to examine the integrality gap on the expected cost to see if it is smaller than K. The result is given in the following theorem: Theorem 3.1.3 The integrality gap on the expected cost for the view selection/e viction problem, defined as E* (C) (I) E(C)(I) is at least K. Proof Consider the following instance: view size lvii cost c cj/Ivtl K K 1 V2 1 K+1 K+1 Table 3.1: An instance where the integrality gap is K In this instance the system only has two views and both of their sizes are no greater than the cache size K. Now we can calculate its optimal fractional solution ELP(C)* and optimal integral solution E*(C). 48 The optimal solution for the LP relaxation is to just evict part of v and the result is E(C) = The optimal solution for the original IP problem is to evict the whole vl and the result is E*(C) = K And the ratio of the optimal solution for the original IP problem to the optimal fractional solution for this instance is: E*(C) K E(C) — So the integrality gap sup1 EC) is at least K. • This theorem says that the best approximation ratio for the view selection/e viction problem is K. It means no polynomial algorithm can achieve a better ratio. The direct consequence of this theorem is the following corollary. Corollary 3.1.4 The Greedy algorithm in Figure 3.1 can achieve the best approx imate ratio for the view selection/eviction problem. Proof We have shown in Theorem 3.1.2 that the algorithm is K-approximate. From Theorem 3.1.3 we know K is the best ratio. Thus the corollary holds. • 3.1.3 The KnapSack Revisited Now let’s revisit the well known 2-approximate KnapSack solution in Figure 3.2. The solution using KnapSack’s 2-approximate algorithm also sorts the views in terms of c/ Iv I’ but unlike the Greedy Algorithm in Figure 3.1, it solves the prob lem by selecting the views to materialize, not to evict. Still, we use c to denote pj(ej — ri). Like Greedy, KnapSack’s 2-approximate algorithm also evicts all the views in {vl,v2,...,vç_1}. Based on that, it then picks the better of {vl,v2,...,vç_1} and {vç}. Since we know the Greedy algorithm which evicts all the views in 49 1. Sort views in non-increasing order of Assume the resulting order is marked by j,j = 1,2,... ,N 2. Greedily add views until we hit the split view Vç . JV > K) 3. Pickthebetterof{vi,v,...,vç_i}and{vç}. Figure 3.2: Knapsack’s Polynomial 2-Approximation Algorithm {vl , v, ..., vç}, KnapSack’s 2-approximate algorithm must be better than the Greedy. This means that KnapSack’s 2-approximate algorithm not only has the best ap proximation ratio K, but also achieves at least half of the optimal benefit and must outperform Greedy. This result is in contrast with the data cube case, where the ap proximate algorithm for the optimal benefit exists but the minimum expected cost is inapproximable. 3.1.4 An Extended Greedy Algorithm In Section 3.1.2 we have proved that the Greedy algorithm in Figure 3.1 is the best approximation algorithm in the sense that it achieves the best approximation ratio K. But that doesn’t mean it performs well under any circumstances. In this section we present another approximate algorithm, called Extended Greedy, that might outperform the Greedy algorithm in Figure 3.1 in some cases. We have known that the problem can be seen as evicting a subset of the views V conforming to Equation 2.1. Let’s call this subset S. The Extended Greedy algorithm iteratively picks the least cost-expensive view into until the size J is no less than K’, where K’ = EVIEV IviL Suppose the view picked in iteration j is denoted as v1 and the set selected by the end of iteration j is S , the cost- expensiveness for v is defined as follows: Definition (Cost-Expensiveness) The cost-expensiveness for v in iteration j is Cj — , where S i is the current subset to evict when iteration j — I ends. mrn{lvjI,K —jSj....ij} =.. The Extended Greedy algorithm is defined formally in Figure 3.3. 50 1. jrrO;S=4? 2. While <K’ j =j+1; Find the view v with the smallest cost-expensiveness, where j is the index of current iteration. (Break ties arbitrarily) — 3. Materialize the views in V — S. Figure 3.3: The Extended Greedy Algorithm In any iteration j, the selected view v covers the space of size min{lvjI,K’ — } with cost c, where j.i is the total size of the views in Let the cost per unit space, or cost density incurred by these views be: P1,P2,. . ,PK’. The following lemma holds: Lemma 3.1.5 where E* (C) is the optimal objective value—the minimum expected cost — for the IP view selection/eviction problem. Proof Let the subset to evict selected by the optimal IP solution be S*. In any iteration j when the Extended Greedy algorithm in Figure 3.3 is picking a view to evict, there must be some views belonging to S* that have not been picked yet, otherwise the extended greedy algorithm would have covered all the space K’ and terminated. Denote these views set O, and O = — j—1. Views in O can fill the left uncovered space at a cost YZoc, which is no greater than the optimum solution’s total cost S*. Suppose in iteration j, the selected view v covers the space indexed from j_’ + ito 3J1 = + Jvi. The cost densities incurred by it are denoted as +1’ P1i , .. . p-s. As they belong to one view, these values are the same. Without losing generality, suppose the order in which the views are chosen is the same as their original numbering, i.e., the first view chosen 51 ‘Ifr-1I+l PISj_1I+2 = = Cj min{tvji,K’— ij—’i} ci < The smallest , where v min{ivi,K’— I’l} LvEo Cj lvii E*(C) V,EO lvii E*(C) K’—ISj_ii E*(C) K’—iSj_iI—IvjI+l In a more straightforward way, we can write this result for iteration j into: E*(C) K’—ISiI E*(C) E*(C) pIj_lI+IvjI K’— S_’ — IvjI + 1 Now let’s start from the first view: Pi < E*(C) K’ P2 < E*(C) - K’-l PK’ E*(C) 52 The lemma is proved. • The lemma leads naturally to the following theorem: Theorem 3.1.6 The Extended Greedy algorithm is HK’-approximate with respect to the optimal IP solution, where HK’ is the harmonic number of K’. Proof Denote the cost achieved by the Extended Greedy algorithm EEG(C). Start ing from Lemma 3.1.5 we obtain: = j=1 J=1 .1 K’ E*C Zr> EEG(C)<K, .j=1 3 = EEG(C)<HKI I Note that the approximation ratio of the Extended Greedy HK’ is with respect to the solution of the IP problem. It doesn’t hold with respect to the solution of the LP relaxation because Lemma 3.1.5 doesn’t hold if E*(C) is changed to E(C). However this doesn’t guarantee that Extended Greedy is worse than Greedy. It can outperform both the Greedy and the Kanpsack’s 2-approximate algorithms in some cases. Here is an example of such an instance: Example In this example the system has four views and their features are shown in Table 3.2. Suppose the cache size is 45, the execution of the three algorithms are respec tively shown in table Table 3.3 3.1.5 A Mediated Greedy Algorithm In the last section we have seen that though the Greedy and the 2-approximate knapsack algorithm can achieve the best approximation ratio, the Extended Greedy 53 view cost c size i’d cj/lvd vl 20 40 0.5 V2 40 40 1 V3 20 10 2 V4 15 5 3 Table 3.2: An instance where Extended Greedy is the best Algorithm Evicted View Resulting Cost KS 2-app VI,V3,V4 55 Greedy Vi,V2 60 Extended Greedy vi, 1)3 40 Table 3.3: The result of the three algorithms can outperform them in some cases. Is there a way to find an algorithm that is su perior to all three of them? To find out the answer we need to analyze the behavior of the three algorithms. We first look at the Extended Greedy in Figure 3.3. Just like what we did in the Greedy algorithm and the 2-approximate knapsack algorithm, let’s sort the views in terms of the cost density 4LF in non-increasing order. Without losing generality, suppose the result views’ indexes are conforming to their original indexes, i.e., C1 C2 CN —<—<...<——lvii — 1)21 — 1VNI The split view is denoted as vç. The following lemma holds Theorem 3.1.7 The Extended Greedy algorithm must evict thefirst g — 1 views and their order of being chosen conforms to the order of views on cost-expensiveness. Proof We assume that ç > 1 because when ç = 1 the theorem is trivially true. Then we prove the theorem by strong mathematical induction on j, I <j < ç — 1, where j is the iteration index. The Basis Step We show that in the first iteration, v, who has the smallest density, must be chosen to be evicted. 54 In the first iteration the selected subset is S = 4?, thus the algorithm picks the view with smallest min{I,K’} Suppose the algorithm didn’t choose vl but chose some other view v. It means the algorithm chose some other view v other than vi and Ci Cl min{ivjl,K’} < There are two cases: 1. iv,iK’ 2. lvii > K’ Case 1 means It contradicts with the fact that vi has the smallest density. In case 2, the cost-expensiveness for v is c1/K’ and Ci Ci > Cl Cl K> lv — lvii — min{ivii,K’} where min{Jvj ,K’} is the cost-expensiveness of vi. The last step holds because we have assumed that vi is not the split view so lvi I K’. Thus it contradicts with the hypothesis that the algorithm chose v1 instead of Vi. So the algorithm must choose vi in the first iteration. The Induction Step Suppose in iteration j, 1 <j < ç — 1, the algorithm didn’t choose vj but chose some other view v. It means min{ivji,K’—ij_iI <min{ivI,K’—l_il (3.5) Because 1 <j < ç — 1, the size of view v must be no greater than the left uncovered space, i.e., lvjl JS—i . Thus Equation 3.5 can be simplified to Ci C - min{lvjl,K’—lSj_iI} vu Now there are two cases: I. lvii K’—lS_iJ 55 2. IvI>K’—S_iI Case 1 means Then v must have been chosen in previous iterations, which contradicts with the hypothesis In case 2, the cost-expensiveness for Vj is K’—i and the hypothesis says C K’—IS_iI vj — We have know that is the cost-expensiveness of v. Thus it contradicts with the hypothesis that the algorithm chose v instead of Vj. So the algorithm must choose Vj in iteration j, I j ç — 1. Now let’s look back at the Greedy algorithm and knapsack’s 2-approximate algorithm. We found that both algorithms evict the views from v to v_i,just like the Extended Greedy. The difference is how they deal with the views from vç to VN. Thus we can mediate the three algorithms so that the new algorithm not only achieves the best approximate ratio, namely, K-approximate on the expected cost and 2-approximate on the benefit, but also performs no worse than the Extended Greedy. In the following algorithm we use the denotation CQI) as the cost incurred by views in the set C1, and C() =v1Ec The algorithm is based on the Extended Greedy but it first identifies those views that must be evicted and only iteratively picks views in Vhjgh U Vç. Vhigh and V1 are defined in Figure 3.4. In addition, it keeps a bound of min{C(Vhjgh),cç} when choosing the views. If the cost incurred by the chosen views exceeds the bound, it stops and pick the better of Vhjgh or to evict. Thus the algorithm inherits all the approximation ratios. The run time complexity is still 0(N2) but we could expect the real running time be smaller than that of the Extended Greedy, because step 1 and 2 in the Mediated Greedy Algorithm can be computed using the partial quick sort because we don’t care about the ordering within V10. If the number of views in V1 is a large number, a lot of of operations would be saved. 56 1. Use sorting algorithms to identify the split view vç, the set of views whose cost-expensiveness are smaller than Vç, denoted as V10w, the set of those higher than Vç, denoted as Vhjgh, and the total size of Vi0, denoted as 2. During the sorting process, get the cost of Vhigh, denoted as C(Vhjgh), 3.S=4. 4. While ( S <K’ — Vj0,I and C() <min{C(Vhjgh),cç}) In Vhigh U {vç },find the view vj with the smallest cost-expensiveness. (Break ties arbitrarily) S=SU{v} 3. If C(S) min{C(Vhjgh),cc} Materialize the better of Vhjgh or Vç. Else Materialize the views in V — S - Figure 3.4: The Mediated Greedy Algorithm 3.2 Competitive Policies In Section 2.2 we discussed the optimal “dynamic” policies. When both the view sizes and costs are arbitrary, in order to get the optimal average reward, the running time before reaching the optimal states must be non-polynomial. In Section 3.1 we discussed the approximate static selection algorithms. In this section we will dis cuss polynomial competitive policies. We will present a polynomial time policy that can achieve K times the fractional optimal average reward under IRM. Con trast our result with the K-competitive algorithms to the integral average reward in [35] [14] and the so called optimality argument in LCB-K [28] and WATCHMAN[3 1], it shows that precisely predicting the probability information does help in improv ing the competitive bound, but cannot never be better than K times of the optimal fractional solution. 57 3.2.1 The K-Competitive Policy To obtain the optimal average reward, a policy must reach the optimal states within finite time t. The amortized running time for each action from time 1 to t must be non-polynomial, since if it is polynomial, we could have solved the Knapsack problem in polynomial time. When the number of views is very high, it would take too long to make a decision. Thus we introduce competitive policies defined below: Definition In an MDP system with the objective value function ov(U), where 2r is a policy and 0 is the starting state, if ov’(0) k. Uy0PT(0)+ holds for any starting state s, where is a constant and ov0PT(0) is the optimal objective value, we say policy r is k-competitive. The optimal average reward we’ve achieved in Chapter 2 — acOIT (0)— is actu ally an integral one, i.e., no views are chosen to be evicted or materialized partially. It can be easily inferred that if we allow the MDP system to admit or evict part of a view in its actions, the optimal average reward the policies can achieve is no worse than acOI°T(0) which is equal to Bm. B, is the integral maximum benefit the static selection can obtain. Following the convention in Section 3.1, we use ar?cT (0) to denote the fractional optimal average reward. Now we will present a K competitive policy in terms of the average reward. Suppose at time t, the system is in state @ (S,_i,Q) where S,_1 is the cache content at the beginning of time t and Q is the new request. Like in Section 2.2, we only consider the situation of faults. Whenever a fault occurs on request Q1 at time t and there’s not enough room for it, While the request Qt doesn’t fit into the cache Evict the view with the smallest from Se_i U {Qt} Figure 3.5: The K-competitive Policy 58 Theorem 3.2.1 The policy in Figure 3.5, denoted as ir, is K-competitive to the optimalfractional solution of the static selection problem, i.e., ac(O) K.ET (C)for all starting state 0 Where K is the size of the cache, and ET (C) is the expected costfor the optimal fractional solution to the static view selection problem. Proof Without losing generality, suppose the views’ indexes satisfy the following sequence: lvii — 1V21 — — IvNI The split view is denoted as vç. Under policy r, the states converge to the group G whose N states share the same cache content S = ({vc+1 vc+2,.. . , vN} for any starting state within finite time. At the first time the system reaches one state in this group, the system would remain in the group because the views in the cache (S) will never be evicted again according to the algorithm definition. So this policy is a stationary policy that induces a unichain Markov chain, with the only one closed communicating class being equal to group Gs. From the result in Section 2.2, we know the average cost ac(0) for policy r is IC( * /ac = — s if the closed communicating class only contains one group Gs. So ac(0) = B N = J31(e—r) i=ç+ 1 < K.Ef”(C) where Ej5T (C) is the expected cost for the optimal fractional solution to the static view selection problem. • 59 This theorem leads to the following corollary: Corollary 3.2.2 The policy r in Figure 3.5 is K-competitive to the optimal MDP policy, i.e., ac(6) K. acT(O)for all starting state 0 where acO°T(8) is the optimal average costfor any starting state 6. Proof In Chapter 2 we have known that the optimal average cost acO’T (0) is equal to C,, the optimal expected cost of the static selection. Thus the following steps hold: ac(6) < K.E’T(C) < KCmjn < K.acO’T(0) I In fact, for any MDP policy, the lower bound of the competitive ratio with respect to the optimal fractional solution of the static selection problem is K. This fact is given by the following corollary: Corollary 3.2.3 Any policy v cannot achieve better competitive ratios smaller than K with respect to the optimal fractional solution of the static selection prob lem, i.e., ac(0) K. ELOiT (C)for any starting state 6 where n is an arbitrary policy and 0 is any starting state for 2r. Proof ac(O) > ac°T(6) = Cmin K.EL°(C) From Theorem 3.1.3 I 60 3.2.2 Comment To Related Works In WATCHMAN [311 Scheuermann described a similar algorithm. Their algorithm admits a view only if the admission can increase the density of all the views in cache when a fault occurs. He proved under a constraint model where the sizes of the cached retrieved views are relatively small when compared with the total cache size K and it is always possible to utilize almost all space in the cache, the algorithm is optimal. This, however, does not imply the optimality for general cases since it wipes off the distinction of arbitrary sizes and thus eliminates the difficulty of the problem. This algorithm is in fact K-competitive just as what we have shown in this chapter. 61 Chapter 4 Conclusion and Future Work As a conclusion of this thesis, we discussed the view selection/caching problem under IRM when the views are non-correlated to each other. we show that for any replacement policy (random or deterministic) under the independent reference model, the optimal long-run average metric cannot be better than that of the static selection algorithm. We presented both optimal and competitive algorithms, and revealed the lower bound of the competitive ratio. However some problems still remain in this area, so in this chapter we will discuss the problems remaining and the problems open for future study. 4.1 The Problem Remaining The following problems are found within the topics discussed in the thesis: 4.1.1 Overtaking Optimal Policies In this thesis we were focusing on the average reward/cost criteria in seeking an optimal MDP policy. This criteria is not very selective, since we may have multiple average optimal policies. In order to discriminate the policies further, researchers introduced overtaking optimal policies: Definition Overtaking Optimal A policy n is Overtaking Optimal if for each starting state in the system, the total reward/cost of r’ is better than any other policy ir at any time. 62 Thus our task is to find and prove such optimal algorithms. 4.1.2 The Problem For the Naive Version Problem, Online Case This thesis mainly discussed the problem under IRM, which assumes the fixed and static distribution is known. If we don’t know any thing about the future, the case is called online. In that case we can only predict the future through the past. There is a fold of policies evicting the view with smallest utility value function LocalityFactor1* where LocalityFactor1is a simulation of the probability distributions. GreedyDual-Size,LCB-K, LCVT all falls into this category. In the thesis we have shown that under IRM, any policy cannot have competitive ratio smaller than K with respect to the optimal fractional policy (if we define one, oth erwise we use the optimal fractional solution for static selection). So for the online case we may wonder if it is possible to have the performance if we could precisely predict the distribution at any time. The conjecture is that even if the distribution at time t is known, there is still a bound of K to the optimal fractional solution. 4.1.3 The Problem For The Dependence Case The views in this thesis assumes no dependence relations among the views. We have shown by experiments that if we decompose the views and build the depen dence tries among them, the query answering cost can be improved. However in the data cube case, it was shown that the view selection problem for general partial order is not approximable, i.e.,, the expected cost to answer a query doesn’t have a meaningful approximation bound. We may be wondering if this applies to the trie case in our problem. Meanwhile, the researches done for the view management problem in the data warehouse area are mainly static, i.e., they select the views in one lump sum with out considering the work load change. Some work was done in managing the view dynamically, such as WATCHMAN and Dynamat, however the policies they use were not creative or didn’t have strong theoretical results. Moreover, they didn’t consider the dependence relationships among the views. Therefore our task is to design some real dynamic systems in managing the views for both the IRM and online case, taking the dependence tries into consider 63 ation. There’re certainly a lot of things to explore in this area. 64 Bibliography [1] pages 2 [2] pages 1 [3] pages 2 [4] — pages 1 [5] - pages 2 [6] pages 2 [7] 0. Bahat and M. Makowski. Optimal replacement policies for non-uniform cache objects with optional eviction. In IEEE INFOCOM Conference, volume 1, pages 427—437, 2003. paees iii, 13, 17, 20 [8] J. A. Barnes. Human relations. In Human Relations, pages 39—58, 1954. — pages 2 [9] L. Belady. A study of replacement algorithms for virtual storage computers. IBM Systems Journal, 5:78—101, 1966. -- pages 9 [10] P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. In Proceedings of the 1997 Usenix Symposium on Internet Technologies and Systems (USITS-97), Monterey, CA, 1997. 1.JRL .edu/cao97costaware. html. —V pages 9 [11] M. Chrobak, H. Karloff, T. Payne, and S. Vishwanathan. New results on server problems. In SIAM Journal on Discrete Mathenwtics, pages 291—300, 1990. --- pages 8 [12] K. L. Chung. Markov Chains with Stationary Transition Probabilities. Springer, 1960. -— pages 29 65 [131 E. Coffman and P. Denning. Operating Systems Theory. Prentice Hall, 1973. pages iii, 9, 10, 20, 38, 41 [14] E. Cohen and H. Kaplan. Caching documents with variable sizes and fetching costs: An LP-based approach. Algorithmica, 32(3):459—466, 2002. URL citeseer. ist.psu .edu/cohen99caching.html. - pages 57 [15] E. Cohen, H. Kaplan, and U. Zwick. Competitive analysis of the LRFU paging algorithm. Lecture Notes in Computer Science, 2125:148—??, 2000. URL -- pages 8 [16] J. L. Doob. Stochastic Processes. John Wiley, Sons, 1953. pages 29 [17] A. Fiat and Z. Rosen. Experimental studies of access graph based heuristics. In 8th ACM-SIAM Symp. on Discrete Algorithms, 1997. pages 10 [18] H. Gupta. Selection of views to materialize in a data warehouse. In IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 24—43, 2005. — pages 6 [19] H. Gupta and D. Srivastava. The data warehouse of newsgroups. In The 7th International Conference on Database Theory, pages 471—488, 1999. pages 6 [20] H. Gupta, V. Harinarayan, A. Rajaraman, and 3. D. Uliman. Index selection for olap. In The 13th International Conference on Data Engineering, pages 208—219, 1997. — pages 6 [211 V. Harinarayan, A. Rajaraman, and D. UlIman. Implementing data cubes efficiently. In ACM SIGMOD, volume 25, pages 205—2 16, 1996. — pages 6, 43 [22] S. Irani. Page replacement with multi-size pages and applications to web caching. In Twenty-ninth annual ACM symposium on Theory of computing, pages 701—710, 1997. pages 8 [23] H. Karloff and M. Mihail. On the complexity of the view-selection problem. In Symposium on Principles ofDatabase Systems, The 18th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 167—173, 1999. pages 6, 43 [24] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack problem. Springer, 2004. —* pages 43, 47 66 [25] Y. Kotidis and N. Roussopoulos. Dynamat: A dynamic view management system for data warehouses. In In Proc. of the ACM SIGMOD Conference, pages 371—382,1999. paces 27 [26] D. Lee, J. Choi, J.-h. Kim, S. H. Nob, S. L. Mm, and Y. Cho. On the existence of a spectrum of policies that subsumes the least recently used (Iru) and least frequently used (ifu) policies. In The ACM SIGMETRICS’99 Conference, 1999. — pages 8 [27] E. J. O’neil, P. E. O’neil, and G. Weikum. The lru-k page replacement algorithm for database disk buffering. In SIGMOD, pages 297—306, 1993. pages 8 [28] E. Otoo, F Olken, and A. Shoshani. Disk cache replacement algorithm for storage resource managers in data grids. In In. Proc. of the IEEE/ACM SC 2002 Conf on Supercomputing. Los Alamitos. IEEE Computer Society, pages 1—15, 2002. pages 57 [29] M. L. Puterman. Markov Decision Process: Discrete Stochastic Dynamic Programming. John Wiley & SONs,INC, 1994. pages 24, 29, 30, 32, 34 [30] S. M. Ross. Introduction to Stochastic Dynamic Programming: Probability and Mathematical. Academic Press, Inc., 1983. ISBN 0125984200. pages 28 [31] P. Scheuermann, J. Shim, and V. Radek. Watchman: A data warehouse intelligent cache manager. In Proceedings of the 22nd VLDB Conference, 1996. — pages 43, 57, 61 [32] A. Shukia, P. Deshpande, and I. F. Naughton. Materialized view selection for multidimensional datasets. In Proceedings of the 24rd International Conference on Very Large Data Bases, pages 488 — 499, 1998. — pages 43 [33] V. V. Vazirani. Approximation Algorithms. Springer, 2001. —* pages 48 [34] N. Young. The k-server dual and loose competitiveness for paging. Algorithmica, 11:525—541, 1994. pages 8 [35] N. E. Young. On-line file caching. Technical Report PCS-TR97-320, Dartmouth College, 1997. - pages 7, 57 67


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items