Query Answering Using Views for X M L by Zheng Zhao B S c . H o n . , T h e University of Western Ontario, Canada, 2002 A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T O F THE REQUIREMENTS FOR THE DEGREE OF Master of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The University of British Columbia M a r c h 2005 © Zheng Zhao, 2005 Abstract The problem of answering query using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database. A s X M L becomes the standard of data representation and exchange over the internet, the problem has recently drawn more attentions because of its relevance to a wide varieties of X M L data management problems, there is a pressing needs to develop more techniques to solve it for X M L data effectively and efficiently. We study a class of X P a t h queries and materialized views which may contain child, descendant axis and predicates. We first describe an algorithm to find the maximally-contained rewritings i n the absence of database schema. We then present an efficient algorithm to search the maximally-contained rewriting under choice-free acyclic schema and prove the uniqueness of the maximally-contained rewriting. F i nally we show its performance experimentally by extending our algorithm to answer queries i n X Q u e r y expression. ii Contents Abstract ii Contents iii List of Figures v Acknowledgments vii Dedication viii 1 Introduction 1 2 B a c k g r o u n d and P r o b l e m Studied 5 3 2.1 X P a t h and Tree Pattern Queries 5 2.2 Materialized X P a t h Views 7 2.3 Query Containment and Query R e w r i t i n g 7 2.4 Schema and D T D s 8 Q u e r y Answering U s i n g Views without Schema 3.1 Sound Rewritings and M a x i m a l Rewritings 3.2 A l g o r i t h m and T i m e Complexity iii 9 9 13 4 5 3.2.1 Algorithms 13 3.2.2 T i m e Complexity 18 Q u e r y A n s w e r i n g U s i n g Views in the Presence of Schema 19 4.1 Constraints from Acyclic choice-free D T D 20 4.2 Decidability of Containment Under Acyclic Schema 24 4.3 Query Answering Using Views Under D T D 33 4.4 Algorithms and T i m e Complexity 36 E x p e r i m e n t a l Results 40 5.1 Query Set 41 5.2 Savings and Overhead on Queries Answering using Views 45 5.2.1 Savings on useful views 45 5.2.2 Overheads on useless views 5.2.3 Various number of views 46 5.2.4 V a r y i n g query size 47 6 Related W o r k 7 Conclusion ...... 45 , 50 52 Bibliography 53 iv List of Figures 3.1 H e l p Functions 14 3.2 A l g o r i t h m to find rewriting of Q using V 15 3.3 Schemaless Case E x a m p l e 16 4.1 Duplicate D T D Example 23 4.2 Theorem 4.1 Proof - Case a . 32 4.3 Theorem 4.1 Proof - Case b 34 4.4 F i n d i n g I C from D T D 36 4.5 F i n d i n g C C from D T D 37 4.6 A p p l y Chase on Q 38 5.1 Selection Queries and Views on auction.dtd 41 5.2 J o i n / G r o u p B y Queries and Views on auction.dtd 42 5.3 G T P s built for Q3 44 5.4 Saving R a t i o - Useful Views 46 5.5 Overhead R a t i o - Useless Views 47 5.6 Overhead R a t i o - Various Number of Useless Views 48 5.7 Saving R a t i o - Various Number of Useless Views 48 5.8 Saving R a t i o - Various Query Size 49 v 5.9 Overhead Ratio - Various Query Size vi Acknowledgments I am deeply appreciate my supervisor D r . Laks V . S . Lakshmanan who provided me invaluable guidance and support. W i t h o u t h i m , this would never have been completed. I would like to thank all the members i n database lab, especially Dr.George Tsiknis and Wendy H u i W a n g for their help i n m y work. Z H E N G ZHAO The University of British Columbia March 2005 vii my parents and husband, for their endless love and continuous encouragement. viii Chapter 1 Introduction T h e problem of answering query using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database [7], T h i s problem is relevant to many data management problems. One of the major context where the problem of answering queries using views is considered is data integration and data warehouse design where the efforts focus on searching a maximally-contained rewriting, the best results possible. D a t a integration systems combine data residing at a multitude of autonomous data sources, and provide a uniform query interface, called global schema, which can be queried by the user. In the design of a data integration system, we need to make a basic decision which is related to the problem of how to specify the relation between the sources and the global schema. There are basically two approaches for this problem. T h e first approach, called global-as view ( G A V ) , requires that the global schema is expressed i n terms of the data sources. T h i s means that every concept of the global schema is associated w i t h a view over the data sources, so that its meaning is specified i n terms of the data residing at the sources. 1 In the second approach, called local-as-view ( L A V ) , the global schema is specified independently from the sources, and the relationships between the global schema and the sources are established by defining every source as a view over the global schema. In the area of data warehouse design we need to choose a set of materialized views i n the warehouse to improve the query performance. In this case, the most important step is to select a set of views to materialize that answers all the queries of interest while minimizing the total query evaluation and view maintenance cost. W h e n a query is posed, it is evaluated locally, using the materialized views. Accessing the original data sources are avoided mainly because either the original sources are not accessible any more or it costs too much. B o t h problems are translated into the problem of query rewriting using views i n which we often need to settle for a contained result which is a subset of the original query result rather than an equivalent one because the given materialized views may not cover the entire database. T h e problem of rewriting queries using materialized views has been extensively studied i n the relational world. M a n y algorithms were developed for a specific area of applications [3, 6, 13, 20, 7] such.as the bucket algorithm, the inverse-rules algorithm, the M i n i C o n algorithm, etc. In contrast, this problem for X M L data management has not been fully explored. Some of the existing work is outlined i n the Chapter 6 Related Work. X M L has become the standard for data representation and exchange over Internet. W i t h W 3 C ' s recommendation, XQuery(17] emerges as the standard query language for X M L and XPath[16] is a language for navigating X M L documents which is embedded i n XQuery. B o t h these languages are based on a basic paradigm of finding bindings of variables by matching tree patterns against a database. Similar to relational databases, the problem of finding a rewriting of X Q u e r y / X P a t h queries using a set of X P a t h views is relevant to a wide varieties of 2 X M L data management problems. Besides those two major applications we mentioned above, this problem is also related to semantic web applications as illustrated in [8] when the query is posed over the schema of source S and we wish to reformulate it over the schema of target T which is the schema neighbor of S. T h e problem of refomulating Q is known as answering queries using views. Therefore there is a pressing need to develop better techniques to solve the problem of rewriting queries using materialized views effectively and efficiently. In this thesis, we consider this problem for X P a t h expressions w i t h / w i t h o u t a database schema. Informally, we define the problem as following. Suppose we are given a query Q, and a set of previously materialized view definitions V\,... ,V n all expressed i n X P a t h . Is it possible to answer query Q using only the answers to the views V\,..., V without accessing accessing the database? If so, how? W h e n n the database schema is given, the query and views are over the same schema. C u r rently we concentrate on X P a t h expressions containing child, descendant axis and predicates. T h e specific contributions of this thesis are the following: • We propose an algorithm to check when a view is usable to answer a X P a t h query and find maximally-contained rewritings i n the absence of schema. We show the lower bound of the time complexity is E X P T I M E . • W e show that containment for XP{ '[ /,// ^ can be decided i n P T I M E under acyclic choice-free D T D s . • We describe a P T I M E algorithm to find the maximally-contained query rewriting using views under acyclic choice-free D T D s . • We introduce an approach to answer an X Q u e r y query using X P a t h views by extending our algorithm and present detailed experimental results to show the 3 performance. T h e rest of the thesis is organized as follows. In Chapter 2 we describe the class of X P a t h fragments and database schema we studied. We present our algor i t h m i n the absence of schema i n Chapter 3. In Chapter 4, we prove that for tree pattern queries i n XPVJM 1>, five types of necessary and sufficient constraints i m - plied by choice-free, acyclic D T D can be used to decide query containment problem and extend to solve the problem of a query rewriting using views using a P T I M E algorithm. We provide experimental results i n Chapter 5 where we illustrate how to use our algorithm to answer X Q u e r y query using X P a t h views. Finally, we discuss related work i n Chapter 6 and conclude i n Chapter 7. 4 Chapter 2 Background and Problem Studied 2.1 XPath and Tree Pattern Queries A n X M L database is a finite rooted ordered tree T — (N,£,r,\), where A / repre- sents element nodes, £ represents parent-child relationship, A denotes the labelling function to assign a tag w i t h each node, and r is the root. Associated w i t h each node is a set of attribute-value pairs. In our work, we do not consider order any further. Tree pattern queries, introduced i n [1], capture a useful fragment of X P a t h . A tree pattern query ( T P Q ) is a triple Q — (N,E,F), where (N,E) is a rooted tree, w i t h nodes N labelled by variables, and w i t h E = E L) Ed consisting of two C kinds of edges, called pc- (E ) c and ad-edges (Ed), corresponding to the child and descendant axes of X P a t h . A distinguished node i n N (shown boxed i n Figure 3.3) corresponds to the answer element. T h e path from root node to the distinguished 5 node is the distinguished path. F is a conjunction of tag constraints ( T C s ) , valuebased constraints ( V B C s ) , and node identity constraints (NICs). T C s are of the form $x.tag = t, where t is a tag name. V B C s include selection constraints $x.val relop c, $x.attr relop c, and join constraints $x.attr relop $y.attr', and $x.val relop $y.val, where relop G { = , ^ , > , < , > , <}, attr, a t t r ' are attributes, vai represents content, and c is a constant. N I C s are $x idop $y where idop G {=!7^}- Q is join-free if it contains no join constraints and no N I C s . W e assume no disjunctions appear i n V B C s and queries are join-free throughout the thesis w i t h a few clearly identified exceptions. We denote the nodes of a query Q by N(Q) and the nodes of a view V by N(V) . T h e root nodes of Q and V w i l l be denoted by R(Q) and R(V) respectively. We use dQ and dy to denote the distinguished nodes of query Q and view V. T h e distinguished paths i n Q and V are denoted by DQ and Dy (i.e. the paths i n Q and V from R(Q) to dQ and R{V) to dy respectively). For any node x i n Q or V, the tag name associated w i t h that node w i l l be denoted by tag(x) and the value based constraints associated w i t h that node w i l l be denoted by VBC(x). Answers for T P Q s are formalized using homomorphism. A homomorphism is a function h : query Q —> a tree T w i t h the following properties: 1. h(R(Q)) = h(R(T)); 2. V x G Q,tag(x) = tag(h(x)); 3. Var,y G Q , i f (x,j/) is a pc edge i n Q then (h(x),h(y)) must be a pc edge i n T ; 4. Va;,y G Q, if (a;,y) is an ad edge i n Q then (h(x),h(y)) must be a path i n T . 6 2.2 Materialized XPath Views We consider X P a t h views are i n the class of copy semantics which implies that views store copies of answer elements. This implies that X P a t h views can be used to answer X P a t h queries w i t h subsequence operations o n the results of the view without navigating to the parent or ancestors. Since we consider join-free X P a t h query i n our work, only a single view would be involved i n rewriting if it is usable. For the ease of readability, we denote an X P a t h query and a view by Q and V respectively. 2.3 Query Containment and Query Rewriting Query containment is a necessary condition for rewriting query using views. A s proven i n [14], for any wildcard-free X P a t h queries Q and Q', Q' C Q iff there is a containment mapping from Q —» Q'. A containment mapping is a function h : Q -> Q' w i t h the following properties: (1) h(R(Q)) = h(R(Q'))\ (2) V x e Q,tag(x) = tag(h(x)); (3) V : r , y <E Q, if (x,y) is a pc a edge i n Q then (h(x),h(y)) must be a pc edge i n Q ' ; (4) Va:,y 6 Q, if (x,y) is a ad edge i n Q then (h(x),h(y)) must be a p a t h i n Q',which may include pc edges and/or ad edges. In our context, the correctness of the rewriting is verified by using query containment. We say that Q is rewritable using V if there exists an X P a t h expression E such that for every X M L database D, E o V(D) C Q(D) then E is said to be a sound rewrite of Q using V. In addition, our goal is to find m a x i m a l sound rewriting(s). A sound rewriting E of Q is said to be maximal if there has no E' such that for every X M L database D, Eo V(D) C E' o V(D). 7 2.4 Schema and DTDs We are especially interested i n studying the problem of rewriting query using views i n the presence of schema. W e abstract the schema of a database (in our work, we only consider D T D s ) as a graph w i t h nodes corresponding to tags and edges labelled by one of the quantifiers '?, 1, *, + ' w i t h their standard meaning of 'optional', 'one', 'zero or more', and 'one or more' respectively. A l l tags i n D denotes set a. The set of trees satisfying D T D D is denoted S A T ( D ) . A X P a t h query Q is satisfiable if there is a tree T 6 SAT(D) such that Q(T) ^ 0 . Otherwise, Q is unsatisfiable. T h e satisfiability of T P Q s w i t h / w i t h o u t schema is recently studied i n [12]. Without losing the generality, we assume that both query Q and view V are satisfiable w i t h regard to D T D . D T D s provides constraints on the structure of X M L documents. Hence, while Q\ may not rewritable using Q2 i n general, it may be the case that given a D T D D, Q is rewritable using V when both satisfy D, by applying a compensation expression E on V. For ease of exposition, we initially focus on acyclic choice-free D T D s . If C is a set of constraints inferred by D T D , then SAT(C) denotes the set of trees i n Ts which satisfy each constraint i n C. Problem Statement: W e formally define the problem of query answering using views (QAV) for X P a t h fragment, denoted XPU' ' '[ / / ^ i n our context, as follows: G i v e n a query Q and a view V b o t h expressed i n X P a t h , check whether V is usable for answering Q. If so, find all maximally-contained rewriting(s) of Q using V, w i t h / w i t h o u t choice-free acyclic D T D . 8 Chapter 3 Query Answering Using Views without Schema In this chapter, we illustrate an algorithm for computing maximally-contained rewriti n g ^ ) i n the absence of schema and prove the soundness of our algorithm. W e firstly give some useful definitions, provide a detailed proof and then present the algorithms. A n example follows to show that time complexity can not be better than E X P T I M E . 3.1 Sound Rewritings and Maximal Rewritings D e f i n i t i o n 3.1 ( E m b e d d i n g ) An embedding / : Q ~» V is a partial function from N(Q) to N(V) satisfying the following properties. 1. If the first character in the XPath expressions Q and V are / then, f(R(Q)) = R(V). 2. \fx £ Q, f is defined on x implies (tag(x) = tag(f(x)) A (VBC(f(x)) —> 9 VBC{x)). 3. V x , y G Q, f is defined on x and y, (x,y) (fi ), f(y)) x i s a is a pc edge in Q implies that P edge in V. c 4- Vx, y G Q, f is defined on x and y, (x, y) is an ad edge in Q implies that there exists a path from f(x) to f(y) in V which may include pc or ad edges. 5. Vx G Q, then f is defined on every ancestor of x (upward closed). Definition 3.2 (Useful Embedding) / : Q ~» V is a useful embedding, pro- vided: 1. f is an embedding; 2. Vx e DQ, if f(x) is defined, f(x) 6 Dy; 3. Let P — { i > o , i > i , . . . ,Vk} be any path in Q. (a) either f is defined on VQ, V \ , ..., v^; or (b) Vi : f(v{) is defined, f(vi) G Dy and suppose vi = max{i\f(vi) }, then either f(v{) = dy or (yi,vi+i) is defined is an ad edge in Q. Definition 3.3 ( C A T : C l i p p e d A w a y Tree) Let the distinguished path in Q, Dy {vo,vi,... ,Vk}. A Clipped Away Tree (CAT) is a subtree of Q rooted at v{, s.t. f is not defined on Vi but is defined on Definition 3.4 (Extension of Useful E m b e d d i n g ) A useful embedding g is an extension of another useful embedding f if dom(f) C dom(g). In this chapter, we denote different rewritings (E o V) and (Ef o V ) as R g g and Rf. B o t h are the sound rewritings derived from the useful embeddings g and / respectively. 10 Theorem 3.1 Let Q,V G XPt/.//.[ 1} and Q is join-free. Q is rewritable using V iff there exists a useful embedding f : N(Q) ~* N(V). Proof (Only if) Let E o V be a sound rewrite of Q using V , i.e. VdatabaseT : E o y ( T ) C Q(T). Let h : AT(Q) -> N(E o V) be containment mapping, s.t. Vx G N(Q),h(x) G N(V) or /i(x) G N(E). We construct a useful embedding / : N(Q) ~> JV(V) as follows: Vx G Q :fo(x)G AT(V), /(x) = /i(x). By the definition of containment mapping, f is a valid embedding from N(Q) ~> AT(V). / also satisfies all path constraints defined in useful embedding because: 1. Vx G N(D ) and x is defined in / : / ( x ) C N(D ) since / ( i 2 ( Q ) ) = f(R(V)) Q and V /(DQ) = /JEOV- 2. Mark nodes of Q top down as follows. If x G Q and h(x) G V, mark the node as V, else mark it as E. The marks on all paths from R(Q) to any leaf node are of the form V*E*. Let x be the last node in any path marked V and y be the first node in the path marked E. Since E is a valid rewrite, (a) either x is mapped to d , O R v (b) (x,y) is an ad edge and x is mapped to a node in D v (If) Let / : N(Q) ~> N(V) be a useful embedding. We construct E and extend f as follows: Vx G N(Q) s.t. T y x G dom(f) and By s.t. edge(x,y) G Q a n d y ^ dom(f). Let denote the subtree rooted at y. Do the following: add a copy T' of T as a y child subtree of d and define for every node z G T ,f(z) v y — z' where z' is the y corresponding node in Ty. If (x, y) is a pc(ad) edge, then (d ,y') is a pc(ad) edge. E v 11 contains a l l such TyS. T h e extended / is the required containment mapping because V z defined i n the useful embedding is defined i n / and Vx N O T defined i n the useful embedding, f(x) is its image i n the corresponding T . y f is a valid containment mapping from Q ^> E oV. Therefore, E o V is a sound rewriting of Q using V. • For efficiency concerns, we a i m to generate only m a x i m a l rewrites. T h e following lemma makes this goal possible to achieve. We w i l l describe the algorithm in the next section. Lemma 3.1 Let a useful embedding g is an extension of f. Rf C R iff dom(f) C g dom(g) and Vx 6 dom(g) — dom(f) \ 3y £ dom(g) and edge(x,y) £ Q: (x,y) is an ad edge. Proof (Only if) We know that g is an extension of / . M a r k every node x of Q top down as follows: if x £ dom(f), mark the node as F; if x £ dom(g) — dom(f), mark the node as G\ else mark it as E. T h e marks on all paths from R(Q) to any leaf node are of the form F*G*E*. L e t u be the last node i n any path marked F, x be the first node i n the path marked G, y be the last node i n the p a t h marked G and z be the first node i n the path marked E. F r o m the way we construct Rf and R , we know a l l subtree T w i l l be copied as T' to attached to d using pc(ad) g x x v edge i n Rf if (u, x) is a pc(ad) edge; a l l subtree T w i l l be copied as T' to attached z to d i n R v g z using pc(ad) edge i f (y, z) is a pc(ad) edge. Since Rf C R , there is a g containment mapping h : N(R ) g —> N(Rf). z', the image of z i n R g is mapped to z", the image of z i n Rf which is a node i n T' . Since p a t h ( d „ , z") i n Rf through x node x' must contain at least two edges. Therefore, edge (d ,z') i n R v g must be ad edge. T h u s the pre-image of u i n Q, (y, z) must be an ad edge i n Q. (If) W e show Rf C R by constructing a containment mapping h : N(R ) —> g N(R ) f g as follows: 12 1. V x G N(V) of R , h is defined as its image i n N(V) of Rf since Rf = Ef oV g and R = Eg oV, g 2. V u G Ty, a child subtree of Dy rooted at y i n R , s.t. the pre-image of y i n g Q has an edge to node x and a; G dom(f): 3T \ a child subtree of Dy m Ef, y T is isomorphic to T \ h(u) is defined as its image i n T' . Since we derive R y y and Rf from y g 3 and / , we know Vy whose pre-image ^ dom(g) : 3 edge (£, y ) i n <5 and t G dom(g) fl dom(f) : T is duplicated in i ? and £Jf, y g 3. VTy, a child subtree of Dy i n R : the pre-image of y i n Q has an edge to node g x s.t. x G dom(g) — dom(f) : 3T , a subtree of Dy i n i?y s.t. T is isomorphic y y to Ty. y can be mapped to y' since (Dy,y) is an ad edge and (Dy,y') is a p a t h m Rf. V u £ T , y is defined as its image i n T . v • 3.2 3.2.1 Algorithm and Time Complexity Algorithms We first introduce three help functions which are used to simplify the problem solving. T h e first and second function are quite straight forward so we just give a brief description rather than details. T h e third one is the most complicated so we would show it step by step. We next introduce the m a i n procedure to find a l l useful embedings from Q to V. We show the execution of the above algorithm using the example of F i g - ure 3.3. For readability, whenever the tag constraint $x.tag = t appear i n Q and V, 13 1. F u n c t i o n : m a p - D P a t h Input: The distinguished path of Q and V, Dv and DQ. Let JV(ZV) ={vo,v\,..., Vk} and N(DQ) ={?o, 91, • • •, 9m}Output: A set of valid (partial) path mappings H from N(DQ) ~» N(Dy)Each h £ H will preserve path and tag obligations and for every unmapped node qi such that /i(<j;_i) is defined in h, if pc-edge(gi_i,qi) then h(</i_i)= ^fc(the distinguished node of V).It will also generate a candidate list Cqi for node q; s.t. Vj 6 Cqi if there exist a h' such that h(qj) = vj. 2. Function: map-Subtree Input: One node qi in Q and the other node Vj in V. Output: Return the total mapping if the tree rooted at qi in Q has a containment mapping to the tree rooted at VJ in V. Otherwise, it will return NULL. We implemented the PTIME containment mapping algorithm introduced in [1]. 3. Function: m a p - T o - D v Input: One node q[ in Q and the other node Vj e Dy. Output: Return a set of all valid (partial) tree mappings T from the tree rooted at q\ in Q to the fragment of D starting from Vj. If no mapping exists, return NULL. v As a valid tree mapping t € T, it will preserve path and tag obligations and for every unmapped node q'j such that q'j_i is mapped, if pc-edge(g^_ ,qj) then t(g^_ )= ut.(the distinguished node of V).It will also generate a candidate list Cq[ (root of the tree) for q' s.t. Vj S CgJ if there exist a t such that t(q' ) = Vj. 1 1 { { Figure 3.1: Help Functions 14 Procedure: get-UsefulEmbeddings Let the distinguished path Dv ={vo,v\,... ,i>fc} and DQ DQ and Dv denote q' and v' respectively. ={qo,<Ji, • • • ,<?[}• Input: Q and V. Output: All useful embeddings in set F. Assign unique id to each node in Q and V; H = map-Dpath(DQ, Dv); If H is empty, return NULL; For each node q, G D{Q) s.t. Cqi / 0 1. For each child node q' of qi which is not on DQ (a) For each node v' in Cqi i. ii. iii. iv. If pc(cj', <7i),get all pc-child nodes Vj of v' s.t. tag(uj)= tag(q') If ad(<?', <j;),get all descendant node Vj of v' s.t. tag(i>j)= tag(q') Save all VJS in set V For each Vj in V A. For Vj is not in Dv, map-Subtree(qi, VJ){ B. If success, record the mapping and add Vj to Cqi. Break;} C. If VJ is in Dy, map-To-Dv(</;,i>j){ D. If success, record all mappings, and add Vj to Cq,.Break; E. If fail and pc(g', qi), prune v' from Cqi. F. If fail and ad(q', qi), add 0 to Cqi} Use all pre-stored candidate list of query node's mapping, output all useful embeddings. Figure 3.2: Algorithm to find rewriting of Q using V 15 Node not lying on the we write t right next to $x i n the figure. /a Q: J /a I b , $9 j : • no $7 j m 5 " 1 Rl: R2: R3: / / r d 1 - R4: / d d 1 1 \ n R5: \\ <\X 1 R6: \ m Figure 3.3: Schemaless Case E x a m p l e 1. C a l l function m a p - D p a t h ( D g , Dy)- H contains only one mapping which is h = {1 -+ l ' , 2 -> 2', 3 - f 3'}. The candidate lists are C\ = l ' , C = 2 ' , C = 3'. 2 3 The C A T is the subtree rooted at node 4. 2. Node 1 i n Q has no other children besides node 2, so 1 w i l l be skipped. 3. Node 2 has one child 6 which is not on DQ. 16 h(2)=2' and ad(2,6) so 6 may map to two nodes i n V , 4' and 7' which are descendants of node 2'. W e always try to map the candidate which is not on Dy. B y doing that we may get the total mapping of the subtree rooted at 6. So we test map-Subtree(6, 7'), it fails. T h e n test map-To-Dv(6, 4'), we got three mappings: K2\ = {6 —> 4', 9 —> 5', 7 -> 6'}; h2 = {6 -> 4', 9 -> 5'}; h2 = { 0 } . h2 implies the whole subtree 2 3 3 rooted at 6 w i l l be attached to d as part of rewrite. v 4. Node 3 has one child 10 which is not on DQ. h(3)=3' and ad(3,10). Same as the operation done on Node 2, we will obtain two mappings hZ\ = {10 —> 6'}; hS = {0}. 2 5. We generate a l l the embedding using combinations of C A T , h 2 and h 3 which gives 6 different embeddings: • /x = {1 —• 1', 2 —> 2', 3 - » 3', 6 • f = { l _» l ' , 2 4', 9 -> 5', 7 - » 6', 10 - • 6'}. 2',3 -> 3', 6 -+ 4', 9 -> 5', 10 2 • / 3 = {1->1 2-^2 ,3^3 ,10^6'}. • / 4 = {1 - * 1', 2 —> 2', 3 -> 3', 6 4', 9 • / 5 = {1 -> 1', 2 -> 2', 3 - » 3', 6 4', 9 - » 5'}. • / 6 = {l->l ,2->2',3->3',10-6 }. , , 6'}. , I / 5', 7 ^ 6 ' } . / 6. Finally, we generate six rewritings R l , R2,..., R 6 corresponding to six distinct useful embeddings fi, f , 2 fe i n the following way: for each / j , mark those nodes which is not defined i n / j i n Q , copy the branches and subtrees connected by those nodes and attach them to the distinguished node of V . Please refer to the figure for final results where the root node of each rewriting Ri, is the distinguished node of V. 17 A s we mentioned before, our goal is to generate m a x i m a l rewrites for efficiency. T o achieve this, we improve the above algorithm to be as "greedy" as possible i n searching the mappings based on the result of L e m m a 3.1. In the function Map-To-Dv(q, v'), we add prune procedure i n the end: ti w i l l be pruned from T if there exist tj i n T such that dom(tj) D dom(ti) and every unmapped node m which has an edge connected w i t h some node n e dom(tj) — dom(ti), (n, m) is an ad edge. I n the example, we will prune h2z from the mappings of node 2 i n Q because dom{h2^) C dom(h22) and there is only one node 7 which has parent node 6 i n dom(h22) —dom(h2^) and ad(6,7). However, we can prune neither ft.2i nor h22A l t h o u g h dorn(h22) C dom(h2\), 7 is i n dom(h2\) — dom(h22) which has a pc child which i n unmapped. Similarly, we can not prune b o t h mapping for node 10. I n the end, we now have four useful embeddings remaining: / 2 , / 3 , / s , / 6 which corresponds to four m a x i m a l rewritings R2, R3, R5, i?6 i n the figure. 3.2.2 Time Complexity We discuss the time complexity of Q A V problem i n the absence of schema using the example i n the Figure 3.3. The example shows that the mappings of the two subtree rooted at node 6 and 10 have two choices each even after the pruning procedure is applied, which results four distinct m a x i m a l rewrites. Therefore, the number of o p t i m a l output in the worst case would be exponential which implies it is impossible to have an algorithm to solve this problem better than E X P T I M E . 18 Chapter 4 Query Answering Using Views in the Presence of Schema In this chapter, we study the problem of answering queries using views under schema for the same class of X P a t h fragments as i n the schemaless case and currently consider only acyclic choice-free D T D as database schema. Since D T D provides constraints, we need to consider those w i t h the given query and views i n the problem. A s we show i n previous chapter, our algorithm i n schemaless case is based on containment mapping. W h e n the schema is available, we first solve the containment mapping problem under D T D and then extend our algorithm to find the m a x i m a l rewriting. W i t h o u t losing the generality, we assume that both Q and V are satisfying w i t h regard to D T D A . 19 4.1 Constraints from Acyclic choice-free D T D A t the beginning, we formally define five types of constraints implied by D T D s as follows. D e f i n i t i o n 4.1 ( S i b l i n g C o n s t r a i n t s ) Let t be a document tree satisfying DTD. If whenever a node labelled a in t has children labelled with each b G B, it has a child node labelled with c, t satisfies the Sibling Constraints(SC) a: B [ c. When B is 0, the SC is called child constraint/i<S/. D e f i n i t i o n 4.2 ( F u n c t i o n a l C o n s t r a i n t s ) Let the a document tree satisfying DTD. If no node labelled a int has two distinct children labelled with b, t satisfies the Functional Constraints ( F C ) a [ b[18]. D e f i n i t i o n 4.3 ( C o u s i n C o n s t r a i n t s ) Let t be a document tree satisfying DTD. If whenever a node labelled with a in t has descendant labelled with each b G B, it has a descendant node labelled with c, then t satisfies the Cousin C o n s t r a i n t s ( C C ) D e f i n i t i o n 4.4 ( P C C o n s t r a i n t s ) Let t be a document tree satisfying DTD. If whenever there is a path from node labelled with a to a node labelled with b, the path length from a to b is always 1, then t satisfies the P C C o n s t r a i n t s ( P C ) a\b. D e f i n i t i o n 4.5 ( I n t e r m e d i a t e N o d e C o n s t r a i n t s ) Let tbe a document tree satisfying DTD. If whenever there is a path from node labelled a in t to a descendant labelled with c, b must present on this path between a and c. t satisfies the Intermediate Node Constraints(IC) a, c : | b. 20 Sibling Constraints and Functional Constraints were first introduced i n [18] where set B contains multiple elements. We prove i n L e m m a 4.1 and 4.2 that S C and C C are b o t h unary when D T D is choice-free. L e m m a 4.1 SCs are unary when DTD is choice-free. Proof Let D T D be represented i n grammar notation. Because S C only involves parent and child relationship, each S C associates w i t h one production. We presents a production using a graph Gp such that the root node is the context node a, d u m m y nodes Di are used to factor out nested occurrences if any, leaf nodes are child nodes of a, and quantifies('*', '+', ' 1 ' , '?') are labels on the edges. We call it production graph. Assume that a S C inferred by D T D be a : B j c, B is a set of child nodes of a. Cases (a) D T D is duplicate-free: Clearly, the resulting production graph is a tree because each child element only appear once i n a production graph because child node of a appear at most once i n the right side of the production. Also c must be connect to G p w i t h edge labelled w i t h ' 1 ' or '+'. Otherwise c can not guarantee to present i n any case. Let node <p be the highest ancestor of c such that a l l edges on the path from <p to c are labelled w i t h ' 1 ' or '+'. There are two possibilities: • (a.l) ip — a: c is a guaranteed child node of a, therefore B = 0 i n S C ; • (a.2) ip — Di (some dummy node): T h i s means that if <p ^ 0 then c must present as a's child. Hence any leaf node hi except c itself reachable from <p can ensure it. So B = bj. S C a : hi J, c is unary. Cases (b) D T D allows duplicates: since a child node may appear multiple times i n the right side of the production, the production graph may contain a D A G 21 such that leaf node may have in-degree greater than 1. Same as i n Case (a), c must be connect to Gp w i t h edge labelled w i t h ' 1 ' or ' + ' to make S C a : B j c hold. Let node (p be the highest ancestor of c such that a l l edges on the path from tp to c are labelled w i t h ' 1 ' or '+'. There are also two possibilities: • (b.l)<p = a: same as i n Case (&.1)B — 0 i n S C ; • (b.2)(p = Di'. Since D T D is not duplicate free, a leaf node may be reachable from multiple paths. Hence for every leaf node bi reachable from Df. SC a : bi I c may not be true if there is a node £)/, unreachable from Di, but reachable to 6,. So bfs presence can be independent from c's presence. B u t S t i l l bi itself is sufficient to guarantee c's presence without requiring that such Di exists. S C s are unary. • T h e following is an example of choice-free D T D w i t h duplications. In example (1), there are multiple paths to b and c. S C a : b j c is not true. Example: One production of a choice-free D T D w i t h duplications and its production graph: a —> ((&?, c)+, (b*, e, c?)?)*. In the above example, we find that S C a : b j e is not true. A l t h o u g h there is one path from a to b through D I and D 3 where the presence of b is related to the presence of e, there is another p a t h from node a to node b through D 2 where the presence of b is not relevant to e. We denote D T D implies a specific constraint c as A N c. 22 a * Dl D2 D3 Figure 4.1: Duplicate D T D Example C l a i m 4.1 DTD A\=a:bii}.ciffV path Pj from a to hi in A : 3 a node di on Pj such that there exists a guaranteed path from di to c. Proof (If) Assume that every path from a to hi there is a node di such that the path from di to c are all labelled w i t h ' + ' or ' 1 ' . Obviously, for any valid instance t of A , i n any path from a to c i n t, c must present as a's descendant. Therefore, A N a : h JJ. c. (Only if) Assume that A 1= a : hi JJ. c and there exists one path Pi i n A from a to c such that there is a node dj on Pi that has an optional path to c. T h e n we can create a valid instance t of A i n which Pj is selected from a to hi and c is not present in the path starting at dj. T h i s is a contradiction. C l a i m 4.2 If DTD A Proof a : b JJ. c and A t a : bj JJ. c, tfien A • a : {b , bj} J| c. t F r o m C l a i m 1, we know if A ¥• a. : 6; JJ. c then there must exist a path P i from a to 6; such that none of node on Pi has a guaranteed path to c. Similarly, if A ¥ a : hj JJ. c then there must exist a path Pj from a to 6j such that none of node 23 on Pj has a guaranteed path to c. Assume that A ¥• a : 0 ij. c and J&fc, descendant of a and A N a : 6^ -IJ- c. We can create a tree t b y choosing P i and Pj and extend other paths and nodes as A required to make t valid to A . In t, b o t h bi and bj are a's descendants, but c is not present as a's descendant under our assumption of A . Therefore, A ¥ a : {bi, bj} J| c. • L e m m a 4.2 C C s are unary when DTD A is choice-free and acyclic. Assume that A N o : bi, bj JJ. c. F r o m C l a i m 4.2, we know if A Proof and A 4.2 a : 6j Jj- c, then A ¥ a : {bi, bj] JJ. c. Therefore, C C s are unary. a : ^ JJ. c • Decidability of Containment Under Acyclic Schema T h e correctness of rewritings need to be verified v i a containment mapping. Therefore, it is necessary to study the containment mapping problem first before solving the problem of answering query using views. In order to test query containment under a set of constraints C of ICs, P C s , SCs, F C s and C C s for Q G XP^'H^- ^, we introduce a variation of the chase, a procedure for applying constraints i n C to V : 1. Change ad edge to pc edge using P C : Let p G PC of the form a \ b . For a l l ad(a,b) i n V , change it to pc(a, b) i n V . 2. A d d guaranteed pc children using S C : Let s e SC of the form a : b J. c, where B — b\, ...,b . Let a be a node i n V w i t h pc children b\, ...,b , and a does not n n have a pc child labelled c. T h e n add pc edge(a,c) i n V where c is a new node. 24 3. Merge pc children using F C : Let / G FC of the form a j c. Let a be a node in V w i t h distinct children c\ and c-i labelled as c. T h e n merge c\ and c i n 2 V . (Note: we w i l l never need to merge ad children. If ad(a,b) is retained i n chase V , this means there exist multiple paths from a to b according to D . 4. A d d guaranteed intermediate nodes for ad-edges using I C : Let i G IC of the form a,c :| b. For all ad(a,c) i n (chased) V , insert b between (a, c) using ad edges. 5. A d d guaranteed ad children using C C : Let c G CC of the form a : b JJ. c. Let a be a node i n V w i t h a l l ad children b G B and if a has no ad children labelled w i t h c present i n V , add c as a's ad child i n V where c is a new node. We denote by Chasec(Q) the result of applying the set of constraints C to Q. T h e set of trees satisfying D T D A is denoted S A T ( A ) . Let C be a set of ICs, P C s , SCs, F C s and C C s implied by A , S A T ( C ) denotes the set of trees satisfying each constraint i n C . T h e following sequence of results present that C is sufficient and necessary to show A — containment of queries i n XPU'H^ ^ when A is choice-free and acyclic. Lemma 4.3 =SAT(C) Proof Q Let C be a set of ICs, PCs, SCs, FCs and CCs implied by A. Q Chasec{Q). =SAT(C) Chasec(Q) if for any document tree t satisfying C , Q(t) = Chase (Q{t)). c First we prove that a single application of each chase rule to an X P a t h query i n XPU'^^ ^ maintains equivalence w.r.t. C . T h e result then follows by an induction on the length of a chasing procedure. 25 1. Chase rule one only applies to ad edges i n (chased) Q. a [ b and Q' be the result applying p to Q. ad(x,y) w i l l be changed to pc(x,y) i n Q(T). Q' is same as Q except one i n Q'. Obviously Q' C Q because there is a containment mapping from Q to Q'. (x,y) Let p be the P C So Q' Qc Q- Let T S SAT(C) and Since Q satisfies C, there exists a homomorphism / i from Q to T . Since C implies p, T also satisfies p. If a node z labelled a has i n T must have a descendant w labelled b, then w must be z's pc child. Hence,h can be extended to a homomorphism from Q' to T without any change. So Q Qc Q'• 2. Chase rule two is applied only to pc edges i n (chased) Q. Let s be the S C a : b I I and Q' be the result of applying s to Q. Q' is Q w i t h one extra pc child u labelled / for some node v labelled a i n Q. Clearly Q' C Q because 3 a containment mapping g from Q to Q'. So Q' C c Q. Let T € SAT(C) and (x,y) i n Q ( T ) . Since Q satisfies C, there exists a homomorphism h from Q to T . Since C implies s, T also satisfies s. E v e r y node z labelled a i n T must . have a child w labelled I if z has a child labelled b. Hence, h can be extended to a homomorphism from Q' to T by mapping it to w. So Q CQ Q'. 3. Chase rule three only applies to pc edges i n (chased) Q . Let / be the F C a :[ I and Q' be the result of applying / to Q. Q' is Q w i t h pc children b\,...,bi labelled / merged to one node b labelled I for some node v labelled a in Q. Clearly Q' C Q because 3 a containment mapping g from Q to Q'. So Q' Qc Q- Let T G SAT(C) and (x,y) i n Q(T). Since Q satisfies C , there exists a homomorphism h from Q to T . Since C implies f, T also satisfies /. Every node z labelled a i n T have only have a unique child w labelled I. Hence, h can be extended to a homomorphism from Q' to T by replacing 26 h(h) = I, ....,h(bi) = I w i t h h(b) = I. So Q C C Q'. 4. Chase rule four only applies to ad edges i n (chased) Q. Let i be the I C a, c : J b and Q' be the result of applying i to Q. Q' is Q w i t h one extra node u labelled b inserted between an ad edge (a, c). Clearly Q' C Q because there is a containment mapping g from Q to Q'. So Q' CQ Q. Let T £ SAT(C) (x,y) i n Q(T). and Since Q satisfies C, there exists a homomorphism h from <5 to T . Since C implies i, T also satisfies i. If every node z labelled a i n T has a p a t h to w labelled c w i t h length greater than 1, then one node / labelled b must be present i n this path. Hence, h can be extended to a homomorphism from Q' to T by mapping u to /. So Q Qc Q'• 5. Chase rule five is applied only to ad edges i n (chased) Q . Let c be the C C a : b JJ. / and Q' be the result of applying c to Q. Q' is Q w i t h one extra ad child u labelled I for some node v labelled a i n Q. Clearly Q' C Q because 3 a containment mapping g from Q to Q'. So Q ' (x,y) i n Q(T). Q. Let T £ SAT(C) and Since Q satisfies C , there exists a homomorphism /i from Q to T . Since C implies c, T also satisfies c. Every node z labelled a i n T must have a descendant to labelled / if z has a descendant labelled w i t h 6. Hence, h can be extended to a homomorphism from Q ' to T by mapping u to w . So Q c c Q'. • L e m m a 4.4 Proof Ze£ A be a choiceless DTD. Q =SAT(A) Chasec{Q). Since S A T ( C ) contains S A T ( A ) , by L e m m a 4.3 L e m m a 4.4 holds. So we prove the soundness of the chase. • 27 Lemma 4.5 Let A be a choiceless acyclic DTD, C be the set of ICs, PCs, SCs, FCs and CCs implied by A, and Q be 1-1 homomorphic XPIA//-I 1> to a subtree of a tree in query satisfied with A. Chasec{Q) SAT(A). Proof Because Q is satisfiable w i t h A and the chase is sound, Chasec(Q) isfiable w i t h A ; hence there is a non-empty set of trees S G SAT(A) is a homomorphism from Chasec(Q) is is also sat- such that there to each tree i n S. Assume that ChaseciQ) 1-1 homomorphic to no subtree of a tree i n S. T h i s can only be the case if there is always a pair of child nodes i n Chasec(Q) which are mapped to a single node of a tree i n S. Let the child nodes be labelled b and have parent node labelled a. There are three cases i n Chasec(Q)'- 1. Node labelled a has two pc children labelled b: the F C constraint a I b must not be implied by A . Otherwise, Chasec(Q) would have merged two b pc children of a node. T h e n there must exist trees i n S w i t h unbounded number of pc children labelled b of node a. Therefore there would be a subtree to which Chasec(Q)is 1-1 homomorphic, a contradiction; 2. Node labelled a has one pc child labelled b and one ad child labelled b: the P C a\b must not be implied by A , otherwise ad(a,b) w i l l be chased to pc(a,b) in Chasec(Q)- T h i s means there are trees i n S w i t h an path from a-node to b-node w i t h p a t h length greater than 1 and there would be a subtree to which Chasec(Q) 1S 1"! homomorphic, a contradiction; 3. Node labelled a has two ad children labelled b: the only possible failure of homomorphism is that i n all trees i n S, a node has an unique b node as its descendant, which means there is a only one d t d path p from a to 6 and each node i n p has a unique child node i n p, then chains of ICs and F C s would be 28 implied by this duplicate-free A . Thus by the end of chase procedure, the two b nodes would have been merged by using ICs and F C s . Chasec{Q) is 1-1 homomorphic to some subtree of trees i n S, a contradiction. • D e f i n i t i o n 4.6 ( C o r e N o d e ) Let Q be A-satisfiable, of trees with a subtree that Chasec{Q) satisfying R C SAT (A), is 1-1 homomorphic to. be the set We call R the set for Q. Each tree in R has a core subtree to which Chasec(Q) homomorphic, is 1-1 and each node in the core subtree is called a core node. Each node which is not a core node is called a non — core node. F r o m the definition of Core Node, node i n Chasec{Q) may be mapped to a set of nodes X = x\, ...,Xi i n t G R. A l l Xi G X are core nodes. In addition, every node lying on the path from one core node x to another core node y is core node. L e m m a 4.6 Let A be a choiceless acyclic DTD, FCs and CCs implied by A, and R C SAT(A) SCs, and P and Q be XPf././/.[ 11 queries satisfied with A in which Chasec{Q) tree in R. If P ^SAT(A) C be the set of ICs, PCs, Q> f or has 1-1 homomorphism to a subtree in each each node w in P, either w can be mapped to a core node in every tree in R or w can be mapped to a non-core node in every tree in R. Proof Since Q is satisfiable w i t h A and chase is sound, Chasec(Q) w i t h A . Hence R ± 0 . Assume that P 5SAT(A) is satisfiable Q but there are trees ti,t 2 G R such that node w i n P can be mapped to only a core node i n t\ and only to a non-core node i n t . Let V — vi,...,v be mapped to i n ti. B y the definition of core node and the property of R , each 2 n node i n V also appear i n t . 2 be the set of core nodes to which w can According to our assumption, w can not map to any 29 subtree rooted at a node of V i n t<i- Because A is context-free, we can replace each Vi tree i n t\ w i t h the corresponding V{ tree i n £2 and obtain tree t[ which still i n set R . However, w can not be mapped to any node i n t[. Therefore P(t[) <5(£') 7^ 1 0; then P ^SAT(A) Q, is a = 0 while contradiction. • L e m m a 4.7 Let C be the set of ICs, PCs, SCs, FCs and CCs implied by A , and P and Q be X P l / . / A l II queries. P 2SAT(C) Proof (If) Assume that P c (Only If) Assume that P 2sAT(C) Q, then for a l l tree t G SAT(C), P{t) D Chasec(Q) Chasec{Q), , then P Chase (Q). From D Chasec{Q) 2 Chasec(Q). L e m m a 4.3, Q =SAT(C) Q(t). Q iff P hence P 2 S A T ( C ) 2 S A T ( C ) Q- is a quasi-instance which satisfies C and may contain ad edges. We can extend Chasec(Q) to a tree instance t' as following: for each ad edge(x, y), insert a node z i n between, (label(z) never appear i n A ) ; and connect z w i t h x and y using pc edge. Obviously t' G SAT(C) because z does not involve i n C and the extended p a t h x-z-y with' length 2 satisfies ad(x,y) obligation. So P(t') 2 Q{t'). F r o m the chasing procedure defined i n previous paragraph, there is a mapping from Q to Chasec{Q). T h e n result(t') Hence there is a mapping from Q to t' and result(t') G P(t') follows and there exists a mapping c from P to t'. Since z nodes we added i n t' never appear i n Chasec(Q) back to Chasec(Q) G Q{t'). and P, we can easily convert t' by replacing every two pc edges connected by z node w i t h an ad edge, c is a mapping from P to node i n Chasec(Q). mapping from P to Chasec{Q), Hence c is a containment and therefore P D Chasec{Q). T h e o r e m 4.1 Let A be choiceless acyclic DTD and C be the set of ICs, PCs, FCs and CCs implied by A . For XPU>H'\ 30 1> queries P and Q, P ^SAT(A) • SCs, Q iff P ^SAT(C) Proof Q- (If) Assume P 2SAT(C) (Only if) Assume P Q, then P ^SAT(A) diction. B y L e m m a 4.5, Chasec(Q) ^SAT(A) Q but P Q because SAT(C) ~£SAT(C) D SAT (A). Q- We w i l l derive a contra- is 1-1 homomorphic to a subtree of a tree i n SAT(A). Let R C SAT (A) be the satisfying set for Q . Since P 2SAT(A) Q A N D there is a homomorphism from Q to each T S R, there must be a homomorphism from P to each T £ R. If P Q> by L e m m a 4.7, there is no containment mapping from P to ^>SAT(C) C hasec (Q)- It must be the following two cases (See Figure 4.2) (a) single path in P fail to map to any p a t h i n Chasec (Q) (b) each path i n P can map but for some node w which is common ancestor of node u and v, two mapped paths for u and v in Chasec(Q) c a n n ° t be joint on w. Case (a) There is a node x i n P w i t h parent y such that y is mapping to a node i n Chasec(Q) but no mapping from x to any node i n C hasec (Q)- So i n any T G R,x can never be mapped to a core node while y can always be mapped to a core node u. Because P ^SAT(A) QI by L e m m a 4.6, x can always be mapped to a non-core node v i n every T G R. Since b o t h P and Chasec(Q) may contain ad edges, there are two possibilities: (1) y is a pc child of x i n P . Since v is non-core node and x cannot map to a core node, A cannot i m p l y the S C label(u) : bi J, label(v), where bi is the labels of a core child of u . So there must be a tree U £ SAT(A) which has node w w i t h label(u) and child labelled w i t h bi but no child w i t h label(v). B u t we can replace the non-core child subtrees of u by the non-core child subtrees of w and still have a tree T' e R. Node x cannot map to any non-core node i n T ' , a contradiction. (2) y is an ad child of x i n P . Similar as case (1), v is non-core node and x cannot 31 map to a core node, A cannot imply neither the S C label(u) : bj J, label(v) where bj is the label of core child of u, nor the C C label(u) : bk JJ- label(v) where bj is the label of core descendant of u. So there must be a tree U G SAT(A) which has node w w i t h label(u) and descendant labelled w i t h bj but no descendant w i t h label(v). B u t we can replace the non-core child subtrees of u by the non-core child subtrees of w and still have a tree T" G R. Node x cannot map to any non-core node i n T ' , a contradiction. Case(b) Let w n be the common ancestor of u and v that is closest to the root Figure 4.2: Theorem 4.1 P r o o f - Case a i n P such that the p a t h from root to u v i a w n maps to a path i n Chasec(Q) us- ing embedding function f\ and the p a t h from root to v v i a w n maps to a path i n ^ f2(w)(See Figure 4.3). Since Chasec(Q) using embedding function h but f\(w) Chasec(Q) has a single root, node w of w n n caii not be root itself. T h i s means the parent is mapped to the same node WQ i n Chasec(Q) 32 by / i and fc and WQ has a pair of children w i t h the same label as w . n 1. w n There are two possibilities: is a pc child of wo. Hence A cannot imply the FC : label(wo) J, label(w ), n this means WQ can have unbounded number of pc children w i t h label(w ). n Let T be the smallest tree i n set R such that the node i n T corresponding to WQ i n Chasec(Q) has only two child nodes w i t h label(w ). n So there is no homomorphism from P to T , a contradiction. Wi, w . 2. w is an ad child of WQ. Let the subpath P ' from wo to w i n P be wo, n n n Hence A cannot imply y has an unique w descendant by the chain of ICs and F C s from WQ to wn, n — 2) and FC : i.e. IC : label(wi),label(wi 2) + label(wi) i m p l y the PC : label(wo) \ J. label(wi+\) label(w ) n :\ (0 ^ i 4, label(wi+\) (0 < i ^ n — 1). Also A cannot and the FC : label(wo) [ label(w ). n These means WQ may have unbounded number of descendants w i t h label w. n Let T be the smallest tree i n R such the node i n T corresponding to u>o i n Chasec(Q) has two distinct paths to two nodes w i t h label(w ).So n there is no homomorphism from P to T , a contradiction. • B y L e m m a 4.7 and Theorem 4.1, we drive the following: Let A be a choiceless acyclic D T D and C be the set of ICs, P C s , S C s , F C s and C C s constraints implied by A . For XPEV'HA P 2 Chasec 4.3 1> queries P and Q , P 2SAT(A) (Q). Query Answering Using Views Under D T D We first introduce several useful definitions. 33 Q iff / ° w / / I W„ W„ ' »\ I / u /w \ \ 0 / ' 1/ \ \\ V f\ • 1 \ \ 0 \ U V ChasectQ) 7 W \ | \ / / p W D U W n V \ \ T Figure 4.3: Theorem 4.1 Proof - Case b D e f i n i t i o n 4 . 7 ( S o u n d R e w r i t e w . r . t . D T D ) We say that Q is rewritable using V w.r.t. DTD Aif3E\Vt£ SAT(A) : E o V{t) C Q(t) and E is said to be a sound rewrite of Q using V w.r.t. A . (Note E can not be a null expression means 3t e SAT(A) s.t. EoV(t) which ^0). D e f i n i t i o n 4.8 ( U s e f u l E m b e d d i n g w . r . t . D T D ) An embedding f : N{Q) ~> N(V) is said to be a useful embedding w.r.t. D T D A if f is a Useful Embedding and in addition f satisfies the following label(dv) to label(u) constraints: V u ^ dom(f): 3 a path from in A. L e m m a 4.8 Q is rewritable using V w.r.t. DTD A iff Q is rewritable using ChaseciV). Proof It follows from L e m m a 4.4, V E o Chasec{V). Q 2SAT(A) E O =SAT(A) V iff Q rewrite of Q using V w.r.t. A . Chasec(V). ^SAT(A) E O Hence, E oV Chasec(V). =SAT(A) E is a sound • T h e o r e m 4.2 Q is rewritable using V w.r.t. DTD A iff there exists a useful embedding f : Q ~> ChaseciV) w.r.t. A. 34 Proof (Only if) B y L e m m a 4.8, Q is rewritable using V w.r.t. A iff Q is rewritable using Chasec(V). Theorem 3.1 still holds here. Hence, there must exist a useful embedding from Q to Chasec(V). (If): Let / : ChaseciV) be a useful embedding w.r.t. A . We construct E and extend f as follows: V x £ Q s.t. x G dom(f) and 3y s.t. edge(x,y) G Q and y £ dom(f). the subtree rooted at y. D o the following: add a copy T' of T y of d v and define for every node z G T ,f(z) y y Let T y denote as a child subtree = z' where z' is the corresponding node i n T' . If (x,y) all such T s. T h e extended / is the required homomorphism because V x defined i n y y is a pc(ad) edge, then (d ,y') v is a pc(ad) edge. E contains the useful embedding is defined i n / and V x N O T defined i n the useful embedding, f(x) is its image i n the corresponding T . y Q —> E o Chasec(V). f is a valid homomorphism function from is satisfiable w i t h A . Therefore it is a sound E o ChaseciV) rewrite of Q using V. • W h e n acyclic d t d is present, there is an important property that there are no two nodes on any p a t h w i t h a duplicated tag. L e m m a 4.9 is based on this intuition. Lemma 4.9 There exists at most one maximal acyclic DTD A. Proof sound rewrite of Q using V w.r.t. Assume that there are two distinct m a x i m a l rewrite E\ and E which derived 2 from two useful embedding f\ and f , 2 qi G dom(fi) — dom(f ). 2 T h e n there must exist at least one node Since i n acyclic D T D there is no repeated tags on any root to leaf path i n Q and V . It is obvious that C A T is unique. Hence, qi lying on the branching above C A T i n Q. Let /(<&) is defined i n f\ but not defined i n 35 f. 2 Function: get-IC(src,des) Input: DTD nodes src, des Output: return a set of DTD nodes A such that Va G A (IC) src, des :J a. Check whether IC between src and des has been computed already. sult; For each node n { If so, return pre-saved re- 1. block all paths connected A; 2. check whether there is a path from src to des; 3. If it is not, add n to the list A; } Store the result and return A; Figure 4.4: F i n d i n g I C from D T D T h e n there are two cases: \)f(qi) £ D: v tag(qi) can not appear twice on D , v f 2 w i l l not be a sound embedding because a contradiction; 2) f(qi) ^ D: v according to the definition of useful embedding, the whole branching qi lying on must be defined i n f\, then f 2 can not be a m a x i m a l rewrite, a contradiction. Therefore, neither of these two cases can occurs. Hence, it is impossible to exist two distinct m a x i m a l rewrites. 4.4 • Algorithms and Time Complexity In order to chase on view, we need to find those five types of constraints from D T D . Obviously inferring P C , F C , S C can be done i n P T I M E because they only involves parent-child relationship corresponding to one production i n D T D . However, the cost of inferring ICs and C C s are not trivial. To find these two constraints, we use the P T I M E algorithms shown i n Figure 4.4 and 4.5, which use depth-first search i n D T D graph. Therefore, the time complexity to get each of five types of d t d constraints 36 Function: get-CC(src, des) Input: DTD node src, des, and all the other DTD node Output: return a set of DTD nodes A such that Va 6 A: (CC)src, des Q a. Check whether C C between src and des has been computed already. sult; For each node n, which src is its ancestor and des is its cousin{ If so, return pre-saved re- 1. Find all of n's guaranteed ancestors g and block all paths connected g; 2. Check whether there is a path from src to des; 3. If it is not, add n to the list; } Store the result and return A. Figure 4.5: F i n d i n g C C from D T D are at most 0(N ) 2 where N is the number of d t d elements i n A because we use graph depth first search. To improve the efficiency, we save the computation result of each type of constraints so that we w i l l not repeat the computation of same constraint i n the chase procedure when the premise is the same. We next briefly introduce an efficient implementation of the chase i n Figure 4.6, which only scans the (Chased) query tree three times. In each scan, we always start from the edges on the distinguished path of the tree. The Chase Procedure would take time 0((V + N) 2 * N ), 2 where V is the number of nodes i n query and N is the number of elements i n A . T h i s is because finding each constraints from A takes 0(N ) 2 at most; updating query tree each time takes linear time and the number of query node would increase to (V+2N) i n the worst case when I C , C C , S C are involved. F i n a l l y we briefly introduce an algorithm to compute useful embedding. T h e whole algorithm is very similar to the one we present i n previous chapter when schema is not present. However, we can further simplify it based on the result of the uniqueness of the rewriting shown i n L e m m a 4.9. Here are several major modi- 37 Procedure: F a s t C h a s e ( Q u e r y Q , D T D A ) Input: Query Q and D T D A Output: Chasec(Q)For each ad edge e in Q connected two nodes a and b{ 1. If get-PC(a,b, A ) is true, update e as pc edge; 2. Else If list L = get-IC(a,b, A ) is not empty, insert each node c in L between a and b in order and preserve pc, ad obligation as implied by A ; 3. Update Q; } For each edge e in Q connected two nodes a and b{ 1. If e is an ad edge and list B = get-CC(a,b, A ) is not empty, add each node in B as a ad child of a; 2. If e is a pc edge and list C = get-PC(a,b, A ) , get all other children of a which has same tag name as a in list C { (a) If C is not empty, merge all nodes in C with a; (b) Update Q; } } For each pc edge e in Q connected two nodes a and b{ 1. List D =get-SC(a,b, A ) ; 2. If D is not empty { (a) For each node c in D, add c as pc child of a if c is not already present in Q and also keep SC chasing on c until saturation; (b) Update Q; } } Return Q; Figure 4.6: Apply Chase on Q 38 fications i n the algorithm: • V would be replaced by ChaseciV), path of we use D' v to present the distinguished ChaseciV); • In the function map-Dpath and map-To-Dv, we don't need to keep can- didate list anymore. T h e mapping would be unique as shown i n L e m m a 4.9 which implies we intend to map some node n i n Q , and if there is any node m on D' w i t h the same tag, then m is the only candidate to be mapped; v otherwise the mapping fails. • One extra step need to be done: for any node 1 i n Q which is not mapped to V , we need to check whether there is a path from tag(l) to tag(dy) i n A. If it is not, the mapping fails. The time complexity of compute the useful embedding and rewrite i n this case would be P T I M E . We know map-Subtree takes P T I M E , and both m a p - D p a t h and map-To-Dv also take O ( V ) where V is the number of query nodes. Since the useful embedding is unique, we only need one iteration to go through the query tree. 39 Chapter 5 Experimental Results To study the effectiveness of our work, we systematically ran a range of experiments to measure the impact of various parameters. W e focus on testing the schema aware case. In addition to measure savings and overhead, we also measure the scalability when executed over large collections of views and test the performance when the query size varies. We ran our experiments on the X M a r k benchmark dataset[19]. W e con- structed the document of size 1 0 0 M B using the I B M X M L G e n e r a t o r [ 9 ] . W e used W u t k a DTDparser[10] to parse the D T D , which is needed for static analysis of schema. For query evaluation, we use an X Q u e r y engine X Q E n g i n e f l l ] for convenience and flexibility. B o t h tools are open sources developed i n Java. We implemented our tests i n Java as well. Setup: W e ran our experiments on a spare workstation sunning SunOS version 5.9 w i t h 8 processors each have a speed of 9 0 0 M H z and 3 2 G B of R A M . A l l values reported are the average of 5 trials after dropping the m a x i m u m and m i n i m u m , observed during different workloads. 40 • Simple Selection Query QI: for $a in doc("auction.xml")/site[//person][//quantity]//itemref where $a/@item >= "item20" return <result> {$a} </result> VI: < view> {doc(" auction.xml")/site[//profile] / / open-auction [privacy]//itemref} </view> R l : <result> {for $a in doc("view.xml")/view/itemref[@item>= "item20"] return $a} <result> VI': <view>{doc(" auction.xml" )/site[//person/@id][//privacy]//itemref/@item}</view> • Complex Selection Query Q2: for $p in doc("auction.xml" )//people/person[//profile[gender/text()="female"][interest]]//address where $p/country/text()= "United States" and $p/province/text() = "Maryland" return <result> {$p/city}</result> V2:<view>{doc(" auction.xml" )//person[[//profile/gender/text()= "female"][profile/interest][//country/text() = "United States"]] /address }</view> R2: for $p' in doc("view.xml")/view/address where Sp'/province/text() = "Maryland" return <result>$p'/city</result> V2': doc(" auction. xml")//person[[//profile/gender/text()='female'] [//country/text ()= Sates" ]] / address "United Figure 5.1: Selection Queries and Views on auction.dtd 5.1 Q u e r y Set We r u n the tests over the queries and views listed i n Figure 5.1 and Figure 5.2. XQueries are labelled w i t h initial " Q " , useful X p a t h views are labelled w i t h initial " V " . These views are used to test saving ratio while their primed variants are useless views which are used to test overhead ratio. A l l rewritings using given views are equivalent to the original query result. We give the formal definition of saving and overhead ratio i n the next section. Before we show the experimental results, we explain how to set up our experiment to apply our technique of answering X P a t h queries using X P a t h views to solve the problem of rewriting XQueries using X P a t h views by using Q and given 3 41 Simple Join Query Q3: for $t in doc(" auction.xml")/site//closed_auctions/closed_auction[annotation] , $p in doc("auction.xml")//regions/europe[//description//text/bold]/item where $t/itemref/@item=$p/@id and $t/price/text( ) >= "100" return <result> $t/itemref</result> V3a: <view>{doc("auction.xml")//closed-auction[price/text() >= " 100"]//itemref }</view> V3b: <view> {doc("auction.xml")//europe[//description//bold]//item} </view> R3: for $t' in doc("view3a.xml")/view/iternref, $p' in doc("view3b.xml")/view/item where St'/Oitem = $p'/@id return <result>$t'</result> V3': doc("auction.xml")/europe[//text]//item Complex Join Query Q4: for $p in doc("auction.xml")/site/people/person, $t in doc("auction.xml")/site/closed_auctions[//annotation], $t2 in doc("auction.xml")/site/regions/europe/item where $p/@id = $t/closed.auction/buyer/@person and $t/closed.auction/happiness/text() >="0.6" and $t/closed_auction/itemref/@item = $t2/@id return <result> {$t2/name/text()} {$p/name/text()} </result> V4a: <view>{doc("auction.xml") //person[profile/gender/text() = "female"]}</view> V4b: <view>{doc("auction.xml")//closed_auction}</view> V4c: <view>{doc("auction.xml")/europe/item }</item> R4: for $p in doc("view4a.xml")/view/person, $t in doc("view4b.xml")/view/closed_auction, $t2 in doc("view4c.xmP)/view/item where $p/@id = $t/buyer/@person and $t/itemref/@item = $t2/@id and $t/happiness/text() >="0.6" return <result> {$t2/name/text()} {$p/name/text()} </result> V4c': doc(" auction.xml")/regions[//text]/item Group By Query Q5: for $p in doc("auction.xml")/site/people/person[//age/text()>="40"] let $1 := for $i in doc(" auction.xml")/site/open_auctions[//privacy]/open_auction where $p/profile/@income > 5000 * $i/initial/text() return $i- return <items>{$l//itemref}</items> <person> {$p/name} < / person> V5a: <view>{doc("auction.xml") //person[//age/text() >= "40"]}</view> V5b: <view>{doc("auction.xml")/site[//privacy]//open_auction}</view> R5: for $p in doc("view5a.xml")//person let $1 := for $i in doc("view5b.xml")//open_auction where $p/profue/@income > 5000 * $i/initial/text() return $i return <items>{$l//itemref}</items> <person > { $p/name} </person> V5b': doc("auction.xml")/site//person[//age/text() <= "35"] Figure 5.2: Join/Group By Queries and Views on auction.dtd 42 views i n Figure 5.2. Step 1: G i v e n a query i n X Q u e r y expression and a set of views i n X P a t h expression, we build a generalized tree p a t t e r n ( G T P ) [4] for each independent variables declared i n F O R or L E T clauses i n Q. A variable is "independent" when its declaration is directly related w i t h document. In Qz, both $t and $p are independent variables. So we b u i l d two separate trees, one for $t and the other for $p. Step 2: M a r k the interest nodes and return nodes i n each tree. We capture all nodes appearing i n W H E R E and R E T U R N clauses associated w i t h each independent variable i n its G T P . R e t u r n nodes and those involved i n join predicate must be reachable from the distinguished node of the useful view. Figure 5.3 shows two trees Q and Q t p constructed for Q 3 . Step 3: E a c h G T P can be represented as a T P Q which is equivalent to a X P a t h expression Q ' . W e test Q ' against a single view each time. If the view is usable, we search the useful embedding and compute the rewriting. In the example, V 3 a is usable for Q and V 3 b is usable for Q . t p T h e rewritings are the following: • R : doc("view3a.xml")/view/itemref t • R: q doc("view3b.xml")/view/item Step 4: After we obtain rewriting, we w i l l replace the declarations for each variable w i t h the corresponding rewriting and reassemble the query based on the structure i n original query to give the final rewriting. 43 Qt Qp /site //regions closed auctions curope closed auction annotation item itemref description @id text( )>= "100" @item text JOIN VBC bold RETURN Figure 5.3: G T P s built for Q3 Here is the final rewriting for Q3 using V3a and V3b: for $t' in $p' in doc("view3a.xml")/view/itemref, doc("view3b.xml")/view/item where $t'/@item return = $p'/@id <result>$t'</result> 44 5.2 Savings and Overhead on Queries Answering using Views Let e be the time taken to evaluate the original query over the document. Let c be q c the time taken to determine whether a given view isuseful for rewriting the query and let Cj. be the time to compute the rewrite using the useful views and let e r be the time it takes to evaluate the rewrites over the materialized view documents. T h e saving ratio SQ obtained by using the usability check procedure on useful views is defined as SQ = c s + c r+ r _ e T h e overhead ratio OQ obtained by using the usability check procedure on useless views is defined as OQ = + i. Cc e Intuitively, the closer to 0 the saving ratio is the better and the closer to 1 the overhead ratio is the better. 5.2.1 Savings on useful views Figure 5.4 shows the saving ratio w i t h the same document size for the five queries Q1-Q5 using their corresponding useful views. We expect the saving ratio to be close to 0 because the computation time of rewriting is very small. If the document size of each view is less than 1/3 of the size of the original database then the evaluation of the query rewriting is much faster than original query evaluation. T h i s is exactly what happened i n the experiments. 5.2.2 Overheads on useless views Figure 5.5 shows the overhead ratio w i t h same document size for the five queries Q1-Q5 using their corresponding useless views. We expect the overhead ratio to be very close to 1 because the computation time of checking embedding is very small. T h e result shows the overhead is a negligible fraction compare to the query 45 Usable Views - auction.dtd •0.1.2 0.10 - 1 0.08 CC. ra 0.06 • c 1 m Saving 0.040.02 0.00QAV1 QAV2 QAV3 QAV4 QAV5 Query & Views Figure 5.4: Saving R a t i o - Useful Views evaluation time. 5.2.3 Various number of views Now we test how the performance is when the number of views varies from 1 to 100 and none of view is useful. Figure 5.6 shows the overhead ratio of Q1-Q3 when the number of useless views varies from 1 to 100. Figure 5.7 shows the saving ratio of Q3-Q5 when the number of views varies from 5 to 100. E a c h of Q3-Q5 needs multiple views to rewrite the query. We design this test i n such a way that there is only one useful view and the others are all useless views. In the rewriting we access original database if no view can be used to extract the required information. 46 Unusable Views - auction.dtd 1.008 0 1.006 | 1.004 | 1.002 ] i Overhead o 1.000 > 0 0.998 JUKI 0.996 Q-V'1 Q-V'2 Q-V'3 Q-V'4 Q-V'5 Query Q and Views V Figure 5.5: Overhead R a t i o - Useless Views 5.2.4 Varying query size Figure 5.8 and Figure 5.9 show the saving ratio and the overhead ratio of a simple join query Q when the query size varies from 5 to 50. W e increase the query size by adding more query nodes and value based constraints. W h e n we test saving, the exact number of useful views are provided. W h e n we test overhead, the exact number of useless views are given. We found both ratio d i d not vary much as the query sizes changed. T h i s may result from the fact that the more complex the query is, the more query evaluation time is required i n general although the rewrite computation and evaluation time or the embedding checking time would take longer, the ratio would remain at the same level. 47 Overhead Ratio vs. Various Number of Views 1.25 T Figure 5.6: Overhead R a t i o - Various Number of Useless Views Saving Ratio v s . Various Number of Views 0.8 -, 0.7 I 0.3 - in 0.2 0.1 0 5 10 20 40 60 Number of Views 80 100 Figure 5.7: Saving R a t i o - Various Number of Useless Views 48 Saving Ratio vs. Various Query Size 0.16 0.14 o 0.12 % oc 0.1 in 0.08 > 0.06 °> 0.04 0,02 0 I 10 20 30 40 50 Number of Query Predicates Figure 5.8: Saving Ratio - Various Query Size Overhead Ratio vs Various Query Size 1.008 i 1.007 1 1.006 -1 i t is 1.005 | t 1.004 / X 1.003 o 1.002 \ 1.001 i \ 10 20 30 AO 50 Number of Query Predicates Figure 5.9: Overhead R a t i o - Various Query Size 49 Chapter 6 Related Work X P a t h query containment is close related to the use of materialized views i n answering query. T h i s relation provides a necessary condition for designing and testing sound algorithm for query rewriting using views. There has been much work on query containment and minimization of various X P a t h fragments [1, 5, 14, 15]. Containment checking of X P a t h queries, i n the absence of constraints, containment is i n P T I M E for XPV<//A 1> as shown i n [1], while it is proven to be C O N P - c o m p l e t e for XP{H'\-1'*} i n [14]. Containment under constraints is shown to be undecidable i n [5], when X P ^ / ' l 1'*} is allowed along w i t h disjunction, variable binding and equality testing, and the bounded/unbounded simple X P a t h integrity constraints(SXICs) implied by D T D . A comprehensive study of the complexity of containment of X P a t h fragments under D T D constraints are given i n [15]. The most relevant work is [18], which shows that containment is decidable for XPVI'W'*} when the constraints are D T D s . The same paper also identifies XP^ D for which containment under duplicate-free D T D s can be decided i n P T I M E . In our work, we consider a richer subset of X P a t h queries, including descendant edges under choice-free acyclic D T D s 50 and provide P T I M E algorithm to decide the containment problem. Query answering using materialized views for X M L is recently studied i n [2], where they propose a framework for using X P a t h views i n X M L query processing i n the absence of schema. However, there are important differences i n the contribution of the two papers, as we explain i n detail below. T h e major contribution of [2] was presenting an X P a t h matching algorithm to determine certain class of views which can be used to answer query containing X P a t h expression and construct compensation expressions to be applied on views to produce the query result without schema knowledge. T h e y explored a class of materialized X P a t h views, which may contain a combination of X M L fragments, typed data values, full paths and node references. T h i s means that the users may access to the original database when necessary and the goal is to obtain equivalent results between evaluating query and applying a compensation expression on views. B y contrast, we target different applications where the original database is no longer available to the users and it is replaced by as a set of materialized X P a t h views. Therefore our effort is to produce maximally contained results instead of equivalent results, depending on the given views. More importantly, we classify a class of X P a t h fragment and D T D s for which we provide an efficient algorithm to decide whether a view is useful for query rewriting and compute the rewrite when it is possible under D T D constraints. In the experiment, we also illustrate the possible use of our work to answer XQueries. To the best of our knowledge, the problem we study here is not addressed i n the literature. 51 Chapter 7 Conclusion W h i l e there has been considerable work on query answering using views i n relational world, the same problem has not been extensively studied for X M L . We developed a method for testing the usability of X P a t h view for answering X P a t h / X q u e r y queries. We study this problem both w i t h and without a schema and identify cases i n which it is E X P T I M E and when it is P T I M E . In the latter case, we developed efficient algorithms based on a chase procedure and containment mapping. W e complemented our analytical results w i t h an extensive set of experiments. Our study i n the presence of database schema is confined to schema without cycles and choices. In the presence of either of them, the reasoning becomes considerably more complex. It would be interesting to determine whether the techniques proposed here can be extended to solve this problem efficiently when there are cycles and/or choices i n the database schema. T h e other direction is to consider more complex X P a t h queries invloving join, wildcards, etc. 52 Bibliography Sihem A m e r - Y a h i a , SungRan C h o , Laks V . S. Lakshmanan, and Divesh Srivastava. M i n i m i z a t i o n of tree pattern queries. In ACM SIGMOD Conference, pages 497-508, 2001. A n d r e y B a l m i n , F a t m a Ozcan, K e v i n S. Beyer, Roberta Cochrane, and H a m i d Pirahesh. A framework for using materialized x p a t h views i n x m l query processing. In VLDB, pages 64-71, 2004. Surajit Chaudhuri, R a v i Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. O p t i m i z i n g queries w i t h materialized views. In ICDE, 1995. Z h i m i n C h e n , H . V . Jagadish, Laks V . S. Lakshmanan, and Stelios Paparizos. F r o m tree patterns to generalized tree patterns: O n efficient evaluation of xquery. In VLDB, 2003. A l i n Deutsch and V a l Tannen. Containment and integrity constraints for xpath. In KRDB, 2001. Jonathan Goldstein and Per ke Larson. O p t i m i z i n g queries using materialized views: A practical, scalable solution. In SIGMOD Conference, 2001. A l o n Y . Halvy. Answering queries using views: A survey. In Journal VLDB J. Volume 10 Number 4, pages 270-294, 2001. A l o n Y . Halvy, Zachary Ives, Peter M o r k , and Ignor Tatarinov. P i a z z a : D a t a management infrastructure for sematic web applications. In WWW, pages 556567, 2003. I B M . http: / / www. alphaworks.ibm.com/tech/xmlgenertor. W u t k a Consulting Inc. Howard K a t z . http://www.wutka.com/dtdparser.html. http://xqengine.sourceforge.net. 53 [12] Laks V . S. Lakshmanan, Ganesh Ramesh, H u i W a n g , and Zheng Zhao. O n testing satisfiability of tree pattern queries. In VLDB, pages 120-131, 2004. [13] A l o n Y . Levy, A l b e r t o O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In PODS, 1995. [14] Gerome M i k l a u and D a n Suciu. Containment and equivalence for an xpath fragment. In PODS, pages 65-76, 2002. [15] Frank Neven and Thomas Schwentick. X p a t h containment i n the presence of disjuction, dtds, and variables. In ICDT, [16] W 3 C . XML Path pages 315-329, 2003. Language: XPath Version 1.0. http://www.w3.org/TR/xpath. [17] W 3 C . X q u e r y 1.0: A n x m l query language. h t t p : / / w w w . w 3 . o r g / T R / x q u e r y . [18] Peter W o o d . Containment for x p a t h fragments under d t d constraints. In ICDT, pages 300-314, 2003. [19] X M a r k . X m a r k an x m l benchmark project, http://monetdb.cwi.nl/xml. [20] Markos Zaharioudakis, Roberta Cochrane, George Lapis, H a m i d Pirahesh, and M o n i c a U r a t a . Answering complex sql queries using automatic summary tables. In SIGMOD Conference, 2000. 54
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Query answering using views for XML
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Query answering using views for XML Zhao, Zheng 2005
pdf
Page Metadata
Item Metadata
Title | Query answering using views for XML |
Creator |
Zhao, Zheng |
Date Issued | 2005 |
Description | The problem of answering query using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database. As XML becomes the standard of data representation and exchange over the internet, the problem has recently drawn more attentions because of its relevance to a wide varieties of XML data management problems, there is a pressing needs to develop more techniques to solve it for XML data effectively and efficiently. We study a class of XPath queries and materialized views which may contain child, descendant axis and predicates. We first describe an algorithm to find the maximally-contained rewritings in the absence of database schema. We then present an efficient algorithm to search the maximally-contained rewriting under choice-free acyclic schema and prove the uniqueness of the maximally-contained rewriting. Finally we show its performance experimentally by extending our algorithm to answer queries in XQuery expression. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2009-12-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051566 |
URI | http://hdl.handle.net/2429/16451 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2005-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2005-0349.pdf [ 2.12MB ]
- Metadata
- JSON: 831-1.0051566.json
- JSON-LD: 831-1.0051566-ld.json
- RDF/XML (Pretty): 831-1.0051566-rdf.xml
- RDF/JSON: 831-1.0051566-rdf.json
- Turtle: 831-1.0051566-turtle.txt
- N-Triples: 831-1.0051566-rdf-ntriples.txt
- Original Record: 831-1.0051566-source.json
- Full Text
- 831-1.0051566-fulltext.txt
- Citation
- 831-1.0051566.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051566/manifest