UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

From tree patterns to generalized tree patterns : on efficient evaluation of XQuery Chen, Zhimin 2003

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-ubc_2004-0188.pdf [ 3.18MB ]
Metadata
JSON: 831-1.0051223.json
JSON-LD: 831-1.0051223-ld.json
RDF/XML (Pretty): 831-1.0051223-rdf.xml
RDF/JSON: 831-1.0051223-rdf.json
Turtle: 831-1.0051223-turtle.txt
N-Triples: 831-1.0051223-rdf-ntriples.txt
Original Record: 831-1.0051223-source.json
Full Text
831-1.0051223-fulltext.txt
Citation
831-1.0051223.ris

Full Text

F r o m Tree Patterns to Generalized Tree Patterns: Efficient Evaluation of X Q u e r y i>.y Zhimin Chen B.S., Zhongshan University, China, 1994 A THESIS S U B M I T T E D IN PARTIAL F U L F I L L M E N T OF T H E R E Q U I R E M E N T S FOR T H E D E G R E E OF M a s t e r of Science in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The Universi ty of Br i t i sh Columbia August 2003 © Zhimin Chen, 2003 Library Authorization In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Name of Author (please print) Date (dd/mm/yyyy) Title of Thesis: r>oi \ \vtC P A ttef >i s to 6*£<ij£fakze<l p A t f P N W On irff-l iYeat- (TvaUfafyt C f X iWw Degree: AW(a cj' >cu'i(e Y ear: loc^j-Department of ( gH^it^"^ v>(-" 'uc* The University of British Columbia Vancouver, BC Canada A b s t r a c t XQuery is the de facto standard X M L query language, and it is important to have efficient query evaluation techniques available for it. It is a well-known fact that a formal bulk algebra is essential for efficient query evaluation, and the Tree Algebra for X M L (TAX), among others, is invented for this purpose. It can be shown in this thesis that a substantial subset of XQuery can be expressed as T A X . An X M L document is often modelled as an ordered label tree. A core opera-tion in the evaluation of XQuery is the finding of matches for specified tree patterns against the data tree (or forest), and there has been much work towards algorithms for finding such matches efficiently. Multiple XPath expressions can be evaluated by computing one or more tree pattern matches. However, because of the flexibility of X M L data, the efficient evaluation of XQuery queries as a whole is much more than a tree pattern match and combining matchings of multiple tree patterns is not the most efficient evaluation plan for XQuery. In this thesis a structure called generalized tree pattern (GTP) is proposed to concisely represent a whole XQuery expression. Evaluating a query reduces to finding the matches of its GTP, which leads to more efficient evaluation plans. Algorithms are developed to translate an XQuery expression, possibly involving join, quantifiers, grouping, aggregation and nesting, to its GTP, and to generate ii efficient physical plans for a specified GTP. X M L data often conforms to a schema. Relevant constraints from the schema give rise to further opportunities to optimize queries. Algorithms are given in the thesis to automatically infer structural constraints from a given schema and to sim-plify a G T P given a set of structural constraints. Finally, a detailed set of experi-ments using the T I M B E R X M L database system shows that plans via GTPs (with or without schema knowledge) significantly outperform plans based on navigation and straightforward plans obtained directly from the query. m Contents Abstract ii Contents iv List of Figures vii Acknowledgments ix Dedication x 1 Introduction 1 1.1 X M L 1 1.2 Tree Pattern Query 2 1.3 XQuery 3 1.4 Challenges In Translating XQuery Into TPQs 4 1.5 Generalized Tree Pattern 5 1.6 Problem Statement and Contributions 6 1.7 Related Work 7 1.8 Chapters Overview 8 iv 2 Background 10 2.1 Tree Algebra For X M L 10 2.1.1 Basic T A X Operators 11 2.2 Physical Algebra 13 2.2.1 Physical Operators 13 3 TAX Translation For XQuery 16 3.1 Grammar For XQuery Fragment 16 3.2 Normalization 17 3.2.1 XQuery in Canonical Form 17 3.2.2 XQuery Normalization 18 3.3 Derived T A X Operators 23 3.4 Translation 24 3.4.1 Single Block XQuery Translation . '24 4 Generalized Tree Patterns 38 4.1 Basic GTPs 38 4.2 Join Queries 41 4.3 Grouping, Aggregation, and Quantifiers 42 4.4 Nested Queries 45 4.5 Translating XQuery to G T P 47 4.6 Translating G T P Into an Evaluation Plan '. 49 5 Schema-Aware Optimization 57 5.1 Logical Optimization 57 5.1.1 Avoidance Constraint and Internal node elimination 58 5.1.2 Identified Constraint and Identifying Nodes With the Same Tag 58 v 5.1.3 Leaf elimination and Emptiness detection 59 5.1.4 G T P Simplification . 59 5.2 Physical Optimization 64 5.3 Constraint Inference 66 5.3.1 Regular Tree Grammar 66 5.3.2 Inferring Avoidance Constraints 66 5.3.3 Inferring Quantifier Constraints 67 6 Experiment 72 6.1 Experiment Setting 72 6.2 G T P Plans and T A X Plans 73 6.3 Scalability 75 6.4 Schema-Aware Optimization 76 7 Conclusion and Future Work 77 Bibliography 79 vi Lis t of Figures 1.1 Two Example Tree Pattern Queries: (a) P i and (b) P 4 . Single (dou-ble) edge represents parent-child (ancestor-descendant) relationship. 2 1.2 An Example XQuery query and corresponding Generalized Tree Pat-tern Query. Solid (dotted) edges = compulsory (optional) relation-ship. Group numbers of nodes in parentheses 6 3.1 Grammar for XQuery Fragment 16 3.2 Example of normalization. Qa is an example query, Qb is the result after applying Lemma 3.1 to Qa to remove filters, Qc is the result after applying Lemma 3.2 to Qb, and Qd is the result of applying Lemma 3.3 to Qc 21 3.3 An example query and the translated T A X expression 35 3.4 An example disjunctive query and the translated T A X expression. . 36 3.5 An example nested query and the translated T A X expression 37 4.1 Extension to handle 'undefined? truth value 39 4.2 (a) Sample X M L data, (b) Pattern matches of G T P of Fig. 1.2(b). The numbers in (b) are the startPoss of the nodes in (a) numbered in the interval encoding scheme [2] 40 vii 4.3 A query involving join and corresponding G T P 42 4.4 A n example universal query and corresponding universal GTP. . . . 44 4.5 An Example query with nesting & join and corresponding GTP. . . 45 4.6 Algorithm GTP 48 4.7 Physical Plan from the G T P of Figure 1.2(b) 52 4.8 Physical Plan from the G T P of Figure 4.5(b) 54 4.9 Algorithm planGen 56 5.1 Algorithm pruneGTP 60 5.2 A sample XQuery query, the associated GTP, and the simplified G T P 61 5.3 Algorithm c-occurrences 68 5.4 Algorithm d-occurrences 71 6.1 Queries Qa, Qb, Qc 73 6.2 C P U timings (sees) for XMark factor 1. Algorithms used: B A S E — Baseline plan, G T P = G T P plan, SCH = G T P with schema opti-mization. The queries are XMark queries (XM5, XM20, . . . ) and the queries (Qa, Qb, Qc) seen in Figure 6.1 74 6.3 C P U timings (sees). Using G T P with no schema-aware optimization or value index 75 6.4 C P U timings (Comparison of G T P and G T P with schema optimiza-tion plans for XM5, Qa and XM20 76 viii Acknowledgments This thesis would not have been possible without the helps from many people. First and foremost, I am very grateful to Dr. Laks V.S. Lakshmanan, my supervisor, for his great guidance and support throughout this thesis work. I would also like to thank Dr. Raymond T. Ng and Dr. Anne Condon for being in my thesis committee and for their valuable advices and comments. Other thanks are due to Dr. George Tsiknis for reviewing my thesis and providing much advice and feedback. I have had the pleasure of working with all members of the database lab over the years, and owe a lot of thanks for their help and suggestion. Last but not the least, I owe a great deal of thanks to the T I M B E R research group in University of Michigan, especially to Dr. H.V. Jagadish and Stelios Paparizos, for all their help in this thesis work. In particular, the experimental results reported in this thesis was conducted on the T I M B E R system at University of Michigan by Stelios Paparizos. Z H I M I N C H E N The University of British Columbia August 2003 ix To my parents x Chapter 1 Introduct ion 1.1 X M L X M L [4] is the de facto standard format for data exchange. Syntactically, a well-formed X M L document is often modelled as a labelled rooted ordered tree. The W3C's D O M API [5] manifests this model. Basically there are four kinds of node in a data tree: a document node for the document itself which is the root for the data tree, an element node for every element in the document, an attribute node for every attribute specified in every element, and text nodes which can be viewed as the values of elements. Some attributes (for instance the IDREF attributes) may have special semantics of acting as hyperlinks within a tree or even crossing the trees, and therefore the semantic data model of an X M L document may become a rooted directed graph. Nevertheless, in a formal query algebra such as T A X , these attributes are treated just like other common attributes. A query involving navigation along the IDREF links can be expressed as predicates constrained on the value of the ID and IDREF attributes, though the implementation may choose some 1 S p S p . t a g = p e r s o n & $ s . t a g = s t a t e & $ l . t a g = p r o f i l e & $ g . t a g = a g e & S g . c o n t e n t > 2 5 & S s . c o n t e n t ! = ' M l ' S p . t a g = p e r s o n & S w . t a g = w a t c h e s & $ t . t a g = w a t c h Figure 1.1: Two Example Tree Pattern Queries: (a) Pi and (b) P4. Single (double) edge represents parent-child (ancestor-descendant) relationship. special data structure (join index, for instance) to speed up the evaluation of such predicates. Thus, conceptually this labelled-rooted-ordered-tree data model can be used without loss of generality, and is adopted within this thesis. 1.2 Tree Pat tern Query With the rapid adoption of X M L as an alternative data model, querying X M L data attracts considerable research efforts. In most proposed X M L query languages, including XQuery [8], which is considered as the de facto standard of X M L query language, query is performed by binding variables to the data nodes of interest. The abstraction for specifying the bindings between variables and data nodes is the so-called tree pattern (query) (TP(Q)), which is a tree T with nodes labelled by variables, together with a boolean formula F specifying constraints on the nodes and their properties, including their tags, attributes, and contents. The tree consists of two kinds of edges - parent-child (pc) and ancestor-descendant (ad) edges. Figure l.l(a)-(b) shows example TPQs; in (b), we call node %w an ad-child of $p, and a pc-parent of $t. 2 The semantics of a T P Q P — (T, F) is captured by the notion of a pattern match - a mapping from the pattern nodes to nodes in an X M L database such that the formula associated with the pattern as well as the structural relationships among pattern nodes is satisfied. The T P Q in Figure 1.1(a) (if applied to the auction.xml document of the XMark benchmark [29]) matches person nodes that have a state subelement with value ^ 'MP and a p r o f i l e with age > 25. The state node may be any descendant of the person node. Viewed as a query, the answer to a T P Q is the set of all node bindings corresponding to valid matches. The central importance of TPQs to X M L query evaluation is evident from the flurry of recent research on efficient evaluation of TPQs [31, 2, 11]. 1.3 X Q u e r y XQuery [8] is the recommended query language by W3C to retrieve information from many types of X M L data sources, whether they store data physically in X M L or expose a view as X M L via middleware. One characteristic in the data model of XQuery is that every expression is evaluated to a sequence. The order of the sequence is by default the one obtained by a depth first traversal of the document, or a user-defined order specified in an OrderBy clause. The basic construct of XQuery expression is the so-called F L W O R expression. The semantics of the FLWOR expression can be informally illustrated by a simple query against the auction.xml document from the XMark project [29] shown in Figure 1.2(a). The for clause generates a sequence of person elements and binds $p to each element in turn, and then for each binding of $p, binds $1 to each child p r o f i l e element in turn. The result of the for clause is a stream of pairs of 3 bindings of $p and $1. The where clause filters the stream by retaining only the pairs which represent a person who is not living in the Michigan state and is older than 25. The return clause constructs a new resu l t element for each surviving pair, which contains a sequence of watches elements and a sequence of in te res t elements generated from the bindings in that pair. The result of the whole FLWOR expression is a sequence generated by concatenating all the newly constructed resu l t elements. The l e t clause binds a variable to all elements in a sequence as a whole. In contrast, the for clause binds a variable to each element in a sequence by iterating the sequence in turn. The optional OrderBy clause orders the surviving tuples of bindings generated by the for and l e t clause according to some criteria specified by the user. 1.4 Challenges In Translating X Q u e r y Into T P Q s While XQuery expression evaluation includes the matching of tree patterns, and hence can include T P Q evaluation as a component, there is much more to XQuery than simply TPQ. In particular, the possibility of quantification in conditions (e.g., E V E R Y ) , the possibility of optional elements in a return clause, and the many different forms of return results that can be constructed using just slightly differing XQuery expressions, all involve much more than merely obtaining variable bindings from a T P Q evaluation. Combining query results from multiple TPQs to answer an XQuery query results in extra pattern matchings and extra joins which are not necessary. Although a smart optimizer can possibly eliminate these unnecessary pattern matchings and joins, it is unclear how to do it in a complete and systematic way. 4 1.5 Generalized Tree Pat tern To facilitate an efficient evaluation of XQuery queries, the notion of generalized tree pattern (GTP) is proposed in this thesis work. Intuitively, a G T P provides an abstraction of the work that needs to be done toward query evaluation, and provides clues for doing this work while making as few passes over the input data as possible. As a preview, Figure 1.2(b) shows a sample query (against the auction.xml document of the XMark benchmark [29]) as well as the associated (rather simple in this case) GTP. The G T P has solid and dotted edges. Solid edges represent mandatory relationships (pc or ad) just like edges of a T P Q . Dotted edges denote optional relationships: e.g., $ i optionally may be a child of $1, and $w optionally may be a descendant of $p. The G T P can be informally understood as follows: (1) Find matches for all nodes connected to the root by only solid edges. (2) Next, find matches to the remaining nodes (whose path to the G T P root involves one or more dotted edges), if they exist. We will show later that this semantics of GTPs is a key feature enabling them to be used as a basic data structure with which to obtain all valid bindings necessary to answer an XQuery query. In particular, G T P can be used to answer a complex query involving quantifiers, grouping, aggregation, and nesting. In the experimental study section (Chapter 6), we will compare plans from translating XQuery into G T P and then into physical algebraic expressions with the plans from translating XQuery into a sequel of TPQs and then into physical algebra. We will show that the G T P plans always far outperform the T P Q plans, by an order of magnitude in most cases. We will also demonstrate the savings obtained by incorporation of schema knowledge into the G T P plans in query optimization. 5 FOR $p IN document("auction.xml")//person, $1 IN $p/profile WHERE $l/age > 25 AND $p/state != 'MI' RETURN <result>{$p//watches/watch}{$l/interest} </result> (a) $p (0) Sp.tag = person & Ss.tag = state & Sl.tag = profile & Si.tag = interest & Sw.tag = watches & St.tag = watch & Sg.tag = age & Sg.content > 25 & Ss.content != 'Ml' (0) $s< (1) St 1 (0) > $1 $g (0) $i (2) (b) Figure 1.2: An Example XQuery query and corresponding Generalized Tree Pattern Query. Solid (dotted) edges = compulsory (optional) relationship. Group numbers of nodes in parentheses. 1.6 Prob lem Statement and Contributions This thesis aims at the problem of XQuery evaluation in the setting of native X M L DBMS: Given a (restricted) F L W R expression, our method will translate it into an evaluation plan expressed in a physical algebra typically available in a native X M L DBMS, and optimize the plan with or without schema knowledge. The main contributions of this thesis include the following: • Show that an XQuery F L W R expression can be translated into a sequel of TPQs expressed in a formal bulk algebra for X M L (Chapter 3) • Propose the notion of generalized tree pattern (Chapter 4) • Give an algorithm for translating an XQuery F L W R expression into a G T P (Chapter 4) 6 • Give an algorithm that translates a G T P into an equivalent plan in the physical algebra used in TIMBER, a native X M L database (Chapter 4) • Show how schema knowledge can be exploited to remove redundant parts of the GTP, and to eliminate unnecessary operators in the physical query plan (Chapter 5) • Give algorithms to extract the relevant knowledge from a D T D (Chapter 5) 1.7 Related Work There are three major approaches to X M L data management and query evalua-tion: the navigation-based, the relational, and the native approach. Galax [24] is a well-known example of a navigation-based XQuery query processing system. The major issue of the navigational approach is that it results in an implementation as evaluating a query as a series of nested loops, whereas a more efficient evaluation plan is frequently possible. Relational approaches to XQuery implementation include [13, 25, 10, 30, 23], while [3] uses an object-relational approach. The core of the relational approach is to map X M L data to relational schemas and to translate the queries into SQL. Because of the mismatch of the rich structure of the X M L data model and the rigid tabular relational data model, the translated SQL queries usually are complex correlated queries involving many joins and even more expensive recursions. Most of the relational implementations have focused on a restrictive subset of XQuery, with the exception of the "dynamic intervals" paper [10], in which algorithms were given to translate a broader subset of XQuery, similar to that used in this thesis work, into SQL. 7 Some examples of native approaches to X M L query processing include Natix [12], Tamino [22], and T I M B E R [16]. Most previous work on native XQuery imple-mentation has focused on efficient physical placement of X M L data at the page level [12], efficient evaluation of XPath expressions via structural join [2] and holistic join [11], and optimal ordering of structural joins for pattern matching [14]. T I M B E R [16] makes extensive use of structural joins for pattern match, as does the Niagara system [21]. We are not aware of any papers focusing on optimization and plan generation for XQuery queries as a whole for native systems. Recently, there has been much interest in optimizing (fragments of) XPath expressions by reasoning with TPQs or irs variants, possibly making use of any available schema knowledge [32, 19, 28]. GTPs enable similar logical optimization to be performed for XQueries as a whole, with or without schema knowledge. Sec-ondly, previous work on schema-aware optimization for X M L queries (e.g., [28]) only focused on XPath fragments. In this thesis we have developed the initial ideas for a similar exercise for XQuery and reported results from our experiments suggesting when schema-based optimization can (not) be expected to pay off big time. We are not aware of similar results from previous work. 1.8 Chapters Overview The remainder of this thesis is organized as follows. Chapter 2 reviews the Tree Algebra for X M L (TAX) and the physical operators in the T I M B E R system. Chap-ter 3 discusses the algorithm to translate XQuery into T A X . Chapter 4 develops G T P for variant constructs of XQuery. Chapter 5 discusses logical and physical optimizations by applying schema knowledge to GTP, and gives algorithm to ex-tract relevant schema information for optimization from DTD. Chapter 6 focuses 8 on the experimental studies of the performance of evaluation plan and the optimiza-tions facilitated by GTP. Chapter 7 summarizes the thesis work and discuss future extension to GTP. 9 Chapter 2 Background 2.1 Tree Algebra For XML It is a long established fact that a formal bulk algebra which processes data in a set-at-a-time fashion is essential to efficient processing on large databases. The algebra for XQuery proposed by the W3C XQuery working group [9], though is useful in investigating the formal semantics of XQuery, yet appears unlikely to form the basis of efficient evaluation of XQuery because of its lack of set-at-a-time processing capability. Consequently, we choose the Tree Algebra for X M L (TAX) [15] as the basis to study query translation and optimization for XQuery in this thesis work. T A X is a bulk algebra similar to relational algebra augmented with aggre-gation, except that each T A X operator manipulates sets of ordered labelled trees instead of set of tuples. Two concepts in T A X , namely pattern tree and witness tree, are key to manipulating trees. Their formal definitions [15] are given as follows. Definition 2.1 (Pattern Tree). A pattern tree is a pair P = (T,F) where T = (V, E) is a node-labelled and edge-labelled tree and F is a boolean formula such 10 that: (i) each node in V is labelled by a distinct integer; (ii) each edge in E is labelled either as pc (for parent-child) or ad (ancestor-descendent); and (iii) F is a boolean combination of predicates applicable to nodes. Definition 2.2 (Embedding). Let V = {T,F) be a pattern and C a collection of trees. A n embedding of V into C is a total mapping h : V —> C from the nodes of V such that: (i) h preserves the structural relationships in V, i.e., whenever h is defined on nodes u, v and there is a pc (ad) edge (u, v) in G, then h(y) is a child (descendant) of h(u); and (ii) the image of h satisfies the boolean formula F. Each embedding induces a witness tree by retaining only the nodes in the embedding and restricting the original tree structure to the retained nodes. Multiple ways of embedding a pattern into a data tree results in multiple witness tree, one per each embedding. T A X operators are defined on top of the concepts of pattern tree and witness tree. 2.1.1 B a s i c T A X O p e r a t o r s Selection. Selection <TP%SL takes a collection of trees C as input and produces a collection of trees as output. Each tree in the output collection is a witness tree induced by some embedding of P into C, except that if a pattern node $2; appears in SL, the whole subtree rooted at the node which witnesses $x is returned as the output. Projection. Projection TTP^PL takes a pattern P and a projection list PL as its parameters, where PL is a list of pattern nodes in P, possibly adorned with "*". Given a collection of trees C as input, projection TTP,PL retains a node in the output if it witnesses a pattern node in PL under some embedding of P into C, or if it is the descendant of a node which witnesses a pattern node adorned with "*" in PL. 11 Product . Product takes two collections of trees C and V as input. For each pair Ti £ C and Tj £ T>, the output of C x D contains a tree, whose root is a new node with a tag name of taxjprodjroot and with Ti as its left subtree and Tj as its right subtree. Join is expressed as product followed by selection; projection join is expressed as product followed by projection. Grouping. The groupby operator jpt3btol takes a pattern tree, a groupby basis and an order list as its parameters. The groupby operator partitions all the witness trees obtained from embedding P into C into groups, where all the witness trees in the same group have the same value on the nodes witnessed a node appearing in gb. For each group, a tree is output as follows. The tree has a root, which has a tag tax_group_root and two children. The left child has a tag tax-group -basis, and a subtree for the groupby basis. The right child has a tag tax_group_subroot, and its children are the source trees in C for the witness tree in the group, ordered according to the order list. The duplicate elimination operator dp^L (where P is a pattern tree and DL is a list of pattern nodes) can be derived from the groupby operator [15] by applying jp,DL,ol, where ol is any order list, to the input collection C and keeping only one tree per partition. Aggregation. The aggregation operator takes a pattern tree P, an aggregate func-t i on / , and a update specif i c a t i o n as its parameters. Theupdate s p e c i f i c a t i o n denotes where to insert the new node representing the aggregation value. For in-stance, an aggregation operator with an update specification a f t e r l a s t c h i l d ( $ i ) results in a new node with attribute-value pairs {<tag,tax^agg_node>,<aggAttr,v>} inserted as the rightmost child of the witness node for $i. 12 2.2 P h y s i c a l A l g e b r a The native X M L database T I M B E R [16] develops a physical algebra [18] complete with respect to T A X . A query is eventually translated- into a physical plan in the format of a directed acyclic graph (DAG) of physical operators. The engine then executes the physical plan to produce query result. 2.2.1 P h y s i c a l Opera to r s Index Scan. The index scan operator ISP(S) takes a predicate p as its parameter. For each input tree in S, output each node satisfying p using an index. Fi l ter . The filter operator FP(S) takes a predicate p as its parameter. Given a sequence of trees S, it outputs only the trees satisfying the filter predicate p. The order in the input sequence is preserved in the output. Sort. Given a sequence of trees S, the sort operator Sf,(S) sort S according to the sorting basis b. The order of the output sequence reflects the sorting procedure (e.g., by value or by node id order of specified node). Value Join . The value join operator Jj,(Si, S2) takes two sequences of trees as input and a join predicate as its parameter. It joins S i and S2 based on a value-based comparison (in contrast to the structural join operator below) via the join predicate p. The order of the output sequence reflects that of Si. A sort-merge join algorithm similar to that used in the relational DBMS is used, except that an extra sorting procedure is included to resort the output of the sort-merge join to conform to the left Si input sequence order. As in the relational world, the physical algebra includes the left-outer value join as a variant with its standard meaning: each tree in Si will be returned in the output sequence even if there is no match with a tree in S2. 13 Structural Jo in . The structural join operator SJr(Si, S2) takes two sequences of trees, S i and S2, as input. Both S i and 52 must be sorted based on the node id of the desired structural relationship. The operator joins S\ and S2 based on the structural relationship r (ad or pc) between them for each pair. The output is sorted by S i or S2 as needed. Variants include: the Outer Structural Join (OSJ) where all of S i is included in the output, Semi Structural Join (SSJ) [1] where only S i is retained in the output, Structural Anti-Join (ASJ) where the two inputs are joined based on one not being the ad/pc-relative of the other, and combinations. Interval encoding is key to the underlying algorithm of the structural join operator [2]. Each node is assigned a pair (startPos, endPos) such that the followings hold for each pair of nodes m and n: startPosm < startPosn if m precedes n in document order; the interval [startPosm, endPosm] contains [startPosn, endPosn] if m is an ancestor of n, or their intervals are disjoined if there is no ancestor-descendant structural relationship between m and n. Each node also has a special attribute pedigree, whose value is a pair (docld, startPos). The pedigree attribute can be use as node id. Group B y . The groupby operator Gb(S) takes a sequence of trees as input. As-suming the input is sorted on the grouping basis b, it groups the trees based on the grouping basis. For each group it creates an output tree containing dummy nodes for grouping root, sub-root and basis and the corresponding grouped trees. Order in the input sequence is retained in the output. Aggregate. The aggregate operator Aint0n(S,name) takes a sequence of trees S and an aggregate function name as input. For each tree, it applies the aggregate function on the node specified by the input node reference expression in and store the result to the node specified by the output node reference expression on. The 14 order of input sequence is preserved in the output. Merge. The merge operator M{S\,..., Sn) takes a number of sequences of trees as input. The Sfs are assumed to have the same cardinality, say k. It performs a "naive" n-way merge of the input tree sequences. For each 1 < i < k, it merges tree i from each input under an artificial root and produces an output tree. Order is preserved. The merge operator is an extremely lightweight operator to stitch together multiple groups of optional return elements. 15 Chapter 3 T A X T r a n s l a t i o n F o r X Q u e r y In this chapter we show how to translate a substantially expressive fragment of XQuery into T A X . 3.1 G r a m m a r For X Q u e r y Fragment While most of function-free XQuery can be handled by this algorithm, we restrict our exposition here to a simplified, yet substantially expressive, fragment of XQuery, captured by the grammar in Fig. 3.1. FLWR ::= (ForClause | LetClause)+ WhereClause ReturnClause. ForClause ::= FOR $fvi IN S i , $fvn IN En. LetClause ::= LET $lvi := Ei, .... $lvn : = En. WhereClause ::= WHERE <p{Ei,En). ReturnClause ::= RETURN {Ex}.. .{En} . Ei ::= FLWR | XPATH. Figure 3.1: Grammar for XQuery Fragment. In addition, for simplicity, we make further assumptions on the grammar as follows. • The atomic predicates allowed in the boolean formula ip are the built-in relop 16 predicates (<, <, =, >, >, 7^ ) or the built-in predicate empty(FLWR). The operand of a relop predicate can be one of the followings: constant c, XPath expression XPE, or agg(XPE), where agg is one of the built-in aggregate functions, namely, avg, count, min, max, or sum. • No backward edges or wildcard appears in FLWR. 3.2 Normal iza t ion 3.2.1 X Q u e r y i n C a n o n i c a l F o r m XQuery has a highly flexible syntax. To simplify the exposition, we introduce the following definition: Definit ion 3.1. XQuery in Canonical Form. An XQuery statement in canonical form is (for $fv\ in range\, $fvm in rangem)? (let $lvi := expri, $lvn := exprn)l (where </>)? return < result > < tagi > {argx} < tagi > < tagk > {argk} < tagk > < result > where either the for or the let clause must appear; there is no filter construct in the XPath expressions; each rangei is an XPath expression; each expri is a n XPath expression or another canonical XQuery statement; tp is a Boolean combination 17 of atomic conditions; each argi is an XPath expression or an aggregation; all the aggregations are bound to let variables only. 3.2.2 X Q u e r y N o r m a l i z a t i o n In this section, we show that a F L W R expression conforming to the syntax given in Figure 3.1 can be normalized into a canonical XQuery statement. L e m m a 3.1. : A FLWR expression Q can be transformed into an equivalent FLWR Q' free of filter constructs ([J). Proof: Prove by an induction on the number of filter constructs in Q. If Q has no filter construct, no rewriting is needed and the statement is obviously true. Assume that any F L W R Q with less than k filters can be transformed into an equivalent F L W R Q' free of filter constructs. If a F L W R Q has k filters, let \(p] be a filter in Q that is not embedded inside any other filter, and Qi be the inmost F L W R expression that contains [up]. There are four possibilities: [</?] appears in the for, the let, the where, or the return clause of Qv (1) The filter construct appears in the for clause, i.e., it appears in an XPath expression as the range of some for variable $x. If $x is declared as u$x in XPE[tp]", i.e., the XPath expression ends with [</?], let if' be (p with all the context nodes, implicit or explicit, of the XPath expressions in <p, except those inside the nested filters, replaced by $x. For instance, if <p is "./publisher", <p' is "$x/publisher"; if <p is "©year > 1991", <p' is "$x/@year > 1991". Let Q[ be Qi except that %x is declared as u$x in XPE", and its where clause is as "where tp' and ipn, assuming the where clause in Qi is "where tp". Let 18 Q' be Q by replacing Q\ in Q with Q^, Q' is equivalent to Q and has less than k filters. From induction Q can be transformed into an equivalent F L W R free of filter constructs. If $x is declared as "So; in XPEi[<p]PE2n, rewrite it as "$t in XPEl[<p], $x in $tPE2v where $t is a variable that does not appear in Q\. This turns Q\ into the above case. (2) The filter construct appears in the let clause in Q\ and some let variable $2; is declared as "$2: := XPE" where XPE contains the filter. Rewrite the let clause as "$x := (for $t in XPE return $t)". This turns Qx into case (1). (3) The filter construct appears in the where clause in Q\ and let XPE be the path expression that contains the filter. Rewrite Q\ by adding a let clause "let %t := XPE" right before the where clause, where $t is a new variable to Q\, and replacing XPE with %t. This turns Q\ into case (2). (4) The filter appears in the return clause and let XPE be the path expression that contains the filter. Rewrite Q\ by adding a let clause "let %t := XPE" right before the where clause, where %t is a new variable to Q\, and replacing XPE with %t. This turns Q\ into case (2). From (1), (2), (3) and (4), Q that has k filter constructs can be transformed into an equivalent Q' without filters. Thus, any F L W R Q can be transformed into an equivalent F L W R Q' free of filter constructs. • Lemma 3.2. : A FLWR Q can be transformed into an equivalent Q' such that the nested FLWR subqueries in Q' occur only in the let clauses. Proof: Because of lemma 3.1, we can assume that Q is filter-free. (1) If a subquery 5 occurs in the for clause, i.e., a for variable $x is declared as "for $x in 5", rewrite Q by adding "let $t := S" immediately preceding the for 19 clause, and by replacing "for $x in S" with "for $2; in $£" where $t is a new variable. (2) If a subquery S occurs in the where clause, i.e., it is in an atomic condition "empty(S)", rewrite Q by inserting a let clause "let $t :— 5" immediately preceding the where clause and replacing uempty(S)n with "'empty'($£)". (3) If a subquery S occurs in the return clause, i.e., it is in the return argument "{S}", rewrite Q by inserting a let clause "let $t := 5"' right before the where clause and replacing "{S}" with Repeating (1) or (2) or (3) can rewrite Q into Q' such that all the subqueries in Q' only occur in the let clause. • L e m m a 3.3. A FLWR Q with the universal quantifier every can be transformed into a FLWR Q' free of univeral quantifier. Proof: Because an every quantifier can only appear in the filters or in the W H E R E clauses, as a consequence of lemma 3.1, we only need to consider the case in which the every quantifier appears in a where clause. Let Q contain an every clause "every $v\ in rangei, $vn in rangen satisfies <£>". Though rangei and <p can be a F L W R by themselves, we can assume them to be XPath expression because of lemma 3.2. Rewrite Q by adding a let clause "let $t := ( for $vi in rangei, in rangen where tjj return $i>i)" right before the where clause, where tp is the Boolean compliment of <p, and replace the every clause with ucount($t) — 0". Q' is equivalent to Q and with one fewer every clause. Repeating the rewriting can make Q free of universal quantifiers. • Similarly, we have a lemma for the some quantifier as follows. L e m m a 3.4. A FLWR Q with the existential quantifier some can be transformed into a FLWR Q' free of the some quantifier. 20 The example in Figure 3.2 illustrates the rewriting using Lemma 3.1, 3.2 and 3.3. f o r $b i n document("bib")//book[./author[./hobby="tennis"]/addr/state!="MI"], $rt i n ( fo r $r i n document("review")//review[every $a i n ./author s a t i s f i e s $a=$b/author] return {$r/rating} ) where $b/Syear > 1995 return < r e s u l t > { $ b / t i t l e } {$rt} < / r e s u l t > Qb: f o r $b i n document("bib")//book, $rt i n ( fo r $r i n document("review")//review where every $a i n $r/author s a t i s f i e s $a = $b/author return {$r/rating} ) l e t $ t l := ( fo r $t3 i n $b/author, $t2 i n $t3/address/state where $t3/hobby = "tennis" return $t2 ) where $b/Syear > 1995 and $ t l != "HI" return < r e s u l t > {$b/title} {$rt} < / r e s u l t > Qc: l e t $t4 := ( fo r $r i n document("review")//review where every $a i n $r/author s a t i s f i e s $a = $b/author return {$r/rating} ) fo r $b i n document("bib")//book, $rt i n $t4 l e t $ t l := ( fo r $t3 i n $b/author, $t2 i n $t3/address/state where $t3/hobby = "tennis" return $t2 ) where $b/Syear > 1995 and $ t l != "MI" return < r e s u l t > {$b/title} {$rt} < / r e s u l t > Qd: f o r $r i n document("review")//review l e t $t5 := (for $a i n $r/author where $a != $b/author return $a) where count($t5) = 0 return $ r / r a t i n g Figure 3.2: Example of normalization. Qa is an example query, Qb is the result after applying Lemma 3.1 to Qa to remove filters, Qc is the result after applying Lemma 3.2 to Qb, and Qd is the result of applying Lemma 3.3 to Qc. L e m m a 3.5. A FLWR Q can be transformed into an equivalent Q' such that in each FLWR subexpression E in Q', its for clause always precedes its let clause if E contains both. 21 Proof: If a F L W R subexpression E has a let clause preceding a for clause, let F' be the last for clause in E and there must be a let clause V immediately preceding F' (note that such a F L W R subexpression can appear at any level of nesting), i.e., E is as follows. {- L ' -} let $lv\ := expr\, §lvm '•= exprm { - F ' - } for $fvi in rangei, $fvn in rangen let ... where ... return ... Rewrite E by inserting a return keyword between V and F' and turning the part of E starting from F' into a parenthesis expression, i.e., E is transformed as follows: {" L ' -} let %lv\ := expri, %lvm := exprm return ( { -F ' -} for $fv\ in rangei, •••! $ / ^ n in rangen let ... where ... return ... ) Repeat the step above, and Q can be transformed into Q' such that the for clauses 22 in Q' precede the let clauses in the same F L W R subexpression. • L e m m a 3.6. A FLWR Q can be transformed into an equivalent Q' such that the aggregations in Q' are only bound to the let variables. Proof: Similar to the proof of lemma 3.2. • From lemma 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6, we have the following: Theorem 3.1. A function-free FLWR Q can be transformed into an normalized XQuery statement Q'. 3.3 Derived T A X Operators When translating XQuery into T A X , it turns out that some T A X operators with similar pattern tree parameters appear as a group in the translation. One straight-forward way to optimize the T A X expression resulting from such translation is to define the groups of close related operators as derived operators. As such, we define the followings: S E L E C T - P R O J E C T - D U P E L I M . The select-project-dupelim (SPDpSPD}prjL,) operator takes two parameters: a pattern tree PSPD and a list of pattern nodes PDL used for projection and duplicate elimination. It is defined as the following: SPDPSPD,PDL(C) = 8pDTDL(npP,PDL(o-pSPDXC))) where Pp is the same as PSPD except with ad edges replaced with pc edges. Pp, is the projection of Pp onto the nodes in PDL, and DL mentions the pedigree of each node in PDL. The join-project-dupelim (JPDpJPDtpr)L) operator is a variant of SPD. It is the same as SPD except that it performs a join instead of a selection. 23 L O J - P R O J E C T - D E - G R O U P B Y - P R O J E C T . The loj-project-de-groupby-project (LG) operator takes three parameters: a join pattern tree PLG, a group-by list GL, and a singleton return list RL. LG is defined as the following: LGpLGiGL,RL(C,T>) = TTpPDtPL(,yPG,GL,RL{SpD,PDL('n'Pp,PDL(CZ M p ^ £ > ) ) ) ) Pp is the same as PIG except with ad edges replaced with pc edges and PDL consists of the root node of Pp (i.e., tax-productjroot), all the nodes in GL and RL. Pp, is the projection of pattern Pp onto the nodes in PDL, and DL mentions the pedigree of each node in PDL. PQ is the same as PD, GL mentions the pedigrees of all the nodes in GL, and RL mentions the pedigree of the single node in RL. PPD is obtained from the resultant pattern of the 7 operator by keeping the tax-group jroot, tax .group-basis and the nodes in PG, tax-groupsubroot, and the single node in RL. The LG with Aggregation (LGA) operator is a variant of the LG operator. It takes one extra parameter, a list of aggregation functions FL =< f\,..-,fn >, where £ {avg, count, min, max; sum}. The LGA operator performs one extra aggregation operation on the single node in RL after LG, inserting the nodes cor-responding to the aggregation values after its last child. 3.4 Translation 3.4.1 Single B l o c k X Q u e r y T r a n s l a t i o n Definit ion 3.2 (Single-block Query) . : A query in canonical form is a single-block query provided all expri are (extended) XPath expressions. Recall that in a query of canonical form, only let—expr can possibly be F L W R expressions (that are not XPath expressions). So, effectively the above definition 24 ensures there is no nesting of F L W R expressions in a single-block query. L e m m a 3.7. A single-block conjunctive FLWR Q in canonical form can be trans-lated into TAX. Proof: Let Q be (for $/t>i in range\, $fvm in rangem)l (let %lv\ := expri, $lvn := exprn)l (where <p)? return < result > < tagi > {argi} < tagx > < tagk > {argk} < tagk > < result > where every rangei and every expri are XPath expressions, <p is a conjunction of atomic conditions, and every argi is an XPath expression or an aggregation. Q can be translated into T A X expressions as follows. Step 1 - translating the for-where clause. (La) Factor out all atomic conditions in the where clause that use LET-variables. (Lb) Identify all tree patterns: two for variables $x and $y are related, denoted as $x = $y, if one of them occurs in the range of the other or if they are both related to a third variable. Clearly, = is an equivalent relation. Partition for variables into equivalence classes based on =. For each class, construct a pattern tree as follows, (l.c) Create one node corresponding to each variable, and one edge from $x to $y whenever $y occurs in the range of $x. (l.d) Expand each edge into an appropriate sequence of ad and pc edges, creating intermediate nodes as required, based on the (partial) path expression correspond-25 ing to the edge. (l.e) Instantiate node predicates corresponding to appropriate nodes from the path expressions. (l.f) If a variable $x occurs in the where clause, expand the tree pattern from the node representing $x to include any path expressions extending $a; in the where clause using step (l.d) and (l.e). Note that such an expanding will create new branches in the pattern tree. (l.g) Generate T A X expression EQ: if there is one pattern generated from the above steps, emit a S P D operator; otherwise, emit a JPD operator. The S P D operator or the JPD operator takes the tree pattern(s) generated from the above steps as the pattern argument, and the list of nodes that represent the for variables as the P D L argument, and is applied to the X M L document(s). (l.h) Record the resulting pattern of the S P D or JPD operator as the for — where handle pattern PFW-Step 2 - translating each let variable %x. (2.a) Construct; a pattern tree PR from the path expression for $2; as in step 1. (2.b) Create a tree pattern P that takes a node with tag taxproductroot as its root, and PL as its left subtree and PR in (2.a) as its right subtree, where PL is defined as follows. If another let variable $y occurs in the expression, use the source pattern of $y as P L ; otherwise, (i.e. if a for variable or a document() built-in function occurs in the path expression for $x) use PFW in step (l.h) as P L . (2.c) Emit LG (or L G A if %x is bound to aggregate function in the where clause or return clause) operator, which takes P in (2.b) as the pattern argument, a list composed of all the nodes in PFW if depends on a for variable or an empty list otherwise as the grouping-by-list argument, and the node representing $2; as the 26 grouping-list argument. The LGA operator takes a list of aggregate functions as its agg — func — list argument. (2.d) Record the resulting pattern of the LG (or LGA) as the source pattern of %x. Note that the leftmost subtree of that pattern is a copy of PFW because there is a projection after the group-by in the LG (or LGA) operator. Step 3 - Completing the translation of the where clause if some atomic conditions are factored out in step 1 because they depend on L E T variables. (3.a) Create a tree pattern P that takes a node with tag tax jproduct_root as its root, and PFW as its leftmost subtree. For each atomic condition being translated, create a pattern for its path expressions as follows and make that pattern as a subtree of P. (3.b) If the path expression starts with a for variable $x, create one node nx corresponding to $x, and according to the path expression, construct an appro-priate sequence of edges and create intermediate nodes required and instantiate node predicates. Find out the node mx in PFW that corresponds to $a;, add mx.pedigree = nx.pedigree to P's Boolean formula. (3.c) If the path expression is an aggregation aggfunc($y), where $y is a let variable, look up the T A X expression EY in step 2 that computes aggfunc($y) and retrieve the resultant pattern PY of E Y . Create a copy of PY and make it the subtree of P . Note that PY contains an exact copy of PFW, i-e., there is a 1-to-l mapping from the nodes of PFW to those of P Y . We call two nodes $m and $n a matching pair if $m is in PFW and $n is $m's mapping image in PY (i.e., they refer to the same for variable). For each matching pair $m and $n, add $m.pedigree = $n.pedigree to P's Boolean formula. (3.d) If the path expression starts with a let variable $y, look up the T A X expression 27 Ey in step 2 that computes $y and retrieve the resultant pattern PY of E Y . Create a copy of PY and expand it from the node represented %y according to the path expression and make it the subtree of P. For each matching pair of nodes $m and %n add $m.pedigree = %n.pedigree to P's Boolean formula. (3.e) Emit a JPD operator as EFW- It takes P as its pattern argument and all the nodes in PFW as its PDL argument, and applies to EQ as its first operand, and an X M L document if the subtree constructed in step (3.b), or EY if the subtree constructed in step (3.c) or (3.d). Note that step 3 can be a no-op if there is no atomic condition depending on any let variable. Step 4 - translating each return argument (4.a) If the R E T U R N argument starts with a FOR variable %x, create a tree pattern P that takes a node with tag tax.product-root as its root, and PFW as its left sub-tree. Create P's right subtree as follows. Create one node nx corresponding to $x, and according to the path expression, construct an appropriate sequence of edges and create intermediate nodes required and instantiate node predicates. Find out the node mx in PFW that corresponds to $x, add mx.pedigree — nx.pedigree to P's Boolean formula. (4.b) If the return argument is an aggregation aggfunc($y), where $y is a let variable, look up the T A X expression EY in step 2 that computes aggfunc($y) and retrieve the resultant pattern PY of E Y . Create a tree pattern P that takes a node with tag tax-product-root as its root, and PFW as its left subtree and PY as its right subtree. For each matching pair (as defined in 3.c) $m and $n, add %m.pedigree — $n.pedigree to P's Boolean formula. (4.c) If the return argument starts with a let variable $y, look up the T A X expres-28 sion Ey in step 2 that computes %y and retrieve the resultant pattern Py of Ey. Create a copy of Py as the skeleton of a pattern tree P\ and expand P' from the node represented $y according to the path expression. Create a tree pattern P that takes a node with tag tax .product .root as its root, and Ppw as its left subtree and Py as its right subtree. For each matching pair (as defined in 3.c) $m and $n, add $m.pedigree = $n.pedigree to P's Boolean formula. (4.d) Emit a LG operator, which takes P as the pattern argument, a list composed of all the nodes in PFW as the grouping-by-list argument, and the node representing the return argument as the grouping-list argument. It applies to EQ or Epyv as its first operand, and an X M L document (case 4.a) or Ey (case 4.b or 4.c) as its second operand. Note that in all three cases, all the resulting patterns have a copy of PFW as its left subtree because there is a projection after the group-by in the LG (or LGA) operator. Step 5 - stitch all the return arguments together if there are multiple return argu-ments. (5.a) Let the T A X expressions generated in step 4 be Ei, Ey., and the re-sulting patterns be P i , Pk. Create a pattern P that takes a node with tag tax .product soot as its root, and P i , Pk as its subtrees. For each pair of nodes $m and $n, where $m is in the left subtree of P i and $n is in the left subtree of Pi, i — 2,...,k, and $m and $n denote the same for variable, add %m.pedigree = %n.pedigree to P's Boolean formula. Emit a PJ operator as expression Efinai, which takes P as its pattern argument, and a list composed of all the pattern nodes representing the return arguments as its PL argument. It applies to Ei, Ek as its operands. 29 Efinal is a T A X expression equivalent to Q. • Example 3.1 (Translating Single Block Query into T A X ) . Figure 3.3 is an example single block query and its corresponding T A X translation. Step (1) translates the f or-where clause, (2) and (3) translate the return arguments %b/title and %r/rating respectively, and (4) combines (2) and (3) to produce the output. The last step of renaming the tag to resu l t is omitted in the figure. • L e m m a 3.8. A single-block FLWR Q of canonical form can be translated into TAX. Proof: Let Q be (for %fv\ in rangei, %fvm in rangem)l (let %lv\ := expr\, §lvn := exprn)l (where ip)? return < result > < tagi > {argi} < tagi > < tagk > {argk} < tagk > < result > Rewrite <p into its equivalent D N F <pi or • • • or tp^, where each ipi is a conjunction. Let Qi be (for $fvi in rangei, $ / « m in rangem)l (let $lvi :— expri, $lvn :— exprn)l (where return < result > < tagi > {argi} < tagi > 30 < tagk > {argk} < tagk > < result > Translate the for, the let and the where clause in each Qi using step 1, 2 and 3 in lemma 3.7. Let E0 be the expression emitted in step 3 to produce the for-where handle of Qi. Note that the resulting pattern of each E0 is the same. Record it as PFW-Emit a Union operator as expression £ ™ M " , which takes all the EQ as its operands. Emit a DE operator as expression EQ, which takes PFW as its pattern argument, a DL list argument that mentions the pedigrees of all the nodes in PFW, and takes gumon ^ ^ s 0 p e r a n c i . Record EQ as the expression for the for-where handle of Q. For each let variable %IVJ in Q, let Ej be the expression emitted for $IVJ when translating Qi. Note that the resulting pattern of each Ej is the same Pj which contains the for-where pattern PFW as its left subtree. Emit a Union operator as expression E™mon, which takes all the Ej as its operands. Emit a DE operator as expression Ej, which takes Pj as its pattern argument, a DL list argument that mentions the pedigrees of all the nodes in the for-where pattern part in Pj, and takes E'jmon as its operand. Record Ej as the expression for %lvj and Pj as the resultant pattern for %IVJ. Proceed to translate the return clause of Q using step 4 and 5 in lemma 3.7. The T A X expression Efina[ emitted in step 5 is equivalent to Q. • Example 3.2 (Translating Disjunctive Single Block Query into T A X ) . Fig. 3.4 (a) is an example disjunctive single block query and (b) is the corresponding translation of its for-where clause. The translation of the return clause is the same as Fig. 3.3 (2), (3) and (4). • 31 L e m m a 3.9. : A FLWR Q in canonical form can be translated into TAX. Proof: Prove by induction on the number of nested blocks in Q. Base case: the number of nested block is one, i.e., there is a let variable $lvi in Q defined as $lvi :— Q\, where Qi is a single block FLWR. We write Q\ as Qi{$fv\,--- ,$fvn,$lvi, ,$lvi-i) to make explicit its correlation with all the for variables in Q and possibly with some let variables defined in the l e t clause pre-ceding it. We can assume that before translating Q\, step (1) and (2) in lemma 3.7 has already translated Epw for the candidate bindings of $/i>i, • • • , %fvn, and E\, • • • , Ei-\ for the candidate bindings of %lv\, • • • , $/i>i_i, respectively. Also as-sume the resultant pattern trees of Epw, E\, • • • , are PFW, PI, • • • , Pi-i, re-spectively. Use the same translation procedure as that of lemma 3.8 to translate Q\, except for a few changes as follows. Step (l.a)-Step (l.f). Create a pattern P' taking a node with tag tax .product jroot as its root, PFW as its left subtree. If an XPath expression in the for and the where clause of Q\ depends on some %lvj where j is one of 1, — 1, add Pj as a subtree of P', and for each matching pair of nodes $m in PFW and $n in Pj, add the conjunct $m.pedigree = %n.pedigree to the Boolean formula of P'. Step (l.g). Emit a JPD operator. It takes P' as its pattern argument, and its PDL argument includes all the nodes in PFW- Its first operand is the T A X expression Epw, other operands are from the proper X M L document, or some Ej if $lvj is used in the preceding step. Step (5). The PL argument of the PJ operator includes the root and all nodes in PFW- Record the resulting pattern as Pj and the P J operator as E{. After step (5) of translating Q\, emit three more operators as follows to combine 32 the translation of Q i with the rest of the translation of Q. (6.a) Emit a GROUPBY operator. It takes PpW as its pattern argument, and all nodes in PFW a s its groupby basis, and the pedigree order of the groupby basis nodes as its ordering function argument. It takes Ei as its operand. Let the resul-tant pattern be Pg. (6.b) Create a pattern P that takes a node with tag tax .product soot as its root, PFW as its left subtree and Pg in step (6.a) as its right subtree. For each matching pair of nodes %m in PFW and $n in Pg, i.e., they correspond to the same for variable, add the conjunct $m.pedigree — %n.pedigree to P's Boolean formula. Emit a LOJ operator. It takes P as its pattern argument, and Epw as its first operand, and Ei as its second operand. (6.c) Emit a PROJECT operator as T A X expression Ej. It takes P in step (6.b) as its pattern argument, and a list composed of all the nodes in the right sub-tree of the root, i.e., the sub-tree rooted at tax.group soot. Record the resulting pattern Pi as the source pattern of $lv{. Note that Pi has the same structural scheme as P i , P i _ i : The left subtree is the PFW of the outer block. Thus the translation procedure in lemma 3.8 can proceed. Continue applying the translation procedure of lemma 3.8 to the rest of Q. This completes the translation of a query with one nested block. Induction case: Assume that if Q has k nested blocks, Q can be translated into T A X . If Q has k + 1 blocks, there must be a let variable $lvi in Q defined as §lvi .— Qinner ($fvi,...,$fvn,$lvi,...,$lvi-i) where Qinner is a single block FLWR, $fvi,...,$fvn and $lv\,...,$lvi-i are variables defined in the outer blocks that en-close Qinner-Because there are at most k nested blocks before translating Qinner, the translation scheme can emit a T A X expressions EFW to compute the candi-33 date bindings of $fv\, $fvn and the T A X expressions E\, Ei-\ to com-pute the candidate bindings of %lv\,...,$lvi-\. Assume the resultant pattern trees of EFw, Ei,..., Ei^i are PFW, P I , - - , Pi-i, respectively. Use the same translation procedure that translates Qi in the base case to translate Qinner, and after step 6.c in the base case, Qinner is translated into a T A X expres-sion with a resultant pattern having PFW as its left subtree. Thus Q is unfolded into a F L W R of k nested blocks. By induction we know that Q can be translated into T A X expressions. • Example 3.3 (Translating Nested Query into T A X ) . Fig. 3.5 (a) is an exam-ple nested query and (b) is the corresponding translation of its re turn argument {$a}. The projection list in (8) is composed of all the nodes in the subtree rooted at $tgr, i.e., {$tgr, $tgb, $tgs, %tpr', %p', $t', $n}. The projection list in (11) consists of {$tgr, $tgb, $tgs, $p', $n}. Note that after (11), $a has the same structure as the other return argument %p/name, which means we can use the Project Jo in as in Fig. 3.3 (4) to stitch them together to generate the answer of the query. • 34 for $b in document("bib")//book, $r i n document("review")//review where $b/author=$r/author and $b/year>1995 return <result> {$b/title} {$r/rating} </result> (a) (4)PJ((2),(3)) Stprl Sb $tgsl < $tgb2 > $t / I $tpr2 $r1 Sb: Stgs2 Srt $tpr.tag = $tpr1 .tag = $tpr2.tag = tax_product_root & $tgr1 .tag = tgr2.tag = tax_group_root & $b1.tag = Sb2.tag book & $r1.tag = Sr2.tag = review & $tgs1 .tag = $tgs2.tag = tax_group_subroot & Stgbl.tag = $tgb2.tag = tax_group_basis & St.tag = title & Srt.tag = rating Sr2 (2)LG((1), "bib") $tpr Stpr.tag = tax_product_root & $tpr1.tag = tax_product_root & Sb.tag = book & Sr.tag = review & Sb.pedigree = Sb'.pedigree & St.tag = title (3) LG((1), "review") $tpr Stpr.tag = tax_product_root & Stprl .tag = tax_product_root & Sb.tag = book & Sr.tag = review & Sr.pedigree = Sr'.pedigree & Srt.tag = rating (1) JPD ("bib", "review") Stpr Stpr.tag = tax_product_root & Sb.tag = book & Sy.tag = year & Sal.tag = Sa2.tag = author & Sr.tag = review & Sy.content > 1995 (b) Figure 3.3: A n example query and the translated T A X expression. 35 for $b in document("bib")//book, $r in document("review")//review where $b/author=$r/author and ($b/year>1995 or $r/year>1995) return <result> {$b/title} {$r/rating} </result> (a) (3)DE((2)) (2)Unlon((0),(1)) (O)JPD ("bib", "review" $tpr Stpr.tag = tax_product_root & Sb.tag = book & Sy.tag = year & Sattag = $a2.tag = author & Sr.tag = review & $y.content > 1995 (1)JPD ("bib", "review") Stpr.tag = tax_product_root & Sb.tag = book & Sy.tag = year & Saltag = $a2.tag = author & Sr.tag = review & $y.content> 1995 (b) Figure 3.4: An example disjunctive query and the translated T A X expression. 36 for $p in document("auction")//person l e t $a := ( for $t in document( Mauction M)//closed_auction let $ t l = ( for $t2 in document("auction")//europe/item where $t/itemref/<3item=$t2/@id return {$t2/name} ) where $p/@id=$t/buyer/@person return <item> {$tl} </item> ) where $p//age>25 return <person> {$p/name} {$a} </person> (5) P r o j e c t ( ( 4 » $tgr Stgs $tgr.tag = tax_group_root & $tpr1.tag = tax_product_root & $t2.tag = item & Sn.tag = name & St.tag = closed_auction & Stpr2.tag = tax_product_root & Sp.tag = person & Stgb.tag = tax_group_basis & Stgs.tag = tax_group_subroot (4) L G ((3), " a u c t i o n " ) Stpr $tpr.tag = tax_product_root & $tpr1.tag = tax_product_root & St2.tag = item & $n.tag = name & St2.pedigree = St2'.pedigree & St.tag = closed_auction & Stpr2.tag = tax_product_root & $p.tag = person (11) Project((10)) T h e pattern is the s a m e a s that in (10) (10) LOJ((1), (9)) $tpr.tag - tax_product__root & Stgr.tag = tax_group_root & Stgs.tag = tax__group_subroot & Stgb.tag = tax_group_basis & Sn.tag = name & Sp-pedigree = $p'.pedigree & Sp.tag = person (9) G r o u p B y ((8)) Sp • f Sp.tag = person (3) J P D ( (2), "auction") Stpr Stpr.tag = tax_product_root & Stpr 1.tag = tax„product_root & Sp.tag - person & Se.tag = europe & St.tag = ciosed_auctlon & Sr.tag = Itemref & $t2.tag = Item & Sr.ltem = $t2.ld (8) Project((7)) T h e pattern is the s a m e a s that in (7) (7) LOJ((2), (6)) StprO (2) J P D ((1), "auc t ion" ) Stpr $p Stpr.tag = tax_product_root & Sp.tag = person & Sp-id = Sb.person & St.tag = closed^auctlon & Sb.tag = buyer StprO.tag = tax_product_root & Stpr.tag = tax_product_root & Stpr'.tag - tax_product_root & Stgr.tag = tax_group_root & Stgs.tag = tax„group_subroot & Stgb.tag = tax_group_basis & Sp.tag = person & St.tag = closed_auctlon & Sn.tag = name & $p.pedigree = Sp'pedigree & St. pedigree = $t'.pedigree (1) S P D ("auction") Sp (6) G r o u p B y (5) $tpr Sa Sp.tag = person & Sa.tag = age & Sa.content > 25 Sp Stpr.tag = tax_product_root & Sp.tag = person & St.tag = closed_auction (b) Figure 3.5: An example nested query and the translated T A X expression. 37 Chapter 4 General ized Tree Patterns In this chapter, we introduce generalized tree patterns (GTP), define their semantics in terms of pattern match, and show how to represent XQuery expressions as GTPs. For expository reasons, we first define the most basic type of G T P and then extend its features as we consider more complex fragments of XQuery. 4.1 Basic G T P s Definit ion 4.1 (Basic G T P s ) . A basic generalized pattern tree is a pair G = (T, F) where T is a tree and F is a boolean formula such that: • each node of T is labelled by a distinct variable and has an associated group number • each edge of T has a pair of associated labels (x,m), where x £ {pc,ad} specifies the axis (parent-child and ancestor-descendant, respectively) and m £ {mandatory, optional} specifies the edge status • F is a boolean combination of predicates applicable to nodes. 38 A _ L 1 0 _L _L 1 0 1 1 1 0 0 0 0 0 V _L 1 0 _ L _L 1 0 1 1 1 1 0 0 1 0 —1 _L 1 0 ± 0 1 Figure 4.1: Extension to handle 'undefined' truth value. Fig. 1.2(b) is an example of a (basic) GTP. Rather than edge labels, we use solid (dotted) edges for mandatory (optional) relationship and single (double) edges for pc (ad) relationship. We call each maximal set of nodes in a G T P connected to each other by paths not involving dotted edges a group. Groups are disjoint so that each node in a G T P is a member of exactly one group. We arbitrarily number groups, but use the convention that the group containing the for clause variables (including the G T P root) is group 0. In Fig. 1.2(b) group numbers are shown in parentheses next to each node. Let G — (T, F) be a G T P and C a collection of trees. A pattern match of G into C is a partial mapping h : G —> C such that: • h is defined on all group 0 nodes. • if h is defined on a node in a group, then it is necessarily defined on all nodes in that group. • h preserves the structural relationships in G, i.e., whenever h is defined on nodes u, v and there is a pc (ad) edge (u, v) in G, then h(v) is a child (descen-dant) of h(u). • h satisfies the boolean formula F. Observe that h is partial matching since elements connected by optional edges may not be mapped. Yet, we may want the mapping as a whole to be valid in the sense of satisfying the formula F. To this end, we extend boolean connectives 39 to handle the 'undefined1 truth value, denoted as _L. Fig. 4.1 shows the required extension. In a nutshell, the extension treats _L as an identity for both A and V and as its own complement for In determining whether a pattern match satisfies the formula F, we set each condition depending on a node not mapped by h to _L and use the extensions to connectives in Fig. 4.1 to evaluate F. We say h satisfies F iff it evaluates to true. The optional status of edges is accounted for by not allowing groups (other than 0) to be mapped at all, while still satisfying F. As an example, consider a pattern match h that maps only nodes $p, $s, $1, $g, $i in Fig. 1.2(b) and satisfies only conditions depending on these nodes. Setting all other conditions to _L, it is easy to check h does indeed satisfy the formula in Fig. 1.2(b). We call a pattern match of a G T P valid if it satisfies the boolean formula associated with the GTP. s i t e m a d d r e s s t a in te res t h i : $ p - > 2 , $ s ->4 , $ l ->13 , $ w - > 9 , $ t ->10, $g ->16 , $ i - > 1 4 h 2 : $p ->19 , $ s - > 2 1 , $ l ->30 , $ w - > 2 4 , $ t ->25, $g->31 h4 : $p ->35 , $ s - > 3 7 , $ l ->42 , $ g - > 4 3 , $ i ->45 h 3 : $p ->19 , $s—>21, $ l ->30 , $ w - > 2 4 , $ t ->26, $g->31 Figure 4.2: (a) Sample X M L data, (b) Pattern matches of G T P of Fig. 1.2(b). The numbers in (b) are the startPoss of the nodes in (a) numbered in the interval encoding scheme [2]. 40 Fig. 4.2 shows a sample X M L document (in tree form) and the set of valid pattern matches of the G T P of Fig. 1.2(b) against it. Each node is numbered in the interval encoding scheme [2]. The number can be assigned as follows. Let the counter start at 0, traverse the tree in pre-order, assign the counter as the startPos of the node and increase the counter by 1 before visiting its children. Upon returning from traversing all its descendants, assign the counter as a node's endPos and increase the counter by 1. The numbers in Fig. 4.2 are the startPoss of the nodes. Note that /12, h% are not defined on group 2, while h\ is not defined on group 1. Also note that matches h% and /13 belong to the same logical group since they are identical except on pattern node $t. 4.2 Jo in Queries A join query clearly warrants one G T P per document mentioned in the query. How-ever, we need to evaluate these GTPs in sync, in the sense that there are parts in different GTPs that must both be mapped or not at all. For instance, consider the query in Fig. 4.3. For every pair of person and open_auction elements satisfying the conditions in the where clause, we need to find corresponding in teres t subele-ments and the bidder subelements of open_auction element whenever the latter is referenced by attribute open_auction of the subelement watch of the person. Note that for every valid (person, open_auction) pair the existence of elements corre-sponding to the first and second arguments is independent, as usual. This logic is correctly captured in the pair of GTPs shown in Fig. 4.3(b), one for each operand of the join. In particular, note that nodes from different GTPs may belong to the same group, signifying that either they are all to be mapped by a match or not at all. 41 f o r $p i n d o c u m e n t ( " a u c t i o n . x m l " ) / / p e r s o n , $o i n d o c u m e n t ( " a u c t i o n . x m l " ) / / o p e n _ a u c t i o n w h e r e $ p / / a g e > 2 5 a n d $ o / i n i t i a l > 1 0 0 0 r e t u r n < r e s u l t > { $ p / / i n t e r e s t } {$o [ @ i d = $ p / / w a t c h / @ o p e n . a u c t i o n ] / b i d d e r } < / r e s u l t > $p.tag - pcrso n & 55*3 .tag — age & Sg.content : - 2 5 & S>" .tag = interest & $w.tag = watch & 3»a la cj — open auction & Join Condition $a-content — $d.content (b) Figure 4.3: A query involving join and corresponding GTP. In many cases, join queries are also nested queries, which we will discuss it at length in section 4.4. $o.tag = ooen auction & $l.tag — Initial & $l_content - 1 OOO & Sci .tag = id & Sb.taq = bidder 4.3 Grouping, Aggregation, and Quantifiers In this section, we discuss the necessary extensions to a basic G T P for handling quantifiers correctly. We assume other than quantification, the query does not in-volve nesting. First, note that a query involving the some quantifier can be rewrit-ten into an equivalent one without it. Specifically, the expression "where some $v 42 in XPathExpression satisfies expression" is equivalent, according to XQuery se-mantics, to "where new Expression", where newExpression is expression with all occurrences of $v replaced by XPathExpression. Since there is no nesting FLWR expression in it, expression must be of the form of boolean combination of XPath expressions (extended with use of variables) are compared with others or with con-stants, making this translation well-defined. Conventional value aggregation in itself does not raise any special issues for G T P construction. Structural aggregation, whereby collections are grouped together to form new groups, is naturally handled via nested queries, discussed in Section 4.4. So we next focus just on quantifiers. Basic GTPs can already handle the some quantifier, since an XQuery expres-sion with some can be rewritten as an one without it. Handling the every quantifier requires an extension to GTPs. Defini t ion 4.2 (Universal G T P s ) . A universal G T P is a G T P G = (T,F) such that some solid edges may be labelled ' E V E R Y ' . We require that: • the G T P includes a pair of formulas associated with an E V E R Y edge, say FL and FR, that are boolean combinations of predicates applicable to nodes, including structural ones • nodes that are beneath the E V E R Y edge and mentioned in FL should be in a separate group by themselves • nodes that are beneath the E V E R Y edge and mentioned only in FR (i.e., not in FL) should be in a separate group by themselves Example 4.1 (Universal G T P ) . Figure 4.4 shows a query with universal quan-tifier and a corresponding universal GTP. The G T P codifies the condition that for 43 f o r $o i n d o c u m e n t ( " a u c t i o n . x m l " ) / / o p e n _ a u c t i o n w h e r e e v e r y $ b i n $ o / b i d d e r s a t i s f i e s $ b / i n c r e a s e > 100 r e t u r n < r e s u l t > {$0} < / r e s u l t > (a) (1) $b« (2) l$i (0) $ 0 EVERY: F_L = pc($o,$b) & $b.tag = bidder F_R: pc($b,$i) & $i.tag = increase & $i.content > 100. $o.tag = open_auction (b) Figure 4.4: A n example universal query and corresponding universal GTP. every bidder $b that is a subelement of the open^auction element $o, there is an increase subelement of the bidder with value > 100. • The formulas associated with the E V E R Y edge represents the constraint (V$xl)...(V$xm) : [FL -» (3$yi)... (3$yn) : (FR)], where $xu...,$xm are the nodes that are mentioned in FT, and beneath the E V E R Y edge, and $y\,...,$yn are the nodes mentioned only in FR and beneath the E V E R Y edge. For the above example, the constraint is V$6 : [$b.tag. = bidder Sz pc($o, $b) — > 3$i : (Si.tag = interest & pc($b, $i) & $i.content > 100)]. In this example, it turns out that the only nodes mentioned in FR are those in a separate group (2), or in F^s group (1). No other nodes appear in the formula FR. This kind of E V E R Y edge can be efficiently evaluated by an anti-semi-structural-join. 44 4.4 Nested Queries We use a simple device of a hierarchical group numbering scheme to capture the dependence between a block and the corresponding outer block in the query. The idea is to add a new level of hierarchy to the group number when entering a new block in the construction of GTP, as illustrated in the following example. for $p in document("auction.xml")//person l e t $a := for $t in document("auction.xml")//closed_auction where $p/@id=$t/buyer/@person return <item> {for $t2 in document("auction.xml")//europe/item where $t/itemref/@item=$t2/@id return {$t2/name}} </item> where $p//age>25 return <person name=$p/name/text()> {$a} </person> (a) (1.0) $1 $e (1.1.0) Si ; $n1 $b (2) (1.0) (1-1-0)4 ^1.1.1) $p.tag=person & $g.tag=age & $n1.tag=Sn2.tag=name & $b.tag=buyer & $t.tag=closed_auction & $i.tag=itemref & $t2.tag=item & $g.conetent>25 Join Condition $p.id=$b.person & $i.item=$t2.id (b) Figure 4.5: An Example query with nesting & join and corresponding GTP. Example 4.2 (Nested Query) . Consider the nested query in Fig. 4.5(a). Corre-sponding to the outer for/where clause, we create a tree with root $p (person) and one solid pc-child $g (age). They are both in group 0. We process the inner F L W R 45 statement binding $a. Accordingly, we generate a tree with root $t (closed -auction) with a solid pc-child $b (buyer). Put these nodes in group 1.0, indicating they are in the next group after group 0, but correspond to the for/where part of the nested query. Finally, we process the return statement and the nested query there. For the for/where part, we create a tree with root $e (europe) with a solid pc-child $t2 (item), both being in group 1.1.0. We also create a dotted pc-child $i (itemref) for $t, corresponding to the join condition $t/itemref/@item=$t2/@id in the corre-sponding where clause. Since it's part of the for clause above, we assign this node the same group number 1.1.0. The only return argument of this inner-most query is $t2/name, suggesting a dotted pc-child $n2 (name) for node $t2, which we add and put in group 1.1.1. We also create a dotted pc-child $i (itemref) for $t, correspond-ing to the join condition $t/itemref/@item=$t2/@id in the inner where. Finally, exiting to the outer return statement, we see the expressions $p/name/text() and $a. The first of these suggests a dotted pc-child $n (name) for $p, which we add and put in group 2. The second of these, $a, corresponds to the sequence of european item names bound to it by the l e t statement, and as such is covered by the node $n2. The G T P we just constructed is shown in Fig. 4.5(b). • How do we define pattern matches of GTPs with nestings so they correspond to XQuery semantics? A simple way to understand this is to start with the set of valid pattern matches, for group 0, corresponding to the outer-most for/where. At this point, if any of these can be extended into a successful match for node $n (group 2), we extend them. Let Mo be the resulting set of (partial) matches. Next, turn to group 1. Group l's for/where part is captured by the subgroup 1.0, so try to extend each match in M so it matches nodes in group 1.0. Let M i be the resulting set of partial matches. This includes matches in M that could not be extended to 46 cover group 1.0 nodes. Try to extend each match in M\ so it matches group 1.1.0 nodes. Let Mi be the resulting partial matches (including those in M\ that could not be extended to cover group 1.1.0 nodes). Finally, extend matches in Mi so as to match the only group 1.1.1 node, $n2. Let M3 be the resulting set of partial matches. Now, M3 contains the necessary information for constructing answers to the XQuery query in Fig. 4.5(a). In general, we can only match a group (e.g., 1.1.0) after its "parent" group (1.0) is matched. As usual, either all nodes in a given group are matched or none at all. For this example, the sequence in which matches should be determined for different groups is concisely captured by the expression 0[2][1.0[1.1.0(1.1.1]]], where [G] means the groups mentioned in G are matched optionally. 4.5 T r a n s l a t i n g X Q u e r y to G T P Putting the above ideas together, one can obtain Algorithm GTP in Fig. 4.6 to translate an XQuery query into a corresponding GTP. The algorithm can translate the fragment of XQuery specified in Fig. 3.1 with the following constraints: • the for clauses precede the l e t clauses in the same F L W R expression • no F L W R expression appears in the predicate tp • no F L W R expression appears in the quantifier (some or every) clause The algorithm has a global parsing environment ENV for bookkeeping the information collected from parsing, including, e.g., variable name-pattern node asso-ciation, G T P - X M L document source association, variable name-variable properties (e.g., for/let variable?, which index in the for variable list?) association, etc. It also 47 Algor i thm G T P Input: a F L W R expression Exp, a context group number g Output: a G T P or G T P s wi th a join formula if (g's last level != 0) let g = g + " .0"; / * stage 1 - processing ForClause * / foreach ("For $fv in E") do p&rse(E,g); let ng = g\ / * stage 2 - processing LetClause * / foreach ("Let $lv := E " ) do{ let ng = ng + 1; p&rse(E,ng); } / * stage 3 - processing WhereClause * / foreach predicate p in </> do { if (p is "every EL satisfies ER"){ let ng = ng + 1; pa.vse(Ei,,ng); let Fi, be the formula associated wi th the pattern resulted from EL', let ng — ng + 1; p&rse(ER,ng); let F R be the formula associated wi th the pattern resulted from ER\ } else { foreach Ei as p's argument do parse(Sj,g); add p to G T P ' s formula or the jo in formula; if (p is "count($n')> c" & & c > = 0){ s'=group($n'); if (g is the prefix of g') set the group number of al l nodes in g' to g; } i f (p refers to max( /min /avg/sum)($n ' ) and $n & & group($n)==g){ g'=group($n'); if (g is the prefix of g') set the group number of al l nodes in g' to g; > > > / * stage 4 - processing ReturnClause * / foreach "{Ei}" do { let ng = ng + 1; parse(i?,ng); } end Algor i thm procedure parse Input: F L W R expression or X P a t h expression E, context group number g Output: Par t of G T P resulting from E if (E is F L W R expression) G T P ( £ , 5 ) ; else b u i l d T P Q ( E ) ; end procedure Figure 4.6: Algorithm GTP 48 uses a helper function buildTPQ(xp), where xp is an (extended)1 XPath expres-sion, that builds a part of G T P from the xp. Whenever xp starts with the built-in document function, a new G T P is added to ENV; if xp starts with a variable, the pattern node associated with that variable is looked up and the new part resulting from xp starts from it. The function examines each location step in xp, creates a new edge and a new node, annotates the edge as pc(or ad, cp, da) as appropriate, according to the axis of the location step and adds a predicate about the node's tag and/or its properties. It returns the distinguished node of xp. Any filter expressions in xp are handled in a way similar to the where clause is, except they are simpler. Group numbers produced for G T P nodes are strings of numbers. The algo-rithm accepts a group number as its parameter, which is initialized to the empty string when invoking the algorithm for the first time. We use the shorthand g + v .xn for appending the number "x" to the string g, and g + 1 for adding 1 to the rightmost number in the string g. 4.6 Translating G T P Into an Evaluation P l a n The main motivation behind G T P is that it provides a basis for efficient implemen-tation. We show one way to do so, in terms of the physical algebra for X M L used in T IMBER, as discussed in Chapter 2. The Evaluation algorithm translates G T P into a physical plan. The plan is a DAG, in which each node is a physical operator or is an input document. To match EVERY edges in the G T P to structural anti-join, it converts them into "forbidden" edges using the transformation V$a; : [FL —» 3$y : FR] = ->3$x : : FR}. It also ignores the issue of value join order, assuming it can borrow such techniques 1XQuery allows XPath expressions extended with variables. 49 from the relational domain. The algorithm uses a helper function findOrder(SJs,$n), where SJs is a list of structural joins, $n is a pattern node that may be optional. The func-tion rearranges the order of SJs such that executing SJs in the order of SJs\, SJs-n is the optimal order. After executing the SJs, if $n is present, the re-turned witness trees are in the ascending order of $n's node id. The helper function getGroupBasis(g) takes a group number g as its parameter and returns an ap-propriate nested sequence of pattern nodes that are related to the for variables in all the 0 groups that are prefixes to g. (Abusing terminology, we say group g is a prefix to group g' provided g ends with 0 and after excluding the 0 at its end, it is a prefix to g'.) For instance, assume that $noi and $no2 are the nodes re-lated to fvi and fv2 in group 0, respectively, and $nn is the node related to the only fv in group 1.0, then getGroupBasis(l.l) returns < $rioi, $fio2> < >>. The helper function getGroupEvalOrder(G) returns the evaluation order of the groups in a G T P G. Basically, the order it returns is the alphabetical order of the group number, except that in the presence of a forbidden edge, the group under the forbidden edge is evaluated before the group above the forbidden edge. The algorithm also prepares the input stream for a pattern node $n from the X M L doc-ument using a tag index scan operator, or a value index scan operator if there is such value index and there is a predicate p($n,c) in the formula. In such case, there may be a sorting operator following the value index scan. For instance, if a node has a constraint $n.tag — age$z%n.content > 40, the plan to fetch the data is Filterc<mtent>25(IStag=age(TagIndex)) if there is no value index on age, or Sort(lScontent>25(V'aluelndexonage)) otherwise. Each intermediate result of an operator in the plan has a record about what 50 pattern nodes are bound in the output after executing the operator, whether the out-put of operator is duplicate free, and whether, if any, the output maintains a sorting order on some nodes' node-id. The output of some operators, e.g., SJ (structural join) or S (sort), maintain some node-id order, while some, e.g., VJ (value-based join), do not. Such information can be easily obtained from the property of the physical operator. The algorithm generates the plan by following the following stages for each group: 1. compute structural joins 2. filter the resulting stream based on the evaluable predicates dependent on the contents of more than two pattern nodes if needed (a predicate is evaluable when all its dependent pattern nodes are bound or the aggregations have been computed) 3. compute value joins if needed 4. compute aggregations, if needed 5. filter the resulting stream based on the predicates dependent on the aggrega-tion values, if needed 6. compute value joins based on aggregation values, if needed 7. group the return argument, if there is any In stage 3 and 6, if the join predicates depends on nodes at different hierarchy levels, the value join operator is a left-outer-join; otherwise, it is a normal inner-join. Sorting and duplicate elimination are added between the stages if needed. 51 Specifically, duplicate elimination is needed in stages 4 and 7. Sorting is needed if the order of the input sequence does not conform to the requirement of the physical operator. ARGUMENT #2. F: filter. IS: tag index scan. S S J : structural semi-join. S J : structural join. O S J : outer structural join. S: sort. M: merge. state BIND FOR/WHERE VARIABLES. Figure 4.7: Physical Plan from the G T P of Figure 1.2(b). Example 4.3 (Translating G T P into a plan). When the above algorithm is applied to the G T P in Figure 1.2 we obtain the plan shown in Figure 4.7. In this plan, we first do an appropriate sequence of structural joins to find matches for group 0 nodes in the GTP. Two important points to note here are: (1) We rely on a techniques such as [14] to find an optimal order of structural joins, (2) We use structural semi-joins wherever appropriate so a need for explicit projection and duplicate elimination is avoided [1]. As an example, the structural join between person and state elements is done as a structural semi-join, so even if there are 52 multiple state elements below a person, with value != 'MP, that person would be retained only once. In the bottom of Figure 4.7, we can see the plan for obtaining the said witness trees. The left operand of the SJ node computes persons with a state != 'MP while the right operand computes profiles with age > 25. The SJ operator computes (person, profile) pairs satisfying a pc relationship. Second, we make use of selection conditions in the where clause to restrict generation of bindings for return arguments. E.g., for the first return argument, it is sufficient to find watch subelements for those person elements $p satisfying $p//state != 'MP and $p/profile/age > 25. This is depicted in Figure 4.7 by forking the result of the SJ node above to the two (independent) subplans computing the two return arguments. Third, rather than compute bindings for the for/where variables and for each return argument separately and combine them with l e f t - o u t e r - j o i n based on the node-id, we use an outer version of structural join. E.g., the left-outer structural join between person and watches under the ad relationship finds all person elements without descendant watches as well as (person, watches) ad pairs. The output (person, profile) pairs of the S J node needs to be sorted by profile (node id) before it can be used for outer structural join with interest. Finally, the sequences from the subplans for the two arguments are both sorted by person node id so they can be merged to form the output sequence. • Example 4.4 (Physical p lan for a nesting G T P ) . Here is an example of translating a more complicated G T P into a physical plan. Applying the algo-rithm to the G T P in Figure 4.5(b) which involves join and nesting subqueries, we obtain the physical plan shown in Figure 4.8. Two things to note are: (1) In the subplan to bind variables in the inner f or-where clause, the value join op-53 FOR-WHERE age ! " " VARIABLES Figure 4.8: Physical Plan from the G T P of Figure. 4.5(b). erator is a LOJ operator because the join predicates use variables from different groups (2) In the subplan to process the second return argument, the group-by ba-sis is < person, < close-auction, < item >>> because the group number of $ri2 in Fig. 4.5(b) is 1.1.1 and getGroupBasis(l.l.l) returns < $p, < $t, < $t2 »>. U Having defined universal GTPs, how do we find matches for them? The main idea is that conceptually, a G T P with EVERY edges can be regarded as consisting of several parts: (i) the part of the G T P consisting of nodes reachable from the root by paths free of EVERY edges, (ii) the part corresponding to FT,, and the part corresponding to FR. If FR mentions nodes outside its group, such nodes are tem-porarily ignored. For the universal G T P in Figure 4.4(b), part (i) consists of just the root node $o with the constraint $o.tag = open_auction. Part (ii) consists of the tree with root $o and one pc-child $b with constraint Sb.tag — bidder. Part 54 (iii) consists of the the tree with root $b and one pc-child $i with constraint $i.tag = increase & $i.content > 100. Pattern matches for such a G T P are obtained as follows. First find all matches for part (i). For each match ft, if it can be extended into a match that maps all nodes in part (ii) correctly, then check whether there is a way to extend it further so part (iii) is also correctly and successfully matched. Then and only then, we conclude [i is a valid match of the given universal GTP. Combining the discussion above, Figure 4.9 shows the algorithm to translate G T P into an evaluation plan. 5 5 Algorithm planGen Input: G T P G Output: a physical plan to evaluate G let G.RPs=getGroupEvalOrder(G); foreach group g in GRPs do { / * stage 1*/ let GB=getGroupBasis(g); let SJs^the set of structural joins (edges) in g\ if (g ends with 0) let $n=the node related to fv\ in g; f indOrder(SJs,$n); foreach sj in SJs do{ if (one input stream of sj depends on a node in other group) set sj to structural outer join; if (one input stream will not be used further) project out the unused node and turn sj to structural semi-join, if possible; } /*stage 2*/ let C = { p | p is a predicate in G T P ' s formula and p refers to a node in g and p is evaluable and p has not been evaluated }; add Filter to the plan, which takes the formula from C as its argument and the output of SJs as input stream; / * stage 3 * / while (3 predicate p in the join formula and p refers to a node in g and p is evaluable and p has not been evaluated) { let J C = t h e set of such ps that depend on the same two inputs; add VJ to the plan, which takes the formula from JC as its argument; if (Bp £JC && p refers to a node in other group){ set V J to outer join; make the output of preceding step be V J's right input stream; } } / * stage 4 * / let AG={agg($n) j $n in g and agg($n) in G T P ' s formula}; add Groupby to the plan, which takes GB and appropriate aggregations as its argument; / * stage 5 * / let AC={p | p is a predicate in G T P ' s formula and p refers to a node in AG and p is evaluable and p has not been evaluated }; add Filter to the plan, which takes the formula from AC as its argument and the output of preceding step as input stream; / * stage 6 * / while (3 predicate p in the join formula and p refers to a node in AG and p is evaluable and p has not been evaluated){ let AJC=the set of such ps that depends on the same two input; add VJ to the plan, which takes the formula from AJC as its argument; if (3p £ JC & & p refers to a node in other group){ set V J to outer join; make the output of preceding step V J's right input stream; } } / * stage 7 * / if (g has a return argument) add Groupby to the plan, which takes GB as its argument; if (g is the last group in its hierarchy) add Merge operator to the plan; } end of Algorithm Figure 4.9: Algorithm planGen 56 Chapter 5 Schema-Aware Opt imiza t ion X M L with its irregular structure poses a great challenge for efficient query process-ing. In the absence of schema knowledge, one must anticipate all the possibilities of optional, repeated, or recursive elements for every element! Often X M L docu-ments conform to some schema, say a DTD or X M L schema, the knowledge of which can be beneficial in two ways: (i) at the logical level, we can simplify the G T P by eliminating nodes, thus reducing the number of structural joins required; (ii) at the physical level, we can eliminate redundant operators (e.g., sorting, duplicate elimination, etc.) in the generated physical plan. 5.1 Logical Opt imizat ion We have identified four types of constraints and their applications in simplifying a G T P based on schema information. The examples readily generalize. 57 \ t 5.1.1 A v o i d a n c e C o n s t r a i n t a n d I n t e r n a l n o d e e l i m i n a t i o n f ? Given three tags a, 6, c, if the schema implies that in any valid data tree instance, any path from a node with tag a to a node with tag c must passes through a node with tag b, we say that a, 6, c satisfy the avoidance constraint aJJ.bc. Suppose there are three nodes $a, $6, $c in a "chain" corresponding to tags a, b, c in a GTP, where $b is an ad-child of $a and is an ad-parent of $c, and $b has no other children, has no other local predicates, and does not correspond to f o r / l e t variable or a return argument. Then we can remove $b from the G T P and make $c an ad-child of $a, if the schema implies the avoidance constraint ai}.bc. The resulting ad edge ($a —> $c) is solid iff each of the edges (Sa —> $b) and ($b —> $c) is. Variations include situations where one or more of the edges could be pc. Example 5.1 (App ly ing avoidance constraint). In a G T P subgraph corre-sponding to the XPath expression book//publisher//address, we can remove publisher if address is the only child of publisher and publisher is not a f o r / l e t / r e t u r n variable and book^pubhsher address holds true. The resulting G T P subgraph is the one corresponds to an XPath expression book//address. • 5.1 .2 I d e n t i f i e d C o n s t r a i n t a n d I d e n t i f y i n g N o d e s W i t h t h e S a m e T a g Two tags t and t' satisfy child (descendant) identified constraint t —> (=>)t' if the schema implies that in any valid data tree instance, there is at most one edge(path) from a node with tag t to a node with tag t'. Suppose the G T P contains a node with tag t with two (pc- or ad-) children (each may be dotted or solid) with the same tag t'. If the schema implies t —> (=>)t\ then the two child nodes of tag t' can be identified, i.e., merged, eliminating unnecessary processing. 58 For instance, the G T P for the XQuery query for $b i n . . . / /book, $r i n . . . / / r e v i e w where $ b / t i t l e = $ r / t i t l e return <x> {$b / t i t l e} {$b/year} </x> would have two nodes corresponding to t i t l e , one in the for-group and the other in the group for return argument 1. The latter can be eliminated and the for-mer can be treated as a return node in addition to its role in the for-group, provided the schema says book—>title. The physical plan from the simplified G T P has one fewer structural join than that from the original GTP. 5.1.3 Leaf elimination and Emptiness detection Queries that test for the existence of a subelement of a certain kind can often be optimized this way, if the schema guarantees their existence or nonexistence. E.g., the XPath expression //book [//address] can be simplified to //book, provided the schema implies every book element has at least one address subelement as its descendant. On the other hand, the query tests for the existence of a year subelement under book is unsatisfiable, if the schema says books cannot have such a subelement. (Because, say books are classified by year.) In general, we say two tags t and t' satisfies child (descendant) (non)existence constraint if the schema implies that in any valid data tree instance, a node with tag t has (not) at least one child(descendant) node with tag t\ denoted as t t' (or t J,o t' in the case of nonexistence constraint). 5.1.4 G T P Simplification Combining the ideas above together, we give an algorithm pruneGTP(G) for sim-plifying a G T P given a set of child (descendant) (non)existence constraints, child (descendant) identified constraints and avoidance constraints, which are typically 59 algori thm p r u n e G T P Inputs: G T P G <T,F > Output: a simplified G while (3$ni, $ n 2 in G s.t. $ni.tag = x&.%ri2.tag = x € F) { if ($ni and $ n 2 are siblings & & both hold child (descendant) identified constraint wi th their parent) unify($ni,$n2); } while (3$n s.t. $n is a leaf of G and %n.tag — x is the only predicate about $n in F and $n is not related to any %fv or $lv or return argument and y x where y is $n's parent's tag) { delete $n from G ; } while ( 3$n a , $ri(,and$n c in G s.t. $ n a is $n;,'s parent and $nt is $n c ' s parent and $n.b-tag = b is the only predicate about $n(, and the avoidance constraint among $na, $nb and $ n c holds) { delete $iib from G ; } end of algori thm procedure unify Inputs: two pattern nodes $ n i and $ n 2 , G T P G <T,F > Outputs: G simplified by combining $ n i and $ n 2 together let g 1 = $ n 1 ' s group, g 2 = $ n 2 ' s group; make al l $n i ' s descendants be $n 2 ' s descendants; replace $ni w i th $ n 2 in F ; relate to $ n 2 a l l $/ws, $Zvs and return arguments related to $ n i ; set the group number of a l l nodes i n gi and <?2 to min{gi, g2)\ delete $ n i from G ; end of procedure Figure 5.1: Algorithm pruneGTP pre-computed from the schema specification. The algorithm applies the constraints in the following order, whenever possible: 1. detect emptiness of (sub)queries 2. identify sibling nodes with the same tag 3. eliminate redundant leaves 4. eliminate redundant internal nodes • 60 Step 1 is trivial: if the algorithm detects that an edge in G are not satisfiable, depending on the policy, it can give feedback to the user, or can remove the whole group GRP from G. Therefore we only show steps 2-4 in the algorithm presented. FOR $p In document)" auctlon.xml")/site/people/person RETURN <result> {$p/name} { FOR $c In document("auction.xml"y/closed_auctlon WHERE $c/buyer/@personref=$p/id AND $p/proflle/@lncome>$c/price RETURN <buy>{$(^priceK$c/annotation[./aufhor]/description//keyword}</buy> } { FOR So In document) "auction.xml ")//open_auctlon WHERE $o/seller/@personref=$p/@ld AND $o/lnltlak$p/proflle/@lncome RETURN <sell>{$oftypeH$o/inltlal}</sell> } </result> (a) (3.0) $o (0) Sp, (2.0) Sc Sn (1) $11 (2.0) Sb $r1 (2.0) (2.0) (b) $k (c) Ss.tag=site & $e.tag=people & $p.tag=person & $n.tag=name & $H.tag=SI2.tag=profile & $c.tag=closed_auctlon & $b.tag=buyer & $b.personref=$p.id & $M.tag=$r2.tag=price & $11.lncome>$r1.content & $a.tag=annotatlon & $u.tag=author & $d.tag=dlscrlptlon & Sk.tag=keyword & $o.tag=open_auction & Sse.tag=seller & $se.content=$p.id & $H.tag=$i2.tag=initial & Si1.content<SI2.income & $t.tag=type $o Sse $11 St (3.0) (3.0) (3.1) Sp.tag=person & Sn.tag=name & $n.tag=name & $l1.tag=proflle & $c.tag=closed_auction & Sb.tag=buyer & $b.personref=$p.ld & $r1 .tag=price & $11 .income>$r1 .content & $k.tag=keyword & $o.tag=open_auction & $se.tag=seller & $i1.tag=initial & $11 .content<$!1 .content & St.tag=type Figure 5.2: A sample XQuery query, the associated GTP, and the simplified G T P Example 5.2 (Simplifying G T P ) . Figure 5.2 shows an XQuery query (a) and its original G T P (b). Applying the descendant constraints {person JJ-? p r o f i l e , closed_auction JJ-x pr ice , open.auction JJ-i i n i t i a l } results in $Z1 and $12, as well as $ r l and $r2, $il and $i2, being identified. The new G T P is the orig-61 inal G T P with $12, $r2 and $i2 dropped and the predicates in the formula de-pendent on them refer to their identified nodes. Applying leaf elimination re-sults in dropping $u and all constraints on it. Applying the avoidance constraints {documentjroot J ] . s i t e people, s i t e J J P e ° P l e person, closed-auction J J . ^ 0 ^ 1 0 1 1 descr ipt ion, descr ip t ion {^annotation keyword} lets us drop $s, $e, $a, and $d along with all the constraints on them. The final G T P is shown in (c). • Given a particular type (avoidance constraint, identified constraint or ex-istence constraint) of constraints, We can show the following result on the G T P simplification algorithm: Theorem 5.1 (Opt imal i ty) . Let C be a set of child(descendant) identified con-straints (existence constraints, avoidance constraints, respectively). LetG be a GTP. Then there is a unique GTP Pimm equivalent to G under the presence of C, which has the smallest size among all equivalent GTPs, over databases satisfying C. The GTP simplification algorithm will correctly simplify G to Hmin and in polynomial time in the size of G. Proof: Obviously, after applying a constraint in C to prune one node from G, the resulting G T P H is equivalent to G. Let |G | denote the number of vertices in G. Inside each loop step, it needs to check at most all pairs of nodes in the case of existence constraint and identified constraint, or all triples of nodes in the case of avoidance constraint. Because each loop step eliminates a vertex from G, the loop repeats less than \G\ times. Obviously the algorithm has a polynomial bound in \G\. To show that there is a unique minimal G T P Hmin equivalent to G, it suffices to show that the algorithm results in the same G T P when it stops, no matter in what order it applies the constraints in C. We consider the following cases: 62 • C is a set of existence constraints. Let H\ and H2 be two GTPs that the algorithm generates by applying constraints in C in different orders. Because the algorithm only eliminates the leaf nodes, it cannot be the case that H\ and H2 have the same leaf nodes but have different internal nodes. If Hi and H2 are different, without loss of generality, there is a leaf node n in H\ which does not belong to H2- The fact that n does not belong to H2 implies that there is a node m in G such that m is the parent of n in G and m and n satisfy an existence constraint in C. Because n is in H\ and the algorithm only prunes leaf node, m must be in H\. It contradicts the fact that the algorithm stops at Hi, because the algorithm can proceed to apply the existence constraint to eliminate n. Thus Hi and H2 must be the same GTP. • C is a set of avoidance constraints. It can be shown the same as the case above. • C is a set of identified constraints. Notice that if m and n are identified iff they are siblings and they have the same tag. Let R be the relation such that aRb iff a and b can be identified. Obviously R is symmetric and transitive. Let Hi and H2 be two GTPs that the algorithm generates by applying constraints in C in different orders. If Hi and H2 are different, without loss of generality, there is a node n in Hx which does not belong to Hi- Let X be the set {x\x € G&ixRn}. There must be a node m e X such that m 7^  n and m 6 Hi; otherwise, n will also appear in H2. Therefore, the algorithm can identify m and n in H\. This contradicts that the algorithm stops at H\. Thus Hi and H2 must be the same GTP. • 63 When C consists of both existence constraints and avoidance constraints, the minimal equivalent G T P is no longer unique. To see this, consider a simple T P Q (which is a GTP!) P corresponding to the XPath expression t l [. / / t 2 / / t 3 ] . Suppose C consists of the descendant identified constraint "every t2 has one or more descendant t3 's" and the avoidance constraint "every t3 that is a descendant of a t l is a descendant of a t2, which is a descendant of the t l " . Then P is equivalent to each of the TPQs Px = t l [. / / t 2 ] and P2 = t l [. / / t 3 ] , but not to t l . Both P i and P2 are minimal. In this case, the simplified G T P found by our algorithm would be P i , since existence constraints are applied before avoidance constraints. The rationale is that a structural join of a leaf node with its parent is more expensive than a structural join of an internal node with its parent. At this time, it is open whether there are smaller GTPs, equivalent to the given G T P under both existence constraints and avoidance constraints, than found by the algorithm. 5.2 P h y s i c a l O p t i m i z a t i o n We have identified three important sources of physical optimization. A l l examples below refer to Figure 4.7. • E l imina t ion of sorting: Suppose we want to perform, say an ancestor-descendant structural join on two input streams ordered respectively by person and profile node id's. The algorithm can create the output in either person order or profile order, but in general, not in both. If we choose the former order, it can be used for processing of return argument 1, without further resorting. But for argument 2, where we need to match profiles with child interests, we need to resort the previous output by profile node id's. However, 64 if the schema implies no person can have person descendants, then the output of the structural join ordered by person node id will also be in profile node id order. Conversely, if the schema implies no profile can have profile descendants, then the output ordered by profile order will also be in person order. E l imina t ion of group-by: In general, for each return argument, we must group together all elements associated with a given match for the for variables, e.g., for watch and interest in Figure 4.7. But if the schema says each profile has at most one interest subelement, then the grouping on the second return argument can be eliminated. For elimination of group-by on the first, the schema needs to imply each person has at most one watches descendant and each watches has at most one watch child. E l imina t ion of duplicate elimination: In general, for each return argu-ment connected to a for variable by a path of length 2 or more and containing only ad edges, we potentially need a duplicate elimination. E.g., this would be the case for watch element (node $t), if the corresponding expression was $p//watches//watch instead of $p//watches/watch. Then $t is connected to the for variable $p by the all ad-path ($p — > $w —> $t). If watches can have watches descendants, then for a given person node, a descendant watch node may be generated multiple times, warranting duplicate elimination. However, if the schema implies watches cannot have watches descendants, this is unnec-essary. For this optimization, we need that for each intermediate node x (i.e., excluding the endpoints) on the all ad-path, the schema implies t cannot have t descendants, where t is the tag of x. 65 5.3 Constraint Inference In this section we will develop the algorithms for inferring the constraints discussed in section 5.1 from DTD. Our algorithms make use of the abstraction of regular tree grammars as structural abstractions of schemas and work off this representation. We then show our algorithms are complete and are polynomial time in the size of the regular tree grammar. 5.3.1 Regular Tree Grammar Our algorithms make use of the standard regular tree grammar (RTG) representa-tion for schemas [20]. Definition 5.1. A regular tree grammar is a 4-tuple (S,N, P,n0) where S is a finite set of terminals (tags), A'" is a finite set of nonterminals, P is a finite set of productions of the form n —> a(R), where n G N, a G S and R is a regular expression over N, and no G N is the start symbol. The structural information specified by DTD is a restricted class of RTGs. Such RTGs have the virtue that no two productions have the same terminal in the right-hand side but different nonterminals in the left-hand side. It is useful to bear in mind that there is a 1-1 correspondence between the sets S and N, so we can speak of a nonterminal corresponding to a tag. 5.3.2 Inferring Avoidance Constraints The avoidance constraints can be tested on a directed graph Q induced from the R T G G for the DTD. Q is obtained from G as follows. Each nonterminal in G is a vertex in Q\ Q has an arc < m, n > iff there exists a production m —> a(R) in G 66 such that n appears in the regular expression R. Note that the exact form of the regular expression R is not relevant for this construction. Consider avoidance constraints of the form aiybc which says every node of tag a that has a descendant node v of tag c must also have a descendant node of tag b which is an ancestor of v. To test whether the D T D implies aJJ-hc, we simply test whether C is reachable from A in Q — {B}, the graph obtained by deleting node B from Q, where A, B, C are the nonterminals corresponding to tag a, b, c. It is easy to show that G implies ai^bc iff C is not reachable from A in Q — {B}. This test can be performed in linear time in the size of Q using depth-first search. The set of all avoidance constraints can be found by testing all possible triples < A, B,C > where A, B,C are the nonterminals corresponding to tag a,b,c, and thus the algorithm is complete. The complexity of the algorithm is 0 ( n 3 x \\G\\), where n is the number of tags and \\Q\\ denotes the size of Q. 5.3.3 Inferring Quantifier Constraints Quantifier constraint of the form alqb (or a^.qb) denotes that every node of tag a has q occurrences of nodes of tag b as its child (or descendant), where q is one of 0,1, ?, +, *, with their well-known standard meaning. Identified constraint and (non)existence constraint are special cases of quantifier constraints, where q is ?, 0 and 1/+, respectively. [9] introduced three operators to determine the combined effect of quantifiers from regular subexpressions: • sum(qi,q2) finds the quantifier q in regular expression (R\,R2) if q\ and q2 are the quantifier for Ri and R2, respectively, e.g., sum(l, *) — + and sum(l, 1) = +• 67 • choice(qi, 92) finds the quantifier q in (i?i|i?2) if qi is the quantifier for Ri, e.g., choice(Q, 1) =? and choice(l, +) = *. • product(qi,q2), where q\ 6 {?, *, +}, returns the quantifier q in i?2? (resp., R2* or i?2+) if 92 is the quantifier for R2, e.g., product(l, +) = *, product^*, +) — *, product(+, 1) = +. Given a D T D as a grammar G — (S, N, P, no), algorithm c—occurences(R, b) in Figure 5.3 computes the q in Rlqb where R is a regular expression over N and b is a tag. Since there is a 1-1 mapping from tag to nonterminal in the RTG, to compute q in a\,qb for tags a and b, we just find the production n —+ a(R) 6 P , and invoke c — occurrences(R, b). Note that the algorithm takes time linear in the size of R. We can find all the constrains x[qy by applying the algorithm to all pairs of tags < x,y >. Thus the algorithm is complete and in polynomial time in the size of G. algorithm c-occurrences Inputs: regular expression R and tag b Output: q in R[qb i f R is e return 0; if (R is nonterminal n) return 1 i f there exists a production n —* b(R'), or return 0 otherwise; if (R is (R1,R2){resp., (Ri\R2))) { q ,i=c-occurrences(i?i,fe); 32=c-occurrences(i?2,^); return sum(q\, g2)(resp., choice(q\, q2) resp.); } if (R is i?i?(resp., Ri*, or R±+)) { qi =c-occurrences(i?i ,b); return product(?, q±)(resp., product(*,q{) or product(+, qi)); } end of algorithm Figure 5.3: Algorithm c-occurrences Algorithm d-occurences (a, b) in Figure 5.4 computes the q in aiqb where 68 a and b are tags. Because of potential cycles in the RTG, the algorithm needs to compute a fixed point. It thus uses two global arrays tempResult[n] and result[n] to record respectively the quantifiers for ntyqb in the preceding iteration and in the current iteration. It also uses a boolean array powerNode[n] to record those nonter-minals that are on a cycle while expanding the nonterminals to compute to ai}-qb. If powerNode[n] is true, the result[n] needs to be adjusted as product(+, result[n}), to account for the effect of the cycle. In the algorithm, A (resp. B) is the nonterminal such that A -> a(Ra) G P (resp. B -> b(Rb) <E P). Note that no nonterminal in G is expanded more than once in the function d-occurrences(i?, b). Thus the function takes at most time linear in the size of G. To find a bound for the number of iterations in the algorithm, we notice the followings: • after the first iteration, the only cause of computing the fixpoint is that there is a cycle when expanding the regular expressions • for each node n in the cycle, result[n] can only be 0 or * or + because of the adjustment • the function d-occurrences(i?, b) produces result in a specific order, which means the computation can be written as the following: Qv2 h(Q' , ( t - i ) n(i-i) 1) n /n(<3Vl' ^ v2> where V\,...,Vn are nodes in the cycle, Qy is result[Vj] in the i-th itera-69 tion, Qly} is tempResult[Vj] in the i-th iteration, fj is the composition of five basis functions sum(q\,q2), choice(q\,q2), product(?,q), product(*,q) and product(+,q). We define an order Oj*j + . Under this order all the five basis functions are monotonous, and hence / i , . . . , /„ are monotonous, too. Without loss of generality, we can assume QVi > QV~X\ Thus Qy2 > Qy~^ and ... and Qyn > Qv~l\ which in turn results in Q v ^ > QlVi. Therefore, after at most O(n) iterations, Qv, will converge to a fixpoint. It is straightforward to extend it so that it computes quantifiers for aJJ.06 for all pairs of terminals a and b, which is still in polynomial time in the size of G. Combining the analysis in section 5.3.2 and 5.3.3, we have the following theorem. Theorem 5.2 (Constraint Inference). Let G be the regular tree grammar as-sociated with a DTD. Then every avoidance constraint (resp. identified constraint, (non) existence constraint) that is implied by G is generated by the constraint infer-ence algorithms given in this section, and in polynomial time in the size of G. 70 algori thm d-occurrences Inputs: tag a and b Outputs: result[a] contains q in a fyq b initialize tempResult[N] to 0 except tempResult[B], which is 1; while (true) { mark all n 6 N as never expanded; let result[A] = d — occurrences(Ra, b); if (tempResult is the same as result) stop; copy result to tempResult; } end of algorithm function d-occurrences Inputs: regular expression R and tag b Output: q in R JJ-9 b if i? is e return 0; if (R is nonterminal n){ if (n marked as fully expanded) return result[n\; if (n marked as being expanded) /*there is a loop*/{ mark all nonterminals marked as being expanded as power nodes; return tempResult[n\; }else{ mark n as being expanded; find the production n —> x(R\); let gi=d-occurrences(i?i,6); i f (x = = 6) let qi = sum(l, qi); i f (n is power node) let 51 = product(+, qi); mark n as fully expanded; let result[n] = <ji; return result[n}; }} if ( i i is (R1,R2)(/(R1\R2))) { qi=d-occurrences(i?i,b); q2=d-occurrences(R2,b); return sum(qi,q2)(/choice(qi,q2) resp.); } if (R is R1?(/R1*/R1+)) { qi =d-occurrences(i?i ,6); return productC?,qi)(/product(*,qi)/product(+,qi) resp.); } end of function Figure 5.4: Algorithm d-occurrences 71 Chapter 6 Exper iment In this chapter we present the results of experiments comparing the execution plan from XQuery-GTP translation and from XQuery-TAX translation, testing the scal-ability of G T P plan and comparing plans with and without schema-aware opti-mization. A l l the experiments were executed using the T I M B E R [16] native X M L database. 6.1 Experiment Setting We used the X M L generator in the XMark [29] project to produce test data set. Experiments were executed on a PIII-M 866MHz machine running Windows 2000 Professional. T I M B E R was set up to use a 100MB buffer pool. A l l numbers reported are the average of the combined user and system C P U times over five executions with the highest and the lowest values removed. 72 Qa: for $b in document("auction.xml")/site /open-auctions/open-auction return {$b/bidder/increase[./=39.00]/text()} Qb: for $p in document("auction.xml")/site/people/person where SOME $i in $p/profile/interest s a t i s f i e s $i/@category="category28" return {$p/name/text()} Qc: for $p in document("auction.xml")/site/people/person where every $i in $p/profile/interest s a t i s f i e s $i/@category!="category28" return {$p/name/text()} Figure 6.1: Queries Qa, Qb, Qc 6.2 G T P Plans and T A X Plans We generated a factor 1 (100MB) X M L document that occupied 479MB when stored in the database. The queries in experiment were designed to test the effect of path length, number of return arguments, query selectivity and data materialization cost in general. We chose a few queries in the XMark benchmark [29], which demon-strated the use of these factor. The X M queries mentioned in this chapter (XM8, XM13 etc) are the corresponding XMark queries. We also created a few queries as shown in Figure 6.1: Qb and Qc to demonstrate quantification, since no XMark query does, and Qa to show a query with relatively long path and with 1 argument in the return clause. We use an index on element tag name for all the queries, which returns a sequence of node ids for a given tag name. We use a value index, which, given a content value, returns a sequence of node ids, to check the condition on content for queries XM5, XM20, Qa, Qb and Qc. We use the algorithm in chapter 3 to translate the queries into a sequence 73 Query Tested A lgo r i t hm B A S E G T P S C H Query Description X M 5 0.89 0.20 0.05 1 argument/return, short path, value index Q a 8.92 0.47 0.08 1 argument/return, long path, value index X M 2 0 11.83 1.09 0.50 > 1 argument/return, med path, value index Q b 23.41 0.39 0.37 > 1 arg/return, quantifier some, high selectivity, value index Qc 25.63 1.09 1.05 > 1 arg/return, quantifier every, low selectivity, value index X M 1 3 1.90 0.50 0.48 > 1 arg/return, long path X M 1 9 70.03 29.49 28.12 > 1 arg/return, lots of generated results X M 8 111.45 15.66 15.07 > 1 argument/return, single value jo in , nested X M 9 180.32 20.50 18.82 > 1 argument/return, mul t i value join , nested Figure 6.2: C P U timings (sees) for XMark factor 1. Algorithms used: B A S E = Baseline plan, G T P = G T P plan, SCH = G T P with schema optimization. The queries are XMark queries (XM5, XM20, . . . ) and the queries (Qa, Qb, Qc) seen in Figure 6.1. of TPQs, where each T P Q is represented by a T A X operator taking a tree pattern as its argument. We call this approach the baseline plan. It is obtained from the TPQs by mapping each edge in each tree pattern to a structural join and mapping each T A X operator to a corresponding T I M B E R physical algebra operator, e.g., the T A X join operator mapped to the value join operator in Section 2.2. We use the algorithm in Figure 4.6 to translate the queries into a G T P and then use the algorithm in Figure 4.9 to generate the physical plan. We call this approach the G T P plan. The experiment results of both plans are summarized in Figure 6.2. From the table we can see the G T P plans significantly outperformed the baseline plans for every query tested, sometimes by one or two orders of magnitude. We also note that both algorithms were affected by the path length due to the increased cost of more structural joins, by the query selectivity due to the cost of materializing the query result, and by the number of return arguments due to query result materialization costs and having to do more sorts and groupings to get the 74 X M a r k scale factor Query 0.05 0.1 0.5 1 5 X M 1 3 0.02 0.05 0.25 0.50 2.43 X M 8 0.58 1.15 7.88 15.66 73.52 Figure 6.3: C P U timings (sees). Using G T P with no schema-aware optimization or value index. final result. The baseline plans were further affected in terms of paying the extra cost of having to do the repeated pattern tree matches. In queries with joins the performance of both plans degraded because of the data materialization cost. Note that the G T P plans could benefit from an index on the join value and perform very well in queries with joins. Unfortunately such an index was not available in our tests. So the performance of the G T P plans decreased when data materialization cost was very high. The baseline plans performed even much worse because it had to carry the penalty of this materialization in all the joins and repeated pattern tree matches. 6.3 S c a l a b i l i t y We tested queries XM13 and X M 8 for scalability. We used XMark factors 0.05 (5MB/24MB), 0.1 (10MB/47MB), 0.5 (50MB/239MB), 1 (100MB/479MB) and 5 (500MB/2387MB) (The first number in the parentheses is the size of the X M L document generated by the XMark X M L generator and the second is the size of the database when storing the document in TIMBER) . The XM13 is a simple selection and XM8 is a nested F L W R query that includes a join. No value indices were used for these tests. As we can see in Figure 6.3, G T P scaled linearly with the size of the database. 75 Figure 6.4: C P U timings (Comparison of G T P and G T P with schema optimization plans for XM5, Qa and XM20. 6.4 Schema-Aware Optimization The column SCH in Figure 6.2 shows the performance of G T P with schema-aware optimization. We see that schema knowledge can greatly enhance performance in some cases, but helps very little in others. Schema-aware optimization performs well when (result) data materialization is not the dominating cost. We also note that when the path was of the form 1/1/1/few and schema optimization converted it to l//few then the benefit was again small. Schema-aware optimization performed well when the path was of the form many/(or //)many/(or //)many and was converted to many//many. This way there is a big benefit from not doing many structural joins. We present in Figure 6.4 a comparison between the G T P and the G T P with schema-aware optimization plans, using queries XM5, Qa, and XM20. As we can see, schema-aware optimization produces much faster executions in these cases. 76 Chapter 7 Conclus ion and Future W o r k This thesis has taken a significant step towards the efficient evaluation of XQuery. We showed that a F L W R expression can be expressed as a sequel of TPQs, yet the evaluation plan from such translation ensues repeated pattern matches and extra joins. To eliminate such extra processing, we proposed a novel structure called gen-eralized tree pattern that summarizes all relevant information in an XQuery into a pattern consisting of one or more trees. Such a representation of XQuery facilitates generating evaluation plan and optimizing a query with schema knowledge, as T P Q does for an XPath query. As such, GTPs can be used as a basis for physical plan generation for Xquery and also as a basis for further logical and physical query optimization, exploiting any available schema knowledge. We gave algorithms to translate XQuery into G T P and then into an evaluation plan using physical opera-tors in a typical native X M L DBMS, to simplify a G T P under relevant constraints and to infer constraints from DTD. We also demonstrated the effectiveness of GTPs with an extensive set of tests comparing G T P plans with plans generated from TPQs. In most cases, G T P plans win by at least an order of magnitude. 77 Query containment for XPath has attracted significant research efforts [32, 19, 28], and yet there is little known result for query containment for XQuery as a whole. GTPs provide an elegant framework with which to study query contain-ment for XQuery, to our knowledge for the first time. The algorithm to simplify a G T P given in chapter 5 is just one of the applications. We expect that query containment will be also applicable to query answering using (XQuery) views and incremental view maintenance. Much more work is needed to fully understand the notion of query containment in XQuery. We identified several constraints relevant to logical and physical optimization. Yet, more work is needed to fully exploit all schema knowledge (e.g. the key and the foreign key constraint) and comprehensively calibrate its performance benefits. Whereas our experimentation has been limited to the T I M B E R system, and hence can directly be extrapolated only to native X M L database systems, the G T P concept should be equally applicable to relational mappings of X M L . A rigorous evaluation of the benefits GTPs bring to one of the two fundamental problems in the relational X M L systems, namely how to translate XQuery into SQL based on a selected relational schema for storing X M L , remains part of our future work. 78 Bibl iography [1] S. Al-Khalifa and H. V . Jagadish. Multi-level Operator Combination in X M L Query Processing, pp. 286-297, C I K M , Nov. 2002. [2] S. Al-Khalifa et al. Structural joins. A primitive for efficient X M L query pattern matching. ICDE 2002. [3] K . Runapongsa and J .M. Patel. Storing and Querying X M L Data in Object-Relational DBMSs. E D B T Workshop on X M L data management, 2002: 266-285. [4] T. Bray et al. Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/2000/REC-xml-20001006, Oct. 2000. [5] T. Bray et al. Document Object Model (DOM) Level 1 Specification http://www.w3.org/TR/REC-D0M-Level-l/, Oct. 2000. [6] A . Berglund et al. X M L path lan guage (XPath) 2.0. http://www.w3.org/TR/xpath20/, Nov. 2002. [7] P. V . Biron and A. Malhotra. X M L schema part 2: Datatypes. W3C Recom-mendation. http://www.w3.org/TR/xmlschema-2/, May 2001. [8] S. Boag et al. XQuery 1.0: An X M L query language. http://www.w3.org/TR/xquery, W3C Working Draft. Nov. 2002. [9] Denise Draper et al. XQuery 1.0 and XPath 2.0 Formal Semantics http://www.w3.org/TR/xquery-semantics/, W3C Working Draft. Nov. 2003. [10] D. DeHaan et al. A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding, A C M SIGMOD 2003. [11] N . Bruno et al. Holistic twig joins: Optimal X M L pattern matching. A C M • SIGMOD, 2002. 79 [12] T. Fiebig et al. Anatomy of a native X M L base management system. VLDB Journal, 11(4):292-314, 2002. [13] D. Florescu and D. Kossman. Storing and querying X M L data using an RDMBS. IEEE Data Eng. Bull, 22(3):27-34, 1999. [14] Y . Wu et al. Structural Join Order Selection for X M L Query Optimization. ICDE 2003. To appear. [15] H.V. Jagadish et al. T A X : A Tree Algebra for X M L . D B P L 2001, Rome, Italy. [16] H. V . Jagadish et al. T IMBER: A native X M L database. VLDB Journal, 11(4):274-291, 2002. [17] H.V. Jagadish et al. Implementing XQuery using T A X . Tech. Report, U . Michi-gan. Dec. 2003. [18] Stelios Paparizos et al. A Physical Algebra for X M L . Tech. Report, U . Michigan. Dec. 2003. [19] G. Miklau and D. Suciu. Containment and Equivalence for an XPath Fragment. PODS 2002: 65-76. [20] M . Murata et al. Taxonomy of X M L schema languages using formal language theory. Extreme Markup Languages. Montreal, Canada, August 2001. [21] J. Naughton et al. The Niagara Internet Query System. Available at http://www.cs.wisc.edu/niagara/papers/ NIAGARAVLDB00.v4.pdf [22] H. Schoning. Tamino - A DBMS designed for X M L . ICDE, 2001. [23] J. Shanmugasundaram, K . Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational databases for querying X M L documents: Limitations and opportunities. In Proceedings of the International Conference on Very Large Databases, 1999. [24] J. Simeon et al. Galax, An open implementation of XQuery. h t tp : / /db .be l l - l abs .com/ga lax / . [25] I. Tatarinov et al. Storing and querying ordered X M L using & relational database system. SIGMOD, 2002. [26] H. S. Thompson, D. Beech, M . Maloney, and N . Mendelsohn. X M L schema part 1: Structures. W3C Recommendation. Available from http://www.w3.org/TR/xmlschema-l/, May 2001. 80 [27] P.T. Wood. Minimising Simple XPath Expressions, in Proc. 4th Int. Workshop on the Web and Databases (WebDB) (Santa Barbara, California, May 24-25), 2001, pp. 13-18. [28] P.T. Wood. Containment for XPath Fragments under D T D Constraints, ICDT 2003, LNCS 2572 [29] XMark, an X M L benchmark project, h t tp: //www. xml-benchmark. org/ . Com-panion paper appeared in VLDB'02. [30] Xin Zhang, B. Pielech, and E. A. Rundensteier. Honey, I Shrunk the XQuery! — An X M L Algebra Optimization Approach. Workshop on Web Information and Data Management, Nov. 2002. [31] C. Zhang et al. On supporting containment queries in relational database management systems. SIGMOD, 2001. [32] Sihem Amer-Yahia et al. Minimization of tree pattern queries. SIGMOD, 2001. 81 

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051223/manifest

Comment

Related Items