Schema mapping and query translation in heterogeneous peer-to-peer XML databases Chang, Elaine Qing 2005

Schema Mapping and Query Translation in Heterogeneous Peer-to-Peer XML Databases

by Elaine Qing Chang
B.Sc. (Honours), Trent University, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia
October, 2005

© Elaine Qing Chang, 2005

Abstract

In a peer-to-peer data management system, the peers may have heterogeneous schemas and no mediated global schema. To facilitate data exchange, we assume each entering peer provides correspondences between its schema and a small number of other peer schemas. We study the problem of schema mapping and query translation in the context of heterogeneous XML schemas, featuring data/schema conflicts. We develop an algorithm for inferring precise mapping rules from informal schema correspondences. We define the semantics of query answering in this setting and develop an algorithm for query translation. Our translation can work both along and against the direction of mapping rules and can handle an expressive fragment of XQuery. We have developed the HePToX heterogeneous P2P XML data management system on top of Emulab, a large-scale P2P network emulation testbed, incorporating our ideas and results. We describe our implementation strategy and report the results of an extensive set of experiments on HePToX on both synthetic and real data sets, demonstrating its utility and scalability.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 XML
  1.2 Thesis Background
  1.3 Challenges, Contributions and Chapter Overview

2 Definitions and Related Work
  2.1 Definitions
    2.1.1 XQuery
    2.1.2 Tree Pattern Query and Generalized Tree Pattern
  2.2 Related Work
    2.2.1 Data exchange and P2P data integration systems
    2.2.2 Schema-matching systems

3 A Motivating Example
  3.1 Example of Informal Correspondences
  3.2 The Mapping Language

4 Inferring Mapping Rules from Arrows and Boxes
  4.1 Identifying Groups
  4.2 Generating Tree Expressions
  4.3 Generating Mapping Rules
  4.4 Chapter Summary

5 Classes of Transformations
  5.1 Unnest and Nest
  5.2 Flip and Flop
  5.3 Merge and Split
  5.4 Additional Useful Operators
  5.5 Theorem

6 Query Translation of Tree Pattern Queries
  6.1 Query Translation Semantics
  6.2 XQuery Fragment Considered
  6.3 Forward Query Translation Algorithm
    6.3.1 Translating Tree Patterns
    6.3.2 Translating Join of Tree Patterns
    6.3.3 Translating TPQ to XQuery
    6.3.4 Forward Query Translation Algorithm
    6.3.5 Backward Query Translation
  6.4 Correctness of Query Translation

7 Query Translation for Generalized Tree Pattern
  7.1 Generalized Tree Pattern
  7.2 GTP Query Translation Examples
    7.2.1 XQuery Forward
    7.2.2 XQuery Backward
  7.3 GTP Query Translation Algorithm
  7.4 Conclusions and Future Work

8 Exploring 1:N and N:1 Mappings in HePToX
  8.1 Introduction
    8.1.1 Challenges of dealing with N:1 and 1:N Mappings
  8.2 Different categorizations of N:1 and 1:N Mappings
    8.2.1 Mapping at Schema-Level or Data-Level
    8.2.2 Mapping from Multiple Tuples or a Single Tuple
    8.2.3 String vs Arithmetic Manipulation
  8.3 Representation of N:1 and 1:N Mappings in HePToX
    8.3.1 Visual Annotations used to Capture N:1 and 1:N Mapping
    8.3.2 Representation of N:1 and 1:N Mappings in Mapping Rule
    8.3.3 Rule Generation Algorithm with 1:N and N:1 Mappings
    8.3.4 Variable Binding in Mapping Rules
  8.4 Query Translation with N:1 and 1:N Mappings
    8.4.1 Query Translation Algorithm
    8.4.2 Query Translation Example with only N:1 Mapping
    8.4.3 Query Translation Example with N:1 and 1:N Mapping
  8.5 Analysis of Query Translation involving N:1 and 1:N Mappings
    8.5.1 Def. of Soundness and Completeness of Query Translation
    8.5.2 Price-Cost Example
    8.5.3 Sound but not Complete translated query results
    8.5.4 Complete but not Sound translated query results
    8.5.5 Summary of Soundness and Completeness
  8.6 Conclusion and Future Work

9 Experimental Study
  9.1 Experimental Guidelines
  9.2 Implementation and Setup
  9.3 Datasets and Queries used for Experiments
  9.4 Rule Inference Algorithm in HePToX
  9.5 Query Translation in HePToX
  9.6 Query Performance in HePToX
  9.7 Scalability of HePToX
  9.8 Real Dataset

10 Conclusions and Future Work

Bibliography

A Schema Graphs of the 10 XMark Schemas in Experiments
B XMark Schema Variation Descriptions
C XMark Queries in Experiments
D DBResearch Queries in Experiments

List of Tables

9.1 Query Description for XMark Dataset

List of Figures

1.1 Mapping between Heterogeneous Peer schemas MON and BOS
2.1 Tree Pattern Query representation for XQuery XQ1
2.2 XQuery Example XQ2
2.3 Generalized Tree Pattern representation for XQuery XQ2
3.1 Mapping between Heterogeneous Peer schemas MON and BOS
3.2 Mapping Rules from MON to BOS in Figure 3.1
4.1 Example of Group Determination for Figure 1
4.2 Rule Construction Algorithm
5.1 Illustrating Unnest/Nest
5.2 Illustrating Flip/Flop: Schemas
5.3 Illustrating Flip/Flop: Instances
5.4 Illustrating Merge/Split: (a) schema; (b) instance
6.1 Join of two Tree Patterns
6.2 Translation of TP t1 from Figure 6.1
6.3 Translation of TP t2 from Figure 6.1
6.4 Translation of the join of TPs from Figure 6.1
6.5 Forward Query Translation Algorithm
6.6 Join of two Tree Patterns for Q2
6.7 Translation of TP 1 in Figure 6.6
6.8 Translation of TP 2 in Figure 6.1 with rule 2 in Figure 3.2
6.9 Translation of TP 2 in Figure 6.1 with rule 4 in Figure 3.2
6.10 Stitching Translated TP Pieces for TP 2 in Figure 6.6
6.11 Merging Figure 6.7(1c) and Figure 6.10(TP2-S)
6.12 Backward Query Translation Algorithm
7.1 An Example XQuery and corresponding GTP
7.2 Book-Author Schemas for GTP Query Translation Example
7.3 HePToX Mapping Rules for Schemas in Figure 7.2
7.4 XQuery posed on the source schema XQsrc
7.5 GTP Representation for XQsrc
7.6 Expanded GTP for XQsrc
7.7 Translated GTP GTP'
7.8 Contracted Translated GTP GTPtgt(c)
7.9 Final Translated Query XQtgt
7.10 GTP representation of XQtgt
7.11 Expanded GTP[A] against the rule head GTPtgt[A]
7.12 Translated GTP[A] against the source schema GTPsrc[A]
7.13 Expanded GTP[B] against the target schema GTPtgt[B]
7.14 Translated GTP[B] against the source schema GTPsrc[B]
7.15 Final Translated Query XQsrc
7.16 GTP Query Translation Algorithm
8.1 Mapping N tag names to 1 instance value
8.2 Mapping N instance values to 1 tag name or instance value
8.3 Mapping combinations of tag names and instance values
8.4 Aggregation Mapping
8.5 Book-Author example - 2:1 String Manipulation Mapping
8.6 Price-Cost example - 2:1 Arithmetic Manipulation Mapping
8.7 Mapping Rule with Nto1 Expression for Book-Author Example
8.8 Rule Construction Algorithm for 1:1, N:1 and 1:N Mappings
8.9 Book-Author example schema in HePToX
8.10 Mapping Rules for Book-Author example in Figure 8.9
8.11 Forward Query Translation Alg. with 1:1, N:1 and 1:N Mappings
8.12 Backward Query Translation Alg. with 1:1, N:1 and 1:N Mappings
8.13 Tree Pattern Query representation of Q1
8.14 Query Translation of Q1 in HePToX with N:1 Mapping
8.15 Price-Salary Example - with both 1:N Mapping and N:1 Mapping
8.16 Mapping Rule with both Nto1 and 1toN Expressions
8.17 Tree Pattern Query representation of Q2
8.18 Query Translation of Q2 - XQuery Forward
8.19 Tree Pattern Query representation of Q3
8.20 Query Translation of Q3 - translation against the mapping rule
8.21 Mapping Rule with Nto1 Mapping for Price-Cost Example
9.1 Query Translation and Performance for XMark Dataset
9.2 Time Composition in Querying for DBResearch Dataset
A.1 Original Auction Schema: auction0.dtd
A.2 Schema Variation: auction1.dtd
A.3 Schema Variation: auction2.dtd
A.4 Schema Variation: auction3.dtd
A.5 Schema Variation: auction4.dtd
A.6 Schema Variation: auction5.dtd
A.7 Schema Variation: auction6.dtd
A.8 Schema Variation: auction7.dtd
A.9 Schema Variation: auction8.dtd
A.10 Schema Variation: auction9.dtd

Acknowledgements

This thesis would not have been possible without the help of many people. I am very grateful for all the support that I have received throughout the process.

First of all, I would like to thank my supervisor, Dr. Laks V.S. Lakshmanan, for his tremendous support, great guidance, and continuous encouragement throughout this thesis work. I have been inspired by his enthusiasm for research and have learned from him much more than just the research.

Secondly, I would like to thank my collaborators Angela Bonifati and Terence Ho for their hard work and great efforts. The HePToX project is a collaboration with Laks V.S. Lakshmanan, Angela Bonifati and Terence Ho. In particular, the experiments in this thesis were conducted together with Terence Ho.

Thirdly, I would like to thank Rachel Pottinger for reviewing my thesis, and for her tireless support and insightful advice.
Many thanks also go to Alon Halevy and Igor Tatarinov from the University of Washington for sharing their DBResearch dataset with us, and to Renee Miller from the University of Toronto for her invaluable comments. I would also like to thank Shuan Wang and Ting Wang, who were working together with Terence Ho and me on the HePToX demonstration.

Last but not least, many thanks also go to Raymond Ng, Wendy Hui Wang, Jessica Zheng Zhao, Xiaodong Zhou, Shaofeng Bu, Zsuzsanna Hollander, Holly Kwan, Alan Hu, Anne Condon, Joyce Poon, the UBC Database Lab, the UBC Computer Science Department, Roman Holenstein, Qixing Zheng, and my parents Fulin Chang and Qinglin Xie.

Chapter 1 Introduction

1.1 XML

XML [34], which stands for "Extensible Markup Language", is a markup language for documents containing structured information. A markup language is a mechanism to identify structures in a document. The XML specification defines a standard way to add markup to documents.

Structured information contains both content (words, pictures, etc.) and some indication of what role that content plays (for example, content in a section heading has a different meaning from content in a footnote, which in turn means something different from content in a figure caption or content in a database table). Almost all documents have some structure.

An XML document is a database only in the strictest sense of the term: it is a collection of data. In many ways, this makes it no different from any other file; after all, all files contain data of some sort. As a "database" format, XML has some advantages. For example, it is self-describing (the markup describes the structure and type names of the data, although not the semantics), it is portable (Unicode), and it can describe data in tree or graph structures. XML differs from relational databases in that its data is semi-structured, with schema information that specifies the document structure.
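To make the "self-describing" point concrete, here is a small, hypothetical patient record (the element names and values are illustrative only, not taken from the thesis's datasets) parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A hypothetical record: the markup names the structure and types of
# the data (self-describing), though not its semantics.
doc = """
<Patient ID="p1">
  <Name>Jane Doe</Name>
  <MedCrNum>12345</MedCrNum>
  <Treat><Date>2005-01-15</Date><Desc>antibiotics</Desc></Treat>
</Patient>
"""

root = ET.fromstring(doc)
print(root.tag, root.attrib["ID"])   # Patient p1
print(root.find("Name").text)        # Jane Doe

# Semi-structured: elements such as Treat may repeat or be absent,
# unlike a fixed relational row.
for t in root.findall("Treat"):
    print(t.find("Date").text, t.find("Desc").text)
```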
The need to generate XML data from a variety of data sources is becoming increasingly important to a wide variety of data processing environments. Interoperability and data exchange between mainframe systems and other systems running on a variety of hardware and software platforms is essential.

In this thesis, we use the graph notation to represent XML schemas, since it is intuitive. Note that the schema graph representation is rich enough to capture atomic types, sets, tuples, nested structures, keys, referential constraints, and optionality.

1.2 Thesis Background

Consider a large-scale data sharing setting where several XML data sources in a similar domain (e.g., hospitals/medical centers) share (parts of) their data. The source schemas can be quite different. Figure 1.1 shows an example of two hospital DTDs expressed as graphs, adapted from [6, 24, 25]. Each schema is inside a rectangle and there are cross-links connecting them, which we ignore for now.

Figure 1.1: Mapping between Heterogeneous Peer schemas MON and BOS. To minimize clutter, some arrows between corresponding tags have been omitted (e.g., @ID to @ID).

The first schema corresponds to the Montreal General Hospital (MON for short) database. Here, every patient is assigned a unique ID by the hospital and has a unique Medicare number (MedCr#). Symptoms of problems experienced by patients and the treatments they are administered are all grouped under patients. Patient admissions data is maintained separately and is captured via the Key/Keyref link @PatRef → @ID (solid grey arrow KR1, Figure 1.1).

A patient who moves from Montreal to Boston needs to establish a connection with the Mass General Hospital (BOS for short) schema. The BOS database in Boston is organized quite differently: at admission time, patients are classified on the basis of their main complaint (pulmonary, coronary, or other illnesses not shown in the figure).
The usual patient details such as name, hospital ID, etc. are stored under this classification. Progress of patients during their stay in the hospital is recorded: patients' history of health problems and the treatments administered are tracked. All of this information is connected to the patients via the Key/Keyref link @PatRef → @ID (solid grey arrow KR2, Figure 1.1).

Consider again several autonomous health/medical data sources with heterogeneous schemas. A natural strategy for data exchange between the sources is to embed them in a loosely coupled environment, thus leveraging maximum flexibility, with no hard commitments to conform to any rigid requirements. Currently, there is significant interest in peer-to-peer (P2P) database systems (e.g., see [4], [6], [12], [19], [25], [28], [32]). A P2P database system is essentially a collection of autonomous database systems connected by a P2P network, with the flexibility that peers may enter or leave the network at will. In this thesis, we assume that any peer, at the time of entering the P2P network, provides a mapping between its schema and a small number of the existing peer schemas. The peers chosen by the entering peer are called its acquaintances [4]. E.g., the MON peer may provide correspondences between the concepts (elements and attributes) in the MON schema and those in the BOS schema. These correspondences are either manually supplied by the peer administrator, or generated by (semi-)automatic schema mapping tools such as [18]. We assume the correspondences are specified intuitively in the form of arrows and boxes, shown in Figure 1.1 and explained in Chapter 3. In this thesis, we present the HePToX system (pronounced "Hep Talk": heterogeneous peers talk!), a HEterogeneous Peer TO peer Xml database system, which takes as input correspondences specified between pairs of peer schemas and automatically derives precise semantic mappings between the schemas.
HePToX allows a peer user to query any peer's data using its own local schema and translates queries in a way that is consistent with the mapping semantics. This work builds upon several previous works on peer data management, but differs in key ways. E.g., [22], [28], and [8] generate mappings between schemas differing in nesting structure but do not address data/meta-data conflict. Piazza [6] uses a different mapping language and relates pairs of queries over the peers via containment or equality. [7] requires mappings to be input by the user in a more complex XQuery-like language. Differences in data representation were addressed in [19] by introducing mapping tables: e.g., we might specify that ID GDB:123 in one database corresponds to all IDs but SP:456 in another database. These are aimed at mapping value aliases in P2P networks, and are orthogonal to our contributions. Clio [28] was one of the earliest projects to pioneer the concept of correspondences (among other things). A detailed comparison with related work appears in Section 2.2. Briefly for now, to the best of our knowledge, mappings between schema and data items across peers are not considered in previous work. In HePToX, such complex mappings are automatically inferred from the intuitive correspondences input to the system. We use boxes (in ad-dition to arrows) for capturing such mappings. As a natural consequence, we study correspondences involving non-leaf elements of schemas. One of our main goals is to permit users and applications of any peer database to access data items of interest by simply posing a query to their peer, regardless of the location of the data items or the schema under which they are organized. That is, the existence of numerous peers and their schemas should be transparent to the user/application. 
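As a rough sketch (our own simplification, not HePToX's internal format), the arrow-and-box input can be thought of as a list of path correspondences, with boxes marking the schema-to-data links; the specific paths below echo arrows c9, c7, and box c3 discussed in Chapter 3:

```python
# Hypothetical encoding of correspondences between MON and BOS.
# kind="data" marks a boxed correspondence, where schema labels on one
# side match data values on the other (illness tags vs. Problem values).
correspondences = [
    ("MON/Patient/@ID",       "BOS/@ID",              "element"),
    ("MON/Admission/DisDate", "BOS/Leave",            "element"),   # c9
    ("MON/Treat/Desc",        "BOS/Treatment/Desc",   "element"),   # c7
    ("MON/Admission/Problem", "BOS/{Pulmonary,Coronary,...}", "data"),  # c3
]

def targets_of(src_path):
    """Look up where a source concept maps on the acquaintance's schema."""
    return [(tgt, kind) for s, tgt, kind in correspondences if s == src_path]

print(targets_of("MON/Admission/DisDate"))  # [('BOS/Leave', 'element')]
```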
1.3 Challenges, Contributions and Chapter Overview

The major challenges we address in this thesis are the following: (1) How can we infer precise mapping rules from an informal notation (arrows/boxes)? (2) What is the exact semantics for the evaluation of peer queries, and how do we translate queries across heterogeneous peer schemas? (3) How can we evaluate queries correctly and efficiently over the network? Summarizing, the following contributions are made in this thesis.

• We propose an informal mechanism for specifying correspondences using arrows and boxes. We develop a novel algorithm for inferring mappings between the schemas, couched in the form of SchemaLog-like rules. Our mapping language is capable of querying and restructuring semistructured data and handles the data/meta-data interplay between schemas elegantly (Chapter 4).

• The significance of the class of transformations captured by the mapping rules has been studied, and their equivalence to algebraic transformations has been established (Chapter 5).

• We define the semantics of peer queries. We develop a novel query translation algorithm that handles a simple but significant fragment of XQuery and show that it is correct w.r.t. the above semantics. We illustrate our algorithm with examples (Chapter 6). Translation is non-trivial even for the XQuery fragment considered. We have also studied how to translate a more complicated XQuery fragment that can be represented as a GTP, and present a novel query translation algorithm for GTP queries (Chapter 7).

• Traditionally, most of the schema mapping literature deals only with one-to-one mappings. In this thesis, we have studied many-to-one and one-to-many mappings.
We discuss the challenges of dealing with these mappings, how to represent them in HePToX, and how to make use of them to translate queries (Chapter 8).

• We have developed HePToX, a HEterogeneous Peer TO peer Xml database system, incorporating the ideas developed in this thesis. We describe the implementation of HePToX and show the results of an extensive set of experiments that measured the effectiveness of the query translation algorithm, as well as the scalability of the approach. We discuss the results and the lessons learned (Chapter 9).

In the next chapter, we first introduce the basic definitions that we use throughout the thesis, as well as the related work. Following that, in Chapter 3, we study a comprehensive motivating example, which will be used throughout the major chapters of this thesis. Then we present the main contributions in the order mentioned above: Inferring Mapping Rules from Arrows and Boxes (Chapter 4), Classes of Transformations (Chapter 5), Query Translation of Tree Pattern Queries (Chapter 6), Query Translation of Generalized Tree Patterns (Chapter 7), and Exploring Many-to-One and One-to-Many Mappings in HePToX (Chapter 8), followed by the experiments (Chapter 9); finally, Chapter 10 summarizes the thesis and discusses possible future work directions. Note that the work on GTP query translation (Chapter 7) and many-to-one and one-to-many mappings (Chapter 8) is still ongoing in HePToX. These two chapters report the work that has been done so far and summarize possible future work.

Chapter 2 Definitions and Related Work

2.1 Definitions

2.1.1 XQuery

XQuery [35] is a powerful and convenient language designed by the W3C for querying XML data. It queries not only files in XML format, but also other data, including databases, whose structure (nested, named trees with attributes) is similar to XML.
The basic construct of an XQuery expression is the so-called FLWOR (For, Let, Where, Order By, Return) expression. The result of the whole FLWOR expression is a sequence generated by concatenating all the newly constructed RETURN elements. The FOR clause binds a variable to each element in a sequence by iterating over the sequence in turn. The LET clause binds a variable to all elements in a sequence as a whole. The optional ORDER BY clause orders the surviving tuples of bindings generated by the FOR and LET clauses according to some criteria specified in the query. Let's take the following XQuery as an example:

XQ1: FOR $b IN doc("books.xml")//book1,
         $a IN $b/author1
     WHERE $b/title1 = "XML"
     RETURN { $a/name1 }

The FOR clause selects all book1 elements into a variable called $b, and all author1 elements under book1 elements into a variable called $a. The WHERE clause selects only book1 elements with a title1 element whose value equals "XML". Finally, the RETURN clause specifies what should be returned; here it returns the name1 elements.

2.1.2 Tree Pattern Query and Generalized Tree Pattern

For convenience in processing XQuery statements, we need a concise representation. In this thesis, we make use of tree patterns (TP) [2] and generalized tree patterns (GTP) [13], introduced in the literature as tools for representing the XQuery fragment considered and for translating it. We briefly review these notions below.

A tree pattern query is a rooted tree representation of an XQuery with child and descendant edges, and with nodes labeled by variables. Each node variable may be additionally constrained to have a specific tag (e.g., name($x) = book) and (in the case of leaf elements) possibly constrained on its value (e.g., $y/text() = "123" or $x/text() > $y/text()). Figure 2.1 shows the tree pattern query (TPQ) for the XQuery example XQ1 from the last section.
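As an unofficial illustration of the FLWOR semantics of XQ1, the same iterate/filter/project pipeline can be mimicked in a few lines of Python over ElementTree (the inline document stands in for books.xml; real XQuery engines do far more):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<books>
  <book1><title1>XML</title1><author1><name1>A. Writer</name1></author1></book1>
  <book1><title1>SQL</title1><author1><name1>B. Coder</name1></author1></book1>
</books>""")

# FOR $b IN //book1, $a IN $b/author1  -> nested iteration
# WHERE $b/title1 = "XML"              -> filter on bindings
# RETURN $a/name1                      -> projection
result = [a.find("name1").text
          for b in doc.iter("book1")
          if b.find("title1").text == "XML"
          for a in b.findall("author1")]
print(result)  # ['A. Writer']
```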
The semantics of a TPQ is that its nodes are matched to data nodes in an input XML database while respecting edge relationships and node constraints. Each match leads to an answer that consists of the data node matching a specially marked distinguished node of the TPQ (a distinguished node is a node whose variable is specified to be returned in the XQuery). E.g., in Figure 2.1 the answer should contain the name1 nodes, which correspond to the output node marked with a dashed box. As a convention, if no explicit distinguished node is marked in an example, we intend that all leaf nodes of the TP are distinguished. The central importance of TPQs to XML query evaluation is evident from the attention of recent research on efficient evaluation of TPQs [1, 11, 38].

Figure 2.1: Tree Pattern Query representation for XQuery XQ1. [The pattern has root $book' with children $author' and $title', and $author' has child $name', under the constraints name($book') = book1, name($author') = author1, name($name') = name1, name($title') = title1, and $title'/text() = "XML".]

While XQuery expression evaluation includes the matching of tree patterns, and hence can include TPQ evaluation as a component, there is much more to XQuery than TPQs. To facilitate efficient evaluation of more complicated XQuery queries, the notion of a generalized tree pattern was introduced in [13]; a GTP allows optional edge relationships and carries grouping information in the representation. As a preview, Figure 2.2 and Figure 2.3 show a sample nested XQuery as well as the associated GTP. The GTP has solid and dotted edges. Solid edges represent mandatory relationships (parent-child or ancestor-descendant), just like the edges of a TPQ. Dotted edges denote optional relationships. The GTP can be informally understood as follows: (1) find matches for all nodes connected to the root by only solid edges; (2) next, find matches for the remaining nodes (whose paths to the GTP root involve one or more dotted, optional edges), if they exist.
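The two-phase, mandatory-then-optional reading of a GTP can be sketched as follows (our own simplification over ElementTree, with an invented document; the tags echo the book/title/author shapes used above):

```python
import xml.etree.ElementTree as ET

# Sketch of GTP-style evaluation: solid edges are mandatory, dotted
# edges optional -- a missing optional match does not discard the
# binding, it just leaves that slot empty (None).
doc = ET.fromstring(
    "<books>"
    "<book1><title1>XML</title1>"
    "<author1><name1>A. Writer</name1></author1></book1>"
    "<book1><title1>SQL</title1></book1>"   # no author: optional part absent
    "</books>")

answers = []
for b in doc.iter("book1"):        # phase 1: solid (mandatory) edges
    t = b.find("title1")
    if t is None:
        continue                   # mandatory child missing -> no match at all
    a = b.find("author1/name1")    # phase 2: dotted (optional) edge
    answers.append((t.text, a.text if a is not None else None))

print(answers)  # [('XML', 'A. Writer'), ('SQL', None)]
```

Note how the second book1, lacking an author, still contributes an answer tuple; under pure TPQ semantics it would simply fail to match.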
Figure 2.2: XQuery Example XQ2

XQ2: FOR $b IN //books, $i IN $b/isbn, $t IN $b/title
     RETURN <book> {
       FOR $a IN //author, $i1 IN $a/book/isbn, $t1 IN $a/book/title
       WHERE $i = $i1 AND $t = $t1
       RETURN $a/name }
     </book>

Figure 2.3: Generalized Tree Pattern representation for XQuery XQ2. [Group 0 holds $b with children $i and $t; Group 1.0 holds $a with children $n and $b1, where $b1 has children $i1 and $t1. Join constraints: $i = $i1, $t = $t1. Tag constraints: name($b) = books, name($i) = isbn, name($t) = title, name($a) = author, name($n) = name, name($b1) = book, name($i1) = isbn, name($t1) = title.]

2.2 Related Work

2.2.1 Data exchange and P2P data integration systems

The Clio data exchange project is one of the pioneers in schema mapping discovery, query answering, and mapping maintenance against schema changes, among others (e.g., see [22], [28], [37], [33]). They were the earliest to propose the idea of schema correspondences and to develop an efficient algorithm for inferring mappings in the form of queries. Their framework can capture both relational and XML schemas. Further, as part of the Hyperion project (e.g., [4], [19]), the authors propose an elegant notion of mapping tables, capturing constraints on data exchange without violating peer autonomy. A key novelty of our contribution is the handling of mappings between peer schemas that involve a data/meta-data conflict (e.g., Figure 1.1), which, to the best of our knowledge, is not captured in prior work on schema mappings and query translation. HePToX's way of handling such conflicts also yields a nice mechanism for mappings between internal nodes (i.e., complex elements) across schemas. Mapping tables are orthogonal to our contributions and nicely complement them.
Finally, consider the schemas (taken from [28]) src: Emps → Emp*, Emp → A B C, and tgt: Emps → Emp*, Spouses → Spouse*, Emp → A B* E, Spouse → E C, where there is a keyref/key constraint from Emp/E to Spouse/E. In the Clio approach, the value of E in schema tgt would be created as a function of the values of A and C, whereas in HePToX it would be created as a function of A. There are natural examples justifying both choices. E.g., let A = empName and C = spouseName. If B = child, creating E as a function of A and C nicely groups the children of an employee and spouse. But if B = employee's hobby, which is unrelated to the spouse, then creating E as a function of A is more meaningful.

The Piazza project ([6], [7]) defines semantic LAV/GAV-style mappings among peer schemas. Query answering semantics is based on the notion of certain answers [6]. Piazza mappings are provided in an XQuery-like language. Our mappings are easier and more intuitive to specify. Also, the exact mapping rules are derived automatically from correspondences, as in Clio. Our forward query translation has a semantics similar to answering queries using views [21]. However, we are able to leverage Skolem functions and the special nature of mapping rules to obtain the forward translation efficiently. Benedikt et al. [8] and Bohannon et al. [9] have also studied mappings between schemas, including recursive ones. They also consider node-to-path mappings. None of these works handles schema mappings involving data/meta-data conflict.

Calvanese et al. [12] address the problem of data interoperation in P2P systems using expressive schema mappings, also following the GAV/LAV paradigm, and show that the problem is in PTIME only when mapping rules are expressed in epistemic logic. [14] studies the problem of finding a minimal query reformulation between relational and XML schemas. Consistency of XML data exchange for tree patterns is studied in [5]. Fagin et al.
[3] showed how the composition of two first-order GLAV mappings can create a second-order mapping. However, they did not consider mappings between data and schema items. We are able to directly generate mappings between data and schema items across schemas, from given correspondences. Our mapping language, like SchemaLog [20], is semantically reducible to first order. Finally, the notion of primitive groups used in mapping rule inference (Chapter 4) is similar to the notion of primary paths used in Clio [28], although the groups we generate are different from the logical relations of Clio.

2.2.2 Schema-matching systems

Automated techniques for schema matching (e.g., CUPID [18], [16], [29]) are able to output elementary schema-level associations by exploiting linguistic features, context-dependent type matching, similarity functions, etc. These associations could constitute the input of our rule inference algorithm if the user does not provide the arrows.

Chapter 3 A Motivating Example

In the introduction, we introduced the HePToX system, which handles schema heterogeneity in peer-to-peer systems. As input, it takes a set of correspondences between schemas. It then creates a set of SchemaLog-like mapping rules from the correspondences. These mappings are then used to translate queries between schemas.

In this chapter, we explore the example discussed in the introduction (Figure 3.1) in more depth to illustrate the richness of the problems HePToX considers.

Figure 3.1: Mapping between Heterogeneous Peer schemas MON and BOS. To minimize clutter, some arrows between corresponding tags have been omitted (e.g., @ID to @ID); every unlabeled edge is labeled '1' by default.

3.1 Example of Informal Correspondences

Consider the arrows linking the schemas now.
Each arrow specifies a correspondence between a pair of elements from the two schemas. E.g., arrow c9 shows that DisDate in MON corresponds to Leave in BOS. For clarity, we use a different arrow type for each correspondence. Since the schemas have shared elements, i.e., they are DAG-structured, we need to specify correspondences between paths. E.g., the element Desc has a unique parent Treat in MON but two parents, Treatment and Symptom, in BOS. To relate Desc in MON to Treatment/Desc in BOS, we use the same line type for the edge Treat-Desc in MON, the edge Treatment-Desc in BOS, and the arrow c7 connecting them. Other correspondences can be understood similarly.

HePToX also allows the expression of schema-to-data correspondences: these are represented by a dashed box around the schema-level concepts that are related to data-level concepts in the other schema. For example, Admission/Problem in MON may contain data items such as 'Pulmonary' or 'Coronary', so the Pulmonary and Coronary elements in BOS are grouped with a (thick dashed) box and are linked by arrow c3 to Admission/Problem in MON. Note that there may be other illnesses that correspond to values of Admission/Problem, not shown in the figure for brevity. Illnesses that are instances of Admission/Problem in the MON database correspond to element labels in the BOS database, but there is no assumption that the sets of illnesses occurring in the two databases are the same or even overlap. Thus, the notation of boxes and arrows effectively captures correspondences between data items and schema items across schemas. While HePToX handles 1:1, 1:N, N:1 and M:N correspondences, in this chapter and in Figure 3.1 we show only 1:1 correspondences. N:1 and 1:N mappings will be discussed in Chapter 8.

3.2 The Mapping Language

The mapping language we use in this thesis is an adaptation of SchemaLog [20] to deal with tree-structured data.
SchemaLog is a syntactic higher-order logic for querying and restructuring relational databases, treating data and schema at par. Thus, our mapping language is capable of querying and restructuring semistructured data. Mappings are expressed as rules. Each rule is of the form (rule head) <- (rule body). The building blocks of this language are called tree expressions, defined and explained below. The mapping rules are made up of atoms of the form Tag -> id, where Tag is a tag or a tag variable and id is the id associated with a node with this tag. Here, id may be a variable or any term of the form f($v1, ..., $vn), for some variables $vi and some Skolem function f. Similarly to Clio [28], SchemaLog [20], and many other previous works, we use Skolem functions both for creating new node ids and for grouping. Atoms can be nested inside other atoms, thus expressing nesting. A comma-separated list of atoms is used for expressing the subelements of a given element. Attributes are preceded with a '@'. We stress that the rules and the mappings are not intended for physically transforming data from one source's schema to another. As pointed out in [4], they are rather intended for expressing the semantics of data exchange: if data were to be exchanged from source 1 to 2, how would it correspond to the schema of source 2. Atoms can be nested to form tree expressions. Tree expressions are either atoms (t -> i) or are of the form t -> i[TE1, ..., TEk], where t -> i is an atom and the TEi are tree expressions. In Figure 3.2, each rule head is a tree expression while the rule body is a conjunction of tree expressions and built-in predicates (=, >, etc.). We informally explain rules 1 and 4 below. Rules 2 and 3 can be understood similarly.

1.
MassGeneral -> f1($Mgh)[Admission -> f2($Mgh)[$AP/text() -> f3($AP/text(), $ID, $M, $AD, $DD, $N)[@ID -> $ID, Policy# -> $M, Enter -> $AD, Leave -> $DD, Patient -> $N]]] <-
MonGenHosp -> $Mgh[Patient -> $P[@ID -> $ID, MedCr# -> $M, Name -> $N], Admission -> $A[Problem -> $AP, AdmDate -> $AD, DisDate -> $DD, @PatRef -> $PR]], $PR = $ID.

2. MassGeneral -> f1($Mgh)[Progress -> f4($PR)[Symptom -> f5($EP, $ED)[Date -> $ED, Desc -> $EP]]] <-
MonGenHosp -> $Mgh[Patient -> $P[@ID -> $ID, MedCr# -> $M, Name -> $N, Hist -> $H[Event -> $E[Problem -> $EP, Date -> $ED]]], Admission -> $A[Problem -> $AP, AdmDate -> $AD, DisDate -> $DD, @PatRef -> $PR]], $PR = $ID.

3. MassGeneral -> f1($Mgh)[Progress -> f4($PR)[Treatment -> f6($TDate, $TDesc)[Date -> $TDate, Desc -> $TDesc]]] <-
MonGenHosp -> $Mgh[Patient -> $P[@ID -> $ID, MedCr# -> $M, Name -> $N, Treat -> $T[Date -> $TDate, Desc -> $TDesc]], Admission -> $A[Problem -> $AP, AdmDate -> $AD, DisDate -> $DD, @PatRef -> $PR]], $PR = $ID.

4. MassGeneral -> f1($Mgh)[Progress -> f4($ID)[@PatRef -> $PR]] <-
MonGenHosp -> $Mgh[Patient -> $P[@ID -> $ID], Admission -> $A[Problem -> $AP, AdmDate -> $AD, DisDate -> $DD, @PatRef -> $PR]], $PR = $ID.

Figure 3.2: Mapping rules from MON to BOS in Figure 3.1.

The atom MassGeneral -> f1($Mgh) in rule 1 says translating the unique root $Mgh of MON onto the BOS schema yields a unique root f1($Mgh) of BOS. Similarly, there is a unique Admission node in BOS. The rule body binds the variable $AP to the problem of a patient at admission time. $AP/text() extracts the text value associated with node $AP. This value is used to form the tag of a new node, whose id is f3($AP/text(), $ID, $M, $AD, $DD, $N), i.e., it is a function of the patient's admission-time problem ($AP/text()), id ($ID), insurance policy (or medicare) number ($M), admission and discharge dates ($AD, $DD), and name ($N). Note that the arguments of the Skolem function
f3 are exactly the mandatory single-valued subelements of the Pulmonary and Coronary elements in BOS (i.e., those not labeled '?', '*', or '+'). While we do not require any knowledge of integrity constraints, we can take advantage of the presence of constraints such as keys and key/foreign-key referential constraints. E.g., if a key is known, we need only use the key elements as arguments of the Skolem function instead of all mandatory single-valued ones. Patient id, name, policy number, and admission and discharge dates are all matched to their counterparts in BOS. Rule 4 maps the @PatRef attribute in MON to the @PatRef attribute in BOS. Note that the equality $PR = $ID ensures the rule is safe. Node ids play a key role: e.g., Progress elements are created by rules 2 and 3 independently. Whenever the id of a Progress node created by rule 2 matches the one created by rule 3, they refer to one and the same node. For instance, suppose p5 is the ID value of a patient. Then the subtree rooted at the Progress node f4(p5) created by rule 2 and the subtree rooted at the Progress node f4(p5) created by rule 3 are both glued at the node f4(p5). More generally, whenever subtrees are created by applications of the same or different rules, conceptually all these subtrees are glued together at roots having a common node id. This ensures that the pieces "computed" by rules are correctly glued together. So far, we have explained the meaning of mapping rules. A main challenge is their automatic derivation from user-supplied informal correspondences; this is addressed in Chapter 4. The class of algebraic transformations captured by the rules is established in Chapter 5.

Chapter 4

Inferring Mapping Rules from Arrows and Boxes

In this chapter, we illustrate the HePToX automatic mapping rule inference algorithm, using the running example of Figure 3.1.
Given a pair of schemas, represented as graphs, and a set of arrows/boxes relating nodes across the graphs, we want to automatically infer a set of rules for mapping instances of one schema into instances of the other. Suppose we wish to infer mapping rules from a schema D1 (call it the source) to another D2 (call it the target) based on given correspondences (as in Figure 3.1). The algorithm consists of the following main modules. (1) Determine groups of nodes in the two schemas such that each group intuitively captures some "unit" of information. (2) For each source group, construct a tree expression describing the unit of information following the hierarchical structure of the group. (3) For each target group, identify all minimal sets of source groups necessary to populate information into the target tree expression structure and construct the rules.

4.1 Identifying Groups

In a schema graph, we define group nodes. Each group node induces a primitive group, consisting of a set of nodes. The root is a group node. Any node which has an incoming edge labeled '?', '+', or '*', or is an ancestor of such a node, is a group node. E.g., in Figure 3.1, in the MON schema, all non-leaf nodes are group nodes. For the purpose of group formation, we turn the DAG into a tree by replicating nodes with multiple parents, if necessary. Note that the set of nodes inside a box is treated as a single node for this purpose. Let T be the resulting tree. Let u be a group node in T and suppose v1, ..., vk are all its non-group children. Then the primitive group induced by u consists of u, all its ancestors in T, the children v1, ..., vk, and their descendants.
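The group-node and primitive-group definitions above can be sketched in code. The following is an illustrative Python rendering (the schema encoding and function names are ours, not HePToX's), computing group nodes and the primitive group induced by each node, on a small fragment of the MON schema:

```python
# Schema tree encoded as: node -> list of (edge_label, child).
# This fragment of MON is illustrative only.
SCHEMA = {
    "MonGenHosp": [("*", "Patient")],
    "Patient":    [("1", "@ID"), ("1", "Name"), ("*", "Treat")],
    "Treat":      [("1", "Date"), ("1", "Desc")],
    "@ID": [], "Name": [], "Date": [], "Desc": [],
}
ROOT = "MonGenHosp"

def parents(tree):
    return {child: parent for parent, kids in tree.items() for _, child in kids}

def group_nodes(tree, root):
    """Root, every node with an incoming '?', '+', or '*' edge, plus ancestors."""
    par = parents(tree)
    marked = {root} | {c for kids in tree.values() for lab, c in kids if lab in "?+*"}
    for n in list(marked):              # close the set under taking ancestors
        while n in par:
            n = par[n]
            marked.add(n)
    return marked

def descendants(tree, n):
    out = {n}
    for _, c in tree.get(n, []):
        out |= descendants(tree, c)
    return out

def primitive_group(tree, root, u):
    """u, all its ancestors, its non-group children, and their descendants."""
    par, gn = parents(tree), group_nodes(tree, root)
    grp, n = {u}, u
    while n in par:                     # add ancestors
        n = par[n]
        grp.add(n)
    for _, c in tree.get(u, []):        # add non-group children and below
        if c not in gn:
            grp |= descendants(tree, c)
    return grp
```

On this fragment, `primitive_group` reproduces groups of the same shape as g1 ({MonGenHosp}), the Patient group, and the Treat group described next.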
The primitive groups corresponding to the MON schema of Figure 3.1 are g1 = {MonGenHosp}, g2 = {MonGenHosp, Patient, @ID, MedCr#, Name}, g3 = {MonGenHosp, Patient, Hist, Event, Problem, Date}, g4 = {MonGenHosp, Patient, Treat, Date, Desc, Doc}, g5 = {MonGenHosp, Admission, Admission.Problem}, and g6 = {MonGenHosp, Admission, AdmDate, DisDate, @PatRef}. Note that since MonGenHosp has no non-group children, the group induced by it is just {MonGenHosp}. Primitive groups model basic relationships that exist between different data items in a given schema and are thus similar to the primary paths of [28]. A main difference is that primitive groups account for nodes with boxes and thus facilitate mappings between data and schema. Primitive groups can be merged under some conditions, formalized below. Let v be any node in the schema tree T above and u an ancestor of v. Then we say v is mandatory relative to u if none of the edges on the path from u to v is labeled '?' or '*'. Call v single-valued relative to u if none of the edges on the path from u to v is labeled '*' or '+'. Let g, h be two primitive groups. Let u be the least common ancestor of the nodes in g and h. Then g and h can be merged provided: (i) all descendants of u in g are mandatory single-valued relative to the root; or (ii) all descendants of u in g are mandatory single-valued relative to u and all descendants of u in h are mandatory relative to u. In our example, we can merge g1 with any one of the other primitive groups since condition (i) holds vacuously. Also, g6 and g5 have Admission as their least common ancestor, relative to which AdmDate, DisDate, and @PatRef are mandatory single-valued, and Problem is mandatory. So, by (ii), g6 and g5 can be merged. The resulting groups for the MON schema are shown in Figure 4.1, which also shows the groups for the BOS (target) schema. Note that groups always induce connected subgraphs of the schema graph.
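The relative notions of mandatory and single-valued reduce to simple checks on the edge labels along a path. A minimal sketch (our own encoding, with a path given as the list of its edge labels, root side first):

```python
def mandatory(path_labels):
    # v is mandatory relative to u if no edge on the u-to-v path is '?' or '*'
    return all(lab not in ("?", "*") for lab in path_labels)

def single_valued(path_labels):
    # v is single-valued relative to u if no edge on the path is '*' or '+'
    return all(lab not in ("*", "+") for lab in path_labels)

# AdmDate relative to Admission: the single edge is labeled '1'
assert mandatory(["1"]) and single_valued(["1"])
# Problem relative to Admission: labeled '+', so mandatory but not single-valued,
# which is exactly why condition (ii) lets g6 and g5 merge
assert mandatory(["+"]) and not single_valued(["+"])
```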
sg1 = {MonGenHosp, Patient, Hist, Event, EProblem, EDate}
sg2 = {MonGenHosp, Patient, Treat, TDate, TDesc, Doc}
sg3 = {MonGenHosp, Admission, Admission.Problem, AdmDate, DisDate, @PatRef}
sg4 = {MonGenHosp, Patient, @ID, MedCr#, Name}
tg1 = {MassGeneral, Admission, Pulmonary, Coronary, @ID, InsName, Policy#, Enter, Leave, Patient}
tg2 = {MassGeneral, Progress, Symptom, SDate, SDesc}
tg3 = {MassGeneral, Progress, Treatment, TDate, TDesc}
tg4 = {MassGeneral, Progress, @PatRef}

Pairs of Groups Connected by Arrows: sg3 <- tg1, sg4 <- tg1, sg1 <- tg2, sg2 <- tg3, sg3 <- tg4.

Figure 4.1: Example of group determination for Figure 3.1.

For mapping from one schema to another, we consider every target schema group and write a rule for it by examining those groups of the source schema that are connected to it by arrows. Figure 4.1 shows the pairs of groups connected by arrows in our example.

4.2 Generating Tree Expressions

The next step is to generate tree expressions, the building blocks of the mapping rules. For each group g, we examine the subgraph of the original schema graph induced by the nodes in the group (not counting Key -> Keyref(s) edges). If the subgraph of the schema graph induced by g is a tree, we are ready to write the tree expression for that group. If it is a DAG, then we replicate each shared node (recursively) as many times as necessary to create a tree structure. An exception is when the DAG structure is the result of multiple nodes in a box sharing the same substructure. E.g., in the target schema BOS, Pulmonary, Coronary, etc., all share the same substructure and they are in a box. In this case, we do not replicate the shared elements. In our example, all source and target groups happen to induce trees. For each source group, the tree expression is written by essentially following the recursive tree structure, using [...]
to capture the nesting. For every node in the group, we write the expression Tag -> $var, where Tag is the tag name of the node and $var is a new variable. If the node is inside a box (e.g., like Pulmonary in the BOS schema), then we use a tag variable and write $Tag -> $var. E.g., for group sg2, we can write its tree expression as MonGenHosp -> $Mgh[Patient -> $P[Treat -> $T[Date -> $TDate, Desc -> $TDesc, Doc -> $Doc]]]. For target groups, we follow the same procedure. However, there is a major difference w.r.t. source groups, i.e., we do not know the node ids to be used in the generated tree expressions. Instead, we write the tree expression as a skeleton, leaving the node ids as '??' for now. As an example, the tree expression for tg1 would be MassGeneral -> ??[Admission -> ??[$Tag -> ??[@ID -> ??, ..., Patient -> ??]]]. The '??' will be filled in when we write the mapping rules. We drop a leaf node from consideration if there is no counterpart in the source schema. E.g., Doc is one such node. The module for creating trees and for writing tree expressions is straightforward and is not shown.

4.3 Generating Mapping Rules

The last step is writing the mapping rules. Figure 4.2 shows the algorithm. Consider each target group tg. Let {sgi, ..., sgj} be the set of source groups connected by arrows to tg. Let TE(g) denote the tree expression for group g. Applying step 1, we start with the rule skeleton TE(tg) <- TE(sgi), ..., TE(sgj). Based on the arrows incident on the leaf elements of tg, we fill in the variables corresponding to leaf positions in TE(tg), i.e., the right-hand sides of atoms corresponding to leaf nodes. E.g., for tg2, we would start with TE(tg2) <- TE(sg1). The rule body only contains TE(sg1) since that is the only source group connected to tg2 by arrows (see Figure 4.1). Based on the arrows, we can fill in the right-hand sides of Date and Desc in the rule head as $ED and $EP, respectively: MassGeneral -> ??[Progress -> ??[Symptom -> ??
[Date -> ??, Desc -> ??]]] <- MonGenHosp -> $M[Patient -> $P[Hist -> $H[Event -> $E[EProblem -> $EP, EDate -> $ED]]]].

DeriveMappingRules
Input: source groups and target groups
Output: a set of mapping rules
For every target group tg:
  Let {sgi, ..., sgj} be the set of source groups connected to tg by arrows.
  Let TE(tg) be the tree expression for group tg.
  1. Start with the rule skeleton TE(tg) <- TE(sgi), ..., TE(sgj).
     Fill in the variables corresponding to leaf positions in TE(tg)
     based on the arrows incident on the leaf elements of tg.
  2. For the root and each of its descendants reachable via only single-valued edges:
     Assign their ids as distinct Skolem functions of the root variable in the source.
  3. For each internal node inode:
     Assign its id as a distinct Skolem function of the variables associated with
     all its single-valued leaf descendants.
     If any of its single-valued leaf descendants sc does not belong to tg:
       Trace the source group sgp with an arrow pointing to sc.
       Add TE(sgp) to the rule body: TE(tg) <- TE(sgi), ..., TE(sgj), TE(sgp).
       Let scref denote the corresponding element of sc in the source schema.
       If scref points to a node scid in the source schema and the source group sgq
       that scid belongs to is not in {sgi, ..., sgj}:
         Add TE(sgq) to the rule body: TE(tg) <- TE(sgi), ..., TE(sgj), TE(sgp), TE(sgq).
         Equate the variable scref binds to ($scref) with the variable scid binds to
         ($scid): $scref = $scid/text() (if scid is an attribute, use $scref = $scid instead).
       Add $scref as an argument to the Skolem function on the RHS of inode.

Figure 4.2: Rule Construction Algorithm.

Next, according to step 2, for the root MassGeneral and its single-valued child Admission, we assign their ids as distinct Skolem functions of the root variable in the source ($Mgh).
Then we apply step 3: for each internal node v, we assign its id as a distinct Skolem function of the variables associated with all its mandatory single-valued leaf descendants. If a key is specified by the schema, we use the key instead. Additionally, if a variable is used for the tag of v, then this variable is also added as a Skolem argument. E.g., for the RHS of Symptom, we would use the Skolem function f5($EP, $ED), since no key is specified. For Progress, its (only) mandatory single-valued leaf child is @PatRef, which does not belong to tg2. By following the arrow incident on @PatRef, we trace source group sg3, so we introduce TE(sg3) in the rule body. Now, @PatRef in the source schema points to the ID attribute @ID of Patient, an attribute that does not belong to sg1 or sg3. However, @ID belongs to sg4, so we also add TE(sg4) to the above rule body. We equate the variable associated with the @ID attribute in TE(sg4) with the variable associated with @PatRef in sg3. At this point, the rule looks as shown in Figure 3.2.

4.4 Chapter Summary

So far we have presented how to automatically infer a set of SchemaLog-like rules for mapping instances of one DTD into instances of another DTD. These rules correspond easily to SchemaLog rules, and are easy to understand. One might ask what is significant or fundamental about the class of transformations that are captured by the rules. The key idea is that the rules capture a class of database tree transformations that are expressible using the operators unnest/nest (similar to those for nested relations), flip/flop (which basically change nesting orders in the schema), and merge/split (which have a flavor of grouping and "ungrouping" a set of nodes). It can be shown that the rules capture precisely the class of transformations expressible using these operators together with a few additional operators like node addition/deletion and tag modification, added for completeness purposes.
In the next chapter (Chapter 5), an algebraic characterization of the class of transformations captured by the mappings generated by this algorithm will be explained.

Chapter 5

Classes of Transformations

In Chapter 4, we presented a technique for inferring SchemaLog-like rules involving tree expressions that express the mapping between a pair of schemas, based on correspondences specified using boxes and arrows. How do we know that the mappings captured using the inferred rules are meaningful? What can we say about the class of mappings (transformations) that are captured by the rules? To answer these questions, we introduce a small set of simple but powerful algebraic operators for expressing transformations between (database instances of) schemas. We will eventually show that the rules inferred by our algorithm based on boxes and arrows correspond to complex expressions obtained by composing these algebraic operators. The operators consist of three main groups: (1) Unnest and Nest, (2) Flip and Flop, and (3) Merge and Split. Intuitively, Unnest and Nest convert a nested representation into a flattened (unnested) one and vice versa; Flip and Flop transform data instances to schema elements and vice versa; and Merge and Split change the way of grouping from one schema to the other. Additionally, we also allow individual node insertion, deletion, and tag modification for completeness. We illustrate all of them with examples and include a definition of one operator from each group. Throughout, we use upper-case letters (A, B, ...) for denoting element types and lower-case ones for denoting element instances of the corresponding types (e.g., a, ai are element instances of type A).

5.1 Unnest and Nest

The first group of operators converts a nested representation into a flattened or unnested one and vice versa. For example, Figure 5.1 shows two schemas.
The schema in Figure 5.1(b) is obtained by unnesting the element type Patient in Figure 5.1(a) on its child subelement Symptom. The associations between a patient and his/her symptoms are captured by a reference link from symptom to patient in Figure 5.1(b) via @Patref -> @ID. Conversely, nesting Symptom into Patient in the schema of Figure 5.1(b) yields the schema of Figure 5.1(a). Note that the transformation sketched in Figure 5.1 at the schema level induces a corresponding obvious transformation on the instances.

Figure 5.1: Illustrating Unnest/Nest.

We formalize these operators below.

Definition 1 [Unnest] Consider a schema S and two element type nodes A, B such that S contains the edge (A, B) with label l in {'*', '+'}. Suppose A has an ID attribute @id. Let D be a valid database instance of S. Then the unnest operator U_{A:B}(D) is an instance of the following schema S', obtained from S: B is made a child of the root instead of its current parent type A and the edge (root, B) is labeled l; B has an additional IDREF attribute, @ARef, which points to the ID attribute @id of A. The instance D' = U_{A:B}(D) is obtained from D as follows: (i) make every element instance b of B that is a child of an element instance a of A an immediate child of the root; (ii) set the value of the @ARef attribute of b to match the value of attribute @id of a.

Definition 2 [Nest] Consider a schema S and two element nodes A, B such that A contains an ID attribute, say @id, and B contains an IDREF attribute, say @ARef. Let D be a valid database instance of S. Then N_{A:B}(D) is an instance of the following schema S', obtained from S: B is made a child of the element type A and the @ARef IDREF attribute of B is deleted.
The instance D' = N_{A:B}(D) is obtained from D as follows: (i) make every B element b a child of the A element a such that the @ARef attribute of the former matches the @id attribute of the latter; (ii) delete the @ARef attribute of b.

Figure 5.2: Illustrating Flip/Flop: Schemas.

5.2 Flip and Flop

Intuitively, flip refers to the operation that takes information expressed as data and expresses it as metadata, and flop refers to the converse operation that takes metadata (element tags) and expresses it as data. Consider Figure 5.2. In Figure 5.2(a), there is a list of patients, for each of whom their id, name, and ailment are shown. In Figure 5.2(b), for each patient, the patient is a child of a new node whose tag is the value of the patient's ailment (e.g., 'Pulmonary', 'Coronary', etc.). The operation that transforms instances of Figure 5.2(a) to those of Figure 5.2(b) is referred to as flip. Its converse, flop, transforms instances of the schema of Figure 5.2(b) into corresponding instances of Figure 5.2(a).

Definition 3 [Flip] Consider a schema S and element type nodes A, B, C such that S contains the edges (A, B), (B, C), and C is a leaf. Let D be a valid database instance of S. Then the flip operator F_{B:C}(D) is an instance of the following schema S', obtained from S: (i) remove the edges (A, B) and (B, C), and for every value c of C in database D, create a node with tag c and add the edges (A, c), (c, B); (ii) the labels of the edges (A, c) are all identical to the label on the edge (A, B) in S, and the labels of the edges (c, B) are all '1'.
The instance D' = F_{B:C}(D) is obtained from D as follows: for each A element a: (i) let [b1, ..., bk] be the list of B subelements of a; delete the edges (a, bi); (ii) for each bi, for each C subelement c of bi, delete the edge (bi, c), create a new node c' in D' whose tag is the value of the node c in D, and make c' a child of a; make a copy of the element bi a subelement child of the node c'.

Note that in the database D, if a B element does not have a C subelement (say, because the (B, C) edge was labeled '?' in the schema), then this B element will not be present in the transformed database D'. Similarly, if a B element has k > 1 C subelements with values c1, ..., ck, then a copy of this B element (without the C subelements) will appear k times in D', once under each of the nodes associated with the ci's. In our example in Figure 5.2, the edge (Patient, Ailment) is labeled '1', so every patient in an instance of Figure 5.2(a) will be present exactly once in the transformed instance of Figure 5.2(b). Figure 5.3(a)-(b) illustrates the flip operator using instances of the schemas in Figure 5.2(a)-(b): the flip operator F_{Ailment:Patient} applied to the instance in Figure 5.3(a) yields the one in Figure 5.3(b). Here, the tag A corresponds to the root tag Records. Note that A is needed for the operator to be well defined but is not specified as a parameter in its specification.

Definition 4 [Flop] Let S be a schema containing element nodes A, C, b1, ..., bk and edges (A, bi), (bi, C), 1 <= i <= k. Suppose the label on the edges (A, bi) is l. Let D be a valid database instance of S. Let B be a new tag that does not appear in S.
Then C_{B:C}(D) is an instance of the schema S', obtained from S as follows: (i) remove all nodes bi and replace them with a single node with tag B; make this node a child of C; (ii) make C a child of A; (iii) the label on the edge (A, C) is l. The instance D' = C_{B:C}(D) is obtained from D as follows: for each C element c with parent B element b, let a be the A element parent of b; (i) delete b and make c a direct child subelement of a; (ii) create a new child subelement with tag B under c and set its text value to the tag of b.

Again, the flop operator is illustrated in Figure 5.3(a)-(b). Applying C_{Ailment:Patient} to the instance in Figure 5.3(b) yields the one in Figure 5.3(a). Here, the tag A corresponds to the root tag Records. Note that A is needed for the operator to be well defined but is not included as a parameter in the specification of the flop operator.

Figure 5.3: Illustrating Flip/Flop: Instances.

5.3 Merge and Split

Merge and Split are the transformations that capture how grouping is changed from one schema to the other. Consider the database in Figure 5.3(b). We could merge nodes that are siblings and have a common tag, and possibly satisfy some additional conditions. E.g., we could merge all Pulmonary nodes and all Coronary nodes, thus grouping together all subtrees that were rooted at these nodes in the input instance. The resulting schema and database instance are shown in Figure 5.4. Note that merge is associative and is thus well defined. Conversely, splitting Patient elements in the instance of Figure 5.4(b) would yield the instance in Figure 5.3(b).

Definition 5 [Merge] Let S be a schema containing element type nodes A, b1, ..., bk, C and edges (A, bi), (bi, C), 1 <= i <= k. Let D be a valid database instance of S.
Then the merge operator M_{A:C}(D) is an instance of the schema S', obtained from S as follows: (i) the label on the edges (A, bi) is set to '1'; (ii) the label on the edges (bi, C) is set to '+'. The instance D' = M_{A:C}(D) is obtained from D as follows: (i) let {b1, ..., bm} be the set of all parents of the C elements that are children of a common A element, say a; whenever any two nodes bi, bj have the same tag, merge them; (ii) subtrees rooted at bi, bj now get rooted at the merged node.

Of these, merging is akin to SQL group-by, except the aggregation is structural as opposed to being value aggregation. Splitting is the converse of merging and as such has no counterpart in SQL. Note that node merging is associative, so the merge operator above is well defined. A variation of the merge operator can be defined where the group-by criterion is not just tag equality but also equality of values of one or more leaf children of the candidate nodes to be merged. As an example, suppose we have a list of author elements, each author containing a name leaf sub-child and a book subelement. We could merge all author nodes that have the same name, thus achieving a group-by on author. We now turn to split.

Definition 6 [Split] Let S be a schema containing element nodes A, b1, ..., bk, C with edges (A, bi), (bi, C). Then S_{A:C}(D) is an instance of the schema S', obtained from S as follows: (i) the label on the edges (A, bi) is set to '+'; (ii) the label on the edges (bi, C) is set to '1'. The instance S_{A:C}(D) is obtained from D as follows: (i) for every A element a, for every B subelement child b of a, whenever b has n > 1 C subelement children c1, ..., cn, replace b with n copies of b; (ii) make ci the child of the i-th copy of b.

Figure 5.4: Illustrating Merge/Split: (a) schema; (b) instance.
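The instance-level effect of Merge and Split can be sketched on a toy encoding of sibling subtrees as (tag, children) pairs; this is our illustration of the idea, not HePToX code. Merge pools siblings with equal tags; Split undoes the pooling:

```python
def merge_siblings(siblings):
    """Merge sibling (tag, children) pairs with equal tags, pooling subtrees."""
    out = {}
    for tag, children in siblings:
        out.setdefault(tag, []).extend(children)
    return list(out.items())

def split_siblings(siblings):
    """Converse of merge: one copy of the parent per child subtree."""
    return [(tag, [c]) for tag, children in siblings for c in children]

# Figure 5.3(b)-style siblings: two Pulmonary subtrees and one Coronary
db = [("Pulmonary", ["p1"]), ("Coronary", ["p2"]), ("Pulmonary", ["p3"])]
merged = merge_siblings(db)
# -> [("Pulmonary", ["p1", "p3"]), ("Coronary", ["p2"])]
```

Note that merging here is structural, grouping whole subtrees under one parent, rather than aggregating values as an SQL group-by would.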
Applying the split operator S_{Records:Patient}(D) to the database instance in Figure 5.4(b) results in the instance of Figure 5.3(b).

5.4 Additional Useful Operators

Additional useful operators include addition or deletion of nodes, the nodes being internal or leaves, as well as renaming of tags. These atomic operations are only used when the transformation cannot be captured by any of the high-level operations defined above. Note that nest/unnest, flip/flop, and merge/split are not just macros containing atomic operations, but constitute the algebra of transformations, which reflects the graphical annotations provided by the user. Expressions over this algebra define database transformations. E.g., using compositions of the operators defined, we can transform instances of the MON schema in Figure 1.1 to those of the BOS schema and vice versa. Notice that we can transform instances of the MON schema in Figure 1.1 into those of the BOS schema by applying the following transformations in any order: (i) nest of Patient into Admission; (ii) unnest of Patient on Treat; (iii) flip of Problem and Pulmonary/Coronary; (iv) deletion of Doc and insertion of InsName.

5.5 Theorem

We have the following result concerning the class of transformations that are captured by our mapping rules.

Theorem 1 [Algebra to Rules]: Let E be any expression obtained by composing the algebraic operators introduced in this section, expressing a transformation from instances of a schema S1 to those of a schema S2. Then there is a set of mapping rules R that precisely captures E, i.e., for all D_I in inst(S1): R(D_I) = E(D_I).

Proof: [Algebra to Rules] This theorem is proven by an easy induction on the structure of the algebraic expression. Formally, we show that if an expression has i occurrences of operators, then there is a nonrecursive mapping rule that precisely captures this expression.
The basis is i = 1, that is, E is an expression obtained by a single operator Unnest, Nest, Flip, Flop, Merge, or Split. For each one of these operators, we can construct a set of mapping rules R that captures the operation.

Unnest: Reconsider Definition 1. For the expression obtained by a single Unnest operator U_{A:B}(D), we can construct the following mapping rules (note that the Unnest operator results in 2 rules):

1. root -> f_root($root)[A -> f_A($id)[@id -> $id]] <-
root -> $root[A -> $A[@id -> $id]]

2. root -> f_root($root)[B -> f_B(..., $ARef)[..., @ARef -> $ARef]] <-
root -> $root[A -> $A[@id -> $id, B -> $B[...]]], $id = $ARef.

Note that the ... in the rule body represent the subelements of B in the source schema, while the ... in the rule head represent the correspondences of the same set of subelements of B in the target schema. The Skolem arguments for f_B are the single-valued leaf descendants of B in the target schema (including $ARef). For example, the mapping rules that capture U_{Patient:Symptom}(D) for transforming schema Figure 5.1(a) to (b) are as follows:

1. Records -> f_Records($Records)[Patient -> f_Patient($ID, $Name)[@ID -> $ID, Name -> $Name]] <-
Records -> $Records[Patient -> $Patient[@ID -> $ID, Name -> $Name]]

2. Records -> f_Records($Records)[Symptom -> f_Symptom($Problem, $Date, $PatientRef)[Problem -> $Problem, Date -> $Date, @PatientRef -> $PatientRef]] <-
Records -> $Records[Patient -> $Patient[@ID -> $ID, Name -> $Name, Symptom -> $Symptom[Problem -> $Problem, Date -> $Date]]], $ID = $PatientRef.

Nest: Reconsider Definition 2. For the expression obtained by a single Nest operator N_{A:B}(D), we can construct the following mapping rule:

root -> f_root($root)[A -> f_A($id)[@id -> $id, B -> f_B(...)[...]]] <-
root -> $root[A -> $A[@id -> $id], B -> $B[..., @ARef -> $ARef]], $id = $ARef.

Similar to the rules for Unnest, the ... in the rule body represent the subelements of B in the source schema, while the ...
in the rule head represents the correspondences of the same set of subelements of B in the target schema. The Skolem arguments for f_B are the single-valued leaf descendants of B in the target schema. For example, the mapping rule that captures the operation N_Patient:Symptom(D) for transforming the schema of Figure 5.1(b) to (a) is as follows:

Records -> f_Records($Records)[Patient -> f_Patient($ID, $Name)[@ID -> $ID, Name -> $Name, Symptom -> f_Symptom($Problem, $Date)[Problem -> $Problem, Date -> $Date]]]
Records -> $Records[Patient -> $Patient[@ID -> $ID, Name -> $Name], Symptom -> $Symptom[Problem -> $Problem, Date -> $Date, @PatientRef -> $PatientRef]], $ID = $PatientRef.

Flip: Reconsider Definition 3. For the expression obtained by a single Flip operator F_B:C(D), we can construct the following mapping rule:

A -> f_A($A)[$C/text() -> f_$C(...)[B -> f_B[...]]]
A -> $A[B -> $B[..., C -> $C]]

Note that ... in the rule body represents the other subelements of B (other than C) in the source schema, while ... in the rule head represents the correspondences of the same set of subelements of B in the target schema. The Skolem arguments for f_C and f_B are the same: the single-valued leaf descendants of B in the target schema. For example, the mapping rule that captures the operation F_Ailment:Patient(D) for transforming the schema of Figure 5.2(a) to (b) is as follows:

Records -> f_Records($Records)[$Ailment/text() -> f_$Ailment($ID, $Name)[Patient -> f_Patient($ID, $Name)[@ID -> $ID, Name -> $Name]]]
Records -> $Records[Patient -> $Patient[@ID -> $ID, Name -> $Name, Ailment -> $Ailment]]

Flop: Reconsider Definition 4. For the expression obtained by a single Flop operator L_B:C(D), we can construct the following mapping rule:

A -> f_A($A)[C -> f_C(..., $B)[..., $B -> $B]]
A -> $A[$b -> $B[C -> $C[...]]]

Note that ... in the rule body represents the subelements of C in the source schema, while ...
in the rule head represents the correspondences of the same set of subelements of C in the target schema. The Skolem arguments for f_C are the single-valued leaf descendants of C in the target schema, possibly including $B if the cardinality of edge (C, $B) in the target schema is 1. For example, the mapping rule that captures the Flop of Ailment and Patient for transforming the schema of Figure 5.2(b) to (a) is as follows:

Records -> f_Records($Records)[Patient -> f_Patient($ID, $Name, $Ailment)[@ID -> $ID, Name -> $Name, Ailment -> $Ailment]]
Records -> $Records[$AilmentTag -> $Ailment[Patient -> $Patient[@ID -> $ID, Name -> $Name]]]

Merge: Reconsider Definition 5. For the expression obtained by a single Merge operator M_B:C(D), we can construct the following mapping rule:

A -> f_A($A)[$b -> f_$b($b)[C -> f_C[...]]]
A -> $A[$b -> $B[C -> $C[...]]]

As above, ... in the rule body represents the subelements of C in the source schema, while ... in the rule head represents the correspondences of the same set of subelements of C in the target schema. The Skolem arguments for f_C are the single-valued leaf descendants of C in the target schema. For example, the mapping rule that captures the operation M_Records:Patient(D) for transforming the schema of Figure 5.2(b) to Figure 5.4(a) is as follows:

Records -> f_Records($Records)[$AilmentTag -> f_$AilmentTag($AilmentTag)[Patient -> f_Patient($ID, $Name)[@ID -> $ID, Name -> $Name]]]
Records -> $Records[$AilmentTag -> $Ailment[Patient -> $Patient[@ID -> $ID, Name -> $Name]]]

Split: Reconsider Definition 6. For the expression obtained by a single Split operator S_A:C(D), we can construct the following mapping rule:

A -> f_A($A)[$b -> f_$b($b, ...)[C -> f_C[...]]]
A -> $A[$b -> $B[C -> $C[...]]]

Similarly, ...
in the rule body represents the subelements of C in the source schema, while ... in the rule head represents the correspondences of the same set of subelements of C in the target schema. The Skolem arguments for f_C are the single-valued leaf descendants of C in the target schema. The Skolem arguments for f_$b are the single-valued leaf descendants of C in the target schema plus the tag variable $b. For example, the mapping rule that captures the operation S_Records:Patient(D) for transforming the schema of Figure 5.4(a) to Figure 5.2(b) is as follows:

Records -> f_Records($Records)[$AilmentTag -> f_$Ailment($Ailment, $ID, $Name)[Patient -> f_Patient($ID, $Name)[@ID -> $ID, Name -> $Name]]]
Records -> $Records[$AilmentTag -> $Ailment[Patient -> $Patient[@ID -> $ID, Name -> $Name]]]

For the induction, consider an expression E = op(E'), where op is one of the operators Unnest, Nest, Flip, Flop, Merge and Split. By the inductive hypothesis, we can construct a set of mapping rules R' that captures the expression E', i.e., R'(D1) = E'(D1). The construction of rules capturing E = op(E') then proceeds as in view unfolding: we use R' to substitute for the expression under the single operator, unwinding rules until we eventually obtain a set of mapping rules R that captures the expression E, which is a composition of all these operators. In conclusion, we obtain: for all D1 in inst(Delta_1), R(D1) = E(D1).

One might ask: if we can insert/delete nodes individually, then we can map any database tree to any other, so what is the relevance of the other operations? The problem with such a "transformation" is that it cannot map a database instance to another in a way that preserves associations between data items. This is why the algebraic operators unnest/nest, flip/flop, and merge/split are needed.
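The inductive step views an algebraic expression as a pipeline of unary tree transformations, unfolded one operator at a time. The following is our own minimal illustration of that view: composing operators is just a left fold, shown here with two toy atomic operations (a tag rename standing in for the Doc-to-InsName change and a node deletion); the operator names and data layout are ours, not HePToX's:

```python
from functools import reduce

def compose(*ops):
    """Return E with E(D) = op_n(... op_1(D) ...) for unary transformations."""
    return lambda db: reduce(lambda acc, op: op(acc), ops, db)

# Toy atomic operations over a flat list-of-nodes encoding.
rename = lambda old, new: lambda db: [
    {**n, "tag": new} if n["tag"] == old else n for n in db]
drop = lambda tag: lambda db: [n for n in db if n["tag"] != tag]

# E first renames Doc nodes, then deletes Temp nodes.
E = compose(rename("Doc", "InsName"), drop("Temp"))
db = [{"tag": "Doc"}, {"tag": "Temp"}, {"tag": "Name"}]
result = E(db)
```

Here `result` is `[{"tag": "InsName"}, {"tag": "Name"}]`: the same answer is obtained by capturing each operator with its rules and unfolding, which is the content of the induction.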
The role played by node addition/deletion and tag modification is to capture those schema elements which have no counterpart in the other schema. In this chapter, we discussed the algebraic characterization of the class of transformations captured by the generated mapping rules. In the next chapter, we discuss how these mapping rules are used to translate queries.

Chapter 6

Query Translation of Tree Pattern Queries

So far we have established a semantic mapping between two schemas (Chapter 4); we now turn to translating queries between the two via the mapping. In this chapter, we address two questions concerning query translation: (1) Suppose we have a pair of peer XML database sources p_i, i = 1, 2, with DTDs Delta_i and underlying database instances D_i, and we issue a query Q against the DTD of p1 (p2); what does it mean for Q to be answered using the database of p2 (p1)? (2) Can we translate the query Q against the other peer's DTD (say Delta_2), such that the translated query Q' against Delta_2 evaluated on the database at p2 yields the correct answers w.r.t. the semantics captured by the answer to question (1)? We also wish to do the translation efficiently.

6.1 Query Translation Semantics

Suppose we have mapping rules mu mapping database instances of schema Delta_1 of one peer to those of schema Delta_2 of another peer, i.e., mu: Delta_1 -> Delta_2. Let inst(Delta) denote the set of database instances of Delta. We define the query translation semantics as follows.

Definition 2 (Semantics): Suppose Q_i is a query posed against Delta_i, i = 1, 2. Let Q_i^t denote a translation of Q_i against Delta_j, j != i. Then Q_2^t is correct provided for all D1 in inst(Delta_1): Q_2^t(D1) = Q_2(mu(D1)).
The translation Q_1^t is correct provided for all D2 in inst(Delta_2): Q_1^t(D2) is the intersection of Q_1(D1) over all D1 with mu(D1) = D2.

The translation Q_2^t is correct provided Q_2 applied to the transformed instance mu(D1) and Q_2^t applied to D1 both yield the same results, for all D1 in inst(Delta_1). Note that in this case the direction of translation is against that of the mapping mu; we henceforth call this backward query translation. Translating a query Q1 posed against Delta_1 to the schema Delta_2 of peer p2 is aligned with the direction of the mapping mu; we call this direction forward translation. Intuitively, backward translation is similar to view expansion and is the easier of the two. The complication for forward translation is that mu, the mapping that transforms instances of Delta_1 to those of Delta_2, may not be invertible. Thus, we define the semantics of query answering based on certain answers over all possible pre-images D1 for which D2 = mu(D1).

6.2 XQuery Fragment Considered

The fragment of XQuery we consider in this chapter corresponds to queries expressible as joins of tree patterns (TPs) (see [2] for a definition), where the return arguments correspond to leaf nodes of the database. We note that even for this simple fragment of XQuery, query translation is far from trivial.

6.3 Forward Query Translation Algorithm

6.3.1 Translating Tree Patterns

We will use the following XQuery Q1, stated against the MON schema (Figure 1.1), as a running example to illustrate forward translation of XQuery: "Find all patients with Admission/Problem = 'Coronary' whose Treatment started Dec. 25th, 2003", expressed as:

Q1: FOR $Adm IN //Admission,
        $Prob IN $Adm/Problem/text(),
        $RefVal IN $Adm/@PatRef,
        $Pat IN //Patient,
        $PID IN $Pat/@ID,
        $PDate IN $Pat/Treat/Date/text(),
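The certain-answers semantics above can be made concrete with a small sketch (our own toy encoding, over an enumerable universe of instances; `mu`, the query functions and the instances are stand-ins, not HePToX structures): a tuple is a certain answer for D2 exactly when it is returned on every pre-image D1 with mu(D1) = D2.

```python
def certain_answers(q1, mu, candidates, d2):
    """Intersection of q1(D1) over every candidate pre-image D1 with mu(D1) = d2."""
    answer_sets = [set(q1(d1)) for d1 in candidates if mu(d1) == d2]
    if not answer_sets:
        return set()
    return set.intersection(*answer_sets)

# mu forgets the ward field, so two distinct instances map to the same D2.
mu = lambda d1: frozenset(name for name, _ward in d1)
candidates = [
    frozenset({("Ann", "ER"), ("Bo", "ICU")}),
    frozenset({("Ann", "ICU"), ("Bo", "ER")}),
    frozenset({("Cy", "ER")}),
]
d2 = frozenset({"Ann", "Bo"})

# "all names" is certain; "names of ER patients" is not, since the ward
# differs across the two pre-images of d2.
names = certain_answers(lambda d: {n for n, _ in d}, mu, candidates, d2)
er = certain_answers(lambda d: {n for n, w in d if w == "ER"}, mu, candidates, d2)
```

`names` evaluates to `{"Ann", "Bo"}`, while `er` is the empty set: the non-invertibility of `mu` is exactly what forces the intersection in the definition.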
        $PName IN $Pat/Name
    WHERE $Prob = "Coronary" AND $PDate = "12/25/2003" AND $RefVal = $PID
    RETURN $PName

We represent this query as (a join of) two tree patterns, as illustrated in Figure 6.1. In the rest of this section, we focus on translating single tree patterns; translation of joins of TPs is discussed later. Our algorithm translates a tree pattern for each relevant mapping rule by applying two main steps: (i) expansion and (ii) translation. After the tree patterns have been translated w.r.t. all relevant mapping rules, they undergo a (iii) stitching phase and possibly a (iv) contraction phase. We detail each phase in the following.

Expansion

Let t be a TP and r: h_r <- b_r be a mapping rule. We first find a matching of t to b_r, which is a substitution theta that maps variables in b_r to those in t.

[Figure 6.1: Join of two Tree Patterns. TP 2 binds //$Pat with children $PID (@ID), $PName (Name) and, under a Treat node, $PDate (Date), with the constraints name($Pat) = Patient, name($PDate) = Date, and $PDate/text() = "12/25/2003". Join constraint: $RefVal = $PID.]

This substitution theta may be partial, since b_r may contain components which have no counterpart in t, and not all nodes in t may be in the range of theta. If we find a non-empty substitution theta, we check whether any variable $X in b_r such that theta($X) is a leaf variable of t appears in the rule head h_r. If not, then r is not relevant for translating t. E.g., Figure 6.1 shows (the join of) two TPs, t1 and t2. Ignore the join condition for now. Consider the mapping rules in Figure 3.2 again. Rules 1, 3 and 4 are relevant for t1, while rules 1 and 3 are relevant for t2. For instance, rule 2 is irrelevant for both TPs since none of the variables corresponding to the TP variables (via any substitution) appear in the head of rule 2. Let us next match the TP t1 (reproduced in Figure 6.2(a)) to the body of rule 1. The corresponding expansion is shown in Figure 6.2(b).
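The matching step can be sketched as follows (our own simplified encoding of patterns as nested dicts with a `tag` and a `var`; the tags and variables are taken from the running example, but the function is an illustration, not the HePToX implementation). Walking the rule body and the tree pattern top-down in tandem yields the partial substitution theta; body components with no counterpart in t simply stay unbound:

```python
# Find a (possibly partial) substitution from rule-body variables to TP
# variables by matching the two trees on tags, top-down.
def match(body, tp, subst=None):
    subst = subst if subst is not None else {}
    if body["tag"] != tp["tag"]:
        return None
    subst[body["var"]] = tp["var"]
    for bc in body.get("children", []):
        for tc in tp.get("children", []):
            match(bc, tc, subst)  # unmatched body children remain unbound
    return subst

body = {"tag": "Admission", "var": "$A", "children": [
    {"tag": "Problem", "var": "$AP"},
    {"tag": "AdmDate", "var": "$AD"}]}
tp = {"tag": "Admission", "var": "$Adm", "children": [
    {"tag": "Problem", "var": "$Prob"}]}

theta = match(body, tp)
```

Here `theta` is `{"$A": "$Adm", "$AP": "$Prob"}`: $AD has no counterpart in t and stays out of the substitution, which is precisely what makes theta partial.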
We expand the TP to correspond to the rule body so it can be matched to it. More precisely, we find a substitution (partially) from the body of rule 1 to t. The part of the expanded TP that was originally present in t is highlighted in Figure 6.2(a), using nodes shown in dark circles. We call (the variables associated with) all other nodes in Figure 6.2(a) dummy nodes. E.g., in Figure 6.2(a), dummy nodes/variables are $AD, $M, $P, etc. For clarity, in the figure, we show edges leading to dummy nodes as dashed grey lines. At this point, any distinguished node present in the original TP is also identified as such in the expanded TP, and distinguished nodes are tracked as such through the steps of the query translation algorithm.

Translation

Next, we translate the expanded TP by applying the rule to it. The correspondence between the variables in the original TP and those in the expanded TP (i.e., the rule body) is kept track of by means of a substitution between the two.

[TP 1 of Figure 6.1 roots at //$Adm with children @PatRef -> $RefVal and $Prob, with constraints name($Adm) = Admission, name($Prob) = Problem, and $Prob/text() = "Coronary". The panels of Figure 6.2 (garbled in this extraction) show the expansion and translation of t1 w.r.t. rules 1, 3 and 4, and the stitched result.]
Figure 6.2: Translation of TP t1 from Figure 6.1.

Figure 6.2(b) shows the substitution for TP t1 as {$A -> $Adm, $AP -> $Prob, $PR -> $RefVal}. Using this substitution, the original query constraints are propagated through the translation. E.g., the translated query resulting from applying rule 1 to the expanded TP above is shown in Figure 6.2(c). For readability, all tag constraints are shown concisely by writing the tags (in grey) next to the appropriate nodes. Note how the constraint $Prob/text() = "Coronary" is propagated via the substitution as $AP/text() = "Coronary". Additionally, the condition $PR = $ID in the body of rule 1 is used to infer that the attribute child @ID of the node $AP/text() in Figure 6.2(c) corresponds to the attribute child @PatRef of the node $A in Figure 6.2(b). Note that some of the nodes have Skolem terms associated with them; they play a key role in the stitching phase. Figures 6.2(d)-(e) show the translated query pieces obtained from t1 via rules 3 and 4, respectively. Figure 6.3(a) reproduces the TP t2, while Figures 6.3(b)-(d) show its translated query pieces obtained via rules 1, 3 and 4. Distinguished nodes are shown with a box surrounding their variable (e.g., $N, $N2 in Figure 6.3).

Stitching

At this stage, we have obtained translated pieces of a TP via various mapping rules. They need to be stitched together by identifying nodes and possibly adding equalities between leaf variables. Two nodes are identifiable provided they have the same tag and the Skolem terms denoting their node ids are unifiable. Consider the translated pieces associated with t1 (see Figures 6.2(c)-(e)). It is easy to see that the two MassGeneral nodes and the two Progress nodes are identifiable: the unification is done via an identity substitution. The stitching of those trees based on this node identification yields the TP in Figure 6.2(f).
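The identifiability test can be sketched as follows. This is our own simplified encoding: a node id is a Skolem term, a pair of function name and argument tuple, and variables are strings starting with `$`; the unification is first-order without occurs check or nested terms, which suffices for the flat Skolem terms produced by the rules:

```python
# Unify two flat Skolem terms (fname, args); variables start with '$'.
def unify(t1, t2):
    (f1, args1), (f2, args2) = t1, t2
    if f1 != f2 or len(args1) != len(args2):
        return None
    subst = {}
    for a, b in zip(args1, args2):
        a, b = subst.get(a, a), subst.get(b, b)
        if a == b:
            continue
        if isinstance(a, str) and a.startswith("$"):
            subst[a] = b
        elif isinstance(b, str) and b.startswith("$"):
            subst[b] = a
        else:
            return None  # two distinct constants: not unifiable
    return subst

def identifiable(n1, n2):
    """Same tag and unifiable node-id Skolem terms."""
    return n1["tag"] == n2["tag"] and unify(n1["id"], n2["id"]) is not None

a = {"tag": "Progress", "id": ("f4", ("$PR1",))}
b = {"tag": "Progress", "id": ("f4", ("$PR2",))}
c = {"tag": "Treatment", "id": ("f6", ("$TDate", "$TDesc"))}
```

Here `a` and `b` are identifiable (unifier `{"$PR1": "$PR2"}`), while `a` and `c` are not: their tags and Skolem functions differ.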
The key/keyref constraint between $ID and @PatRef is used to infer the equality $PR = $ID, which is added as a condition in Figure 6.2(f). Figures 6.3(b)-(d) show the translated pieces of t2 (Figure 6.3(a)) w.r.t. rules 1, 3 and 4. The result of stitching them is shown in Figure 6.3(e). Observe how the three MassGeneral nodes and the two Progress nodes have been identified and the equality $PR = $ID has been added to the conditions.

[Figure 6.3: Translation of TP t2 from Figure 6.1. (The figure content is garbled in this extraction.)]

Contraction

In this step, we drop dummy nodes from the translated query. A node in a translated TP is dummy provided it is either a leaf corresponding to a dummy node in the original expanded source TP (before translation), or an internal node all of whose children are dummy. In Figure 6.2(f), the dummy nodes (e.g., $M, $TDate, $TDesc, f6($TDate, $TDesc)) are highlighted by greying out the edges leading to them. Note that f6($TDate, $TDesc) is dummy since both its children are. The translated TP corresponding to t1 is simply the TP in Figure 6.2(f) with all dummy nodes dropped and all Skolem terms dropped and replaced by the tags associated with those nodes.
Similarly, in Figure 6.3(e), the nodes $M, $AD, $DD and $TDesc are dummy. The translated TP corresponding to t2 is just the TP in Figure 6.3(e) with the dummy nodes dropped and the Skolem terms dropped and replaced by the tags associated with those nodes.

6.3.2 Translating Join of Tree Patterns

[Figure 6.4: Translation of the join of TPs from Figure 6.1: (a) some pairs of nodes in tt1 and tt2 being merged; (b) the final merged translated TP.]

Revisit the XQuery query Q1 and its representation as a join of TPs shown in Figure 6.1. The translation of this query is done by first translating each of the TPs and then adding in the join condition. We rename the variables apart across translated TPs to avoid conflict. Figure 6.3(f) shows the two individual translated TPs (denoted tt1 and tt2 in the figures) corresponding to t1 and t2, along with the join condition. Variables in the two trees are renamed apart. The original join condition was $PR = $ID, which after variable renaming becomes $PR2 = $ID1. Notice that the dummy nodes present in the translations of t1 (Figure 6.2(f)) and t2 (Figure 6.3(e)) are dropped in Figure 6.3(f). To get the final translated query, we need to detect nodes across the two tree patterns tt1 and tt2 that need to be merged. This can happen because of constraints present in the query. E.g., the join condition $PR2 = $ID1, together with the equality $PR2 = $ID2, implies $ID1 = $ID2. Since the node $ID1 corresponds to a key (of Coronary, Pulmonary, etc.)
according to the BOS schema, this means that the two @ID nodes in tt1 and tt2 whose values have been equated must be identical as nodes. Since the database is tree-structured, the parents f3($AP1/text(), $ID1, $M1, $AD1, $DD1, $N1) and f3($AP2/text(), $ID2, $M2, $AD2, $DD2, $N2) of the two @ID nodes must also be the same. This in turn induces the constraint $AP1/text() = $AP2/text(), and hence $AP2/text() = 'Coronary'. By similar reasoning, the equalities $PR2 = $ID1 and $ID1 = $PR1 imply $PR2 = $PR1, and hence the two Progress nodes in Figure 6.2(f) with node ids f4($PR1) and f4($PR2) are identified; the two @PatRef nodes, being single-valued children of their parents, are identified in turn. In general, whenever two nodes are identified, so are their parents in the two trees. This process is suggestively shown in Figure 6.4(a). The final merged TP, with Skolem terms dropped, is shown in Figure 6.4(b).

6.3.3 Translating TPQ to XQuery

Let us first define the relevant nodes to be the nodes in a TPQ that are distinguished (nodes specified in the RETURN clause), carry node constraints, or are needed for a join (specified in the WHERE clause). These nodes are relevant because they carry important constraint information which needs to be propagated via the mappings to the other schema, or must be returned in the final translated XQuery. The algorithm for translating a TPQ to XQuery makes use of these relevant nodes: we first find the least common ancestors (LCAs) of the relevant nodes and bind both the LCAs and the relevant nodes in the FOR clause of the translated XQuery. We need the LCAs to identify the common ancestors of multiple relevant nodes. For every pair of relevant nodes we find their LCA; thus, when there are more than two relevant nodes in a tree pattern, more than one LCA may be found. Among these LCAs, one LCA may also be an ancestor of another LCA.
For both the LCAs and the relevant nodes, when we specify them in the FOR clause, we need to bind each variable with an XPath expression. Here we describe which path expression we use for each node. If an LCA has no other LCA as an ancestor, the XPath expression used is the path from the root to this LCA; if an LCA is a descendant of another LCA, the XPath expression used is the path from the LCA ancestor to itself. For each relevant node, if it is itself an LCA with no other LCA ancestor, or if it has no LCA, the XPath expression used is the path from the root to itself; if it does have an LCA, the XPath expression used is the path from its LCA to itself. We then specify the node constraints from the final translated TP, together with the join constraints, in the WHERE clause of the XQuery, and spell out the distinguished nodes in the RETURN clause. Note that for now we do not consider any restructuring of the RETURN clause relative to the original query; this restructuring will be explored in HePToX's future research. Following the approach explained above, the XQuery corresponding to the final merged TP obtained in Section 6.3.1 is as follows:

Q1t: FOR $L1 IN /MassGeneral,
         $L2 IN $L1/Progress,
         $RefVal IN $L2/@PatRef,
         $PDate IN $L2/Treatment/Date/text(),
         $L3 IN $L1/Admission/Coronary,
         $PName IN $L3/Patient/text(),
         $PID IN $L3/@ID
     WHERE $PDate = "12/25/2003" AND $RefVal = $PID
     RETURN $PName

6.3.4 Forward Query Translation Algorithm

Figure 6.5 shows our translation algorithm for the forward direction (the same direction as the mapping). The algorithm translates one TP at a time. For each rule which is relevant to a given TP, it expands the TP and translates it.
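With nodes identified by their root-to-node paths, the pairwise LCA computation above reduces to longest common path prefixes. The following is our own sketch, run on the relevant nodes of the merged TP behind Q1t (paths are illustrative tag sequences, not HePToX structures):

```python
# LCA of two nodes given as root-to-node tag paths: longest common prefix.
def lca(path1, path2):
    common = []
    for a, b in zip(path1, path2):
        if a != b:
            break
        common.append(a)
    return tuple(common)

def all_lcas(paths):
    """Pairwise LCAs of the relevant nodes; may contain nested LCAs."""
    return {lca(p, q) for i, p in enumerate(paths) for q in paths[i + 1:]}

paths = [
    ("MassGeneral", "Progress", "@PatRef"),
    ("MassGeneral", "Progress", "Treatment", "Date"),
    ("MassGeneral", "Admission", "Coronary", "Patient"),
]
lcas = all_lcas(paths)
```

`lcas` contains both `("MassGeneral",)` and `("MassGeneral", "Progress")`, i.e., one LCA is an ancestor of the other, which is why Q1t binds $L1 from the root and $L2 relative to $L1.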
Each rule may translate only a piece of the TP in general, since the matching between the TP and the rule's body may be partial. The algorithm then stitches the translated pieces obtained via the various rules using the stitching procedure explained above. It then joins the translated TPs. This "join step" involves variable renaming, a chase procedure for identifying nodes across the translated TPs, and the final replacement of Skolem terms by the tags associated with the nodes. Finally, the resulting TP (or join of TPs) is translated to XQuery, a step that is straightforward and explained in Section 6.3.3.

6.3.5 Backward Query Translation

Backward Query Translation Example

So far we have seen an example of translating an XQuery posed on the source schema along the mapping rule direction (from rule body to rule head) to obtain a translated XQuery which conforms to the target schema. Now we illustrate the other direction: translating an XQuery posed on the target schema, using the mapping rules from rule head to rule body, to obtain a translated XQuery which conforms to the source schema. When Q is a query expressed against the DTD Delta_2, the key intuition for query translation is to essentially follow the mapping rules in the reverse direction, i.e.,

Algorithm TransForward.
Input: a join of tree pattern queries Q_i; a mapping rule set from Delta_1 -> Delta_2.
Output: one translated query in the form of a join of TPs.
1. For each TP Q_i, translate it: let Q'_i = TransTPF(Q_i).
2. Rename the variables apart in the translated TPs.
3. Add join conditions to the set of translated TPs (with variable renaming).
4. Chase the set of TPs along with the schema key constraints, the join conditions and other query conditions, merging nodes as necessary.
5. Obtain contractions of the previous query Q' by recursively removing appropriate dummy nodes bottom up.
   Let Q^t be the resulting query obtained by replacing Skolem terms with associated node tags.
6. Translate the final merged TP (or set of TPs) Q^t into XQuery.

Procedure TransTPF.
1. For each mapping rule mu_j = h_j <- b_j:
   a. Find a substitution and expand Q_i to Q^e_ij by matching the rule body b_j to the query pattern Q_i.
      For each node in Q^e_ij and not in Q_i, mark it as dummy.
      If it is a distinguished node in Q_i, mark it as distinguished in the expanded query Q^e_ij.
   b. Translate each Q^e_ij into Q^t_ij by applying mu_j, marking dummy and distinguished nodes accordingly.
2. Stitch together all Q^t_ij to obtain the resulting query Q^t_i. Nodes with the same or unifiable Skolem terms get stitched, by looking at the corresponding substitutions.

Figure 6.5: Forward Query Translation Algorithm

from the head to the body. This has resemblances to query folding and answering queries using views [21]. However, the presence of Skolem functions greatly simplifies this process: the node ids act as a "glue" suggesting which subelement pieces should be associated together. Consequently, they drive exactly which mapping rule bodies we need to "join" together to rewrite the given query. We illustrate this with an example. Consider the XQuery posed against the target schema (Figure 1.1(b)): "Find symptoms of patients admitted with 'Pulmonary' (ailment)".

Q2: FOR $Pulm IN //Pulmonary,
        $PID IN $Pulm/@ID,
        $Prog IN //Progress,
        $Ref IN $Prog/@PatRef,
It is easy to see that only.the heads of rule 1 in Figure 3.2 can be (partially) matched to TP 1 in Figure 6.6, and only the rule heads of rule 2 and 4 can be matched to TP 2 in Figure 6.6.1 The expanded TPs are obtained by matching the TPs against each of these rule heads and are shown in Figure 6.7(la), Figure 6.8(2a) and Figure 6.9(3a). In essence, as in the forward direction of translation, each expanded TP is a rule head expressed as a tree expression. From the names of variables and tags the substitution between query variables and rule variables should be implicitly clear. We also mark the variables that appeared in the original query. E.g., in Figure 6.7(la), we know SPulm is associated with the node f3($AP/text{), $ID, SM, $N, SAD, SDD), SID with the node SID and we also know that the tag variable in rule head SAP/text () matches "Pulmonary" in expansion. In Figure 6.8(2a), we have SProg associated with f2{$PR), SSymp with f3($EP,$ED), and SPDesc with SEP. Finally, in Figure 6.9(3a), we have SProg associated with f2{$PR), and also SRef with SPR. Next, each expanded TP is replaced by the tree expression in the corre-sponding rule body ( Figure 6.7(lb), Figure 6.8(2b), and Figure 6.9(3b) ). In doing so, we again keep track of variables mentioned in the original query (only leaf variables and tag variables are used here, since for non-leaf nodes which all correspond to Skolem functions, there are no corresponding Skolem functions 1 Actually, in the head of rule 3 just Progress can be matched to the query, but this is subsumed by matches to other rule heads and is redundant. TP 1: //$Pulm @ID $P name( SPulm ) = Pulmonary & name( $PID ) = ID Chapter 6. Query Translation of Tree Pattern Queries 38 TP 1: @ID K I M 1$'IBI Expand w.r.t. 
Mapping Rule 1 (1a) | $AP/lext() = "Pulmonary' MassGeneral-»f1($Mgh) Admission-* f2( $Mgh) :' Pulmonary-^lPulmj E@'iB3'|PiP:.j Name->$N Enter-*$AD \ Policy#->$M Leaved $DD • Substitutions: Pulmonary-M3(SAP/text(), $ID, $M, $AD, SDD, $N) |-> $Pulm SID |-> SPID Translate w.r.t. MR 1 T <ib) [M^inH6sp->$Mgh PPatient^SPat" | jAdmission-».$Aj t l S l f M S i l E y / j | I Problem-* SAP \ DisDate-* $DD MedCr#-»$M ^ l l ^ ^ M J A d m D a t e ^ $ A D SPID = SPR Remove Dummy Leaves 1 (1c) [ "MonlSliffloslSSMii ,Patient-»$Pat SPID = SPR Figure 6.7: Translation of TP 1 in Figure 6.6 Chapter 6. Query Translation of Tree Pattern Queries 39 TP 2: P!B5SSgg»lgoql jSjRlRefySReM Sylikn-llSyia ^ Expand w.r.t. Mapping Rule 2 (2a) MassGeneral-»f1($Mgh) Substitutions: f2($PR) |-> $Prog, f3($EP, $ED) |-> $Symp, SEP |-> SPDesc Translate w.r.t. MR 2 m m Admission-* $A _ _ @ID^$PID 1 j—J @PatRef-» SPR [Event~>$El „ | $PID = $PR l l roEieng$PDescl l Date-> $ED I Remove Dummy Leaves (2c) M5riGenHosp-»"$Mgh*1 fPatient-^ SPat 1 Histr>$H BW6iem^$PbWl Figure 6.8: Translation of TP 2 in Figure 6.1 with rule 2 in Figure 3.2 Chapter 6. Query Translation of Tree Pattern Queries 40 TP 2: //Progress-^ $P/og~I E@Pa.tRef-* $R§Li J3yJEPtQnv£ Expand w.r.t. Mapping Rule 4 (3a) MassGeneral-»f1($Mgh; Substitutions: f2($PR) |-> $Prog, $PR |-> $Ref i Translate w.r.t. MR 4 (3b) [M6n$enHospf$Mg!r; Mrnissi l -y 'SAn KlPmilDI i @ e l i e f ^ l K e l r$PID = $Ref I rPatienf»$PaP No Dummy Leafs to Remove (3c) |Mo'nQgn6ospiJ> $Mgh., ["Patient-£$PiP""""""""""Ai^^ SA © I D W J D I lPa tRe f -> SRcf $PID = $Ref Figure 6.9: Translation of TP 2 in Figure 6.1 with rule 4 in Figure 3.2 Chapter 6. Query Translation of Tree Pattern Queries 41 (2c) ; JonGenH6'sp^£$Mgfil Patient-* S P a t l Hist->$H ! .Evenf-*$E r Problem-* SPDesc i (3c) | K i ^ H o s £ ^ $ M | h ' < EPatient :?$Pat ! 
[Figure 6.10: Stitching Translated TP Pieces for TP 2 in Figure 6.6. The pieces (2c) and (3c) are stitched into (TP2-S), carrying the condition $PID = $PR.]

in the rule body). For example, in Figure 6.7, the tag variable $AP in the expanded TP against the rule head (i.e., (1a)) corresponds to the leaf node variable $AP in the translated TP against the rule body (i.e., (1b)). All the other correspondences are shown in Figures 6.7, 6.8 and 6.9. For optimization purposes, we drop dummy leaf nodes in the next step (this is only an optimization, and thus is not shown in the algorithm outline). Dropping dummy nodes generates simplified but equivalent TPs. The idea is very similar to that adopted for the forward direction of translation. However, note that here we do not have any Skolem functions in the translated TP; thus, we cannot use the Skolem arguments to determine whether a non-leaf node is dummy, as we did before (recall that dummy nodes are redundant nodes that will not be used in later phases of translation, and are dropped for optimization purposes). By default, all non-leaf nodes in the translated TP are non-dummy in this case, and we only mark a non-leaf node dummy if all of its children/descendants are dummy nodes. After each TP is translated using each mapping rule, we stitch the different translated pieces together as explained before. Nodes in different TPs that correspond to the same query variable are stitched together. Only rule 1 is used
[Figure 6.11 diagram content is not recoverable from the extraction.]

Figure 6.11: Merging Figure 6.7(1c) and Figure 6.10(TP2-S)

to translate TP 1, so no stitching is needed for TP 1; the stitching process for the two translated pieces for TP 2 is shown in Figure 6.10. Finally, we use the same method that we used to translate a query from the source (rule body) to the target (rule head) schema to merge the translated TP 1 and the stitched translated TP 2 (denoted Figure 6.10 (TP2-S)), obtaining the final translated TP in Figure 6.11, which is actually a join of two TPs. Following the method explained in Section 6.3.3, we translate the final TP in Figure 6.11 into XQuery as follows:

Q2t: FOR $L1 IN /MonGenHosp,
         $L2 IN $L1/Patient,
         $PID IN $L2/@ID,
         $PDesc IN $L2/Event/text(),
         $L3 IN $L1//Admission,
         $AP IN $L3/Problem/text(),
         $Ref IN $L3/@PatRef
     WHERE $AP = "Pulmonary" AND $Ref = $PID
     RETURN $PDesc

Backward Query Translation Algorithm

The algorithm outlining the steps explained in the previous section is shown in Figure 6.12.

6.4 Correctness of Query Translation

Suppose we have mapping rules μ mapping database instances of schema A1 of one peer to those of schema A2 of another peer, i.e., μ : A1 → A2. Let inst(A) denote the set of database instances of A. We have the following result concerning the correctness of the algorithm w.r.t. the query answering semantics defined in Definition 2.
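The certain-answer semantics referred to in Definition 2 (intersecting the answers over all pre-images of a target instance) can be illustrated with a toy sketch, where plain Python sets stand in for databases and query results; all names here are illustrative, not part of HePToX:

```python
def certain_answers(q, preimages):
    """Answers that hold in q(D1) for every pre-image D1 with mu(D1) = D2."""
    results = [set(q(d)) for d in preimages]
    out = results[0]
    for r in results[1:]:
        out &= r  # keep only answers common to every pre-image
    return out

# Two candidate pre-images that the (possibly non-invertible) mapping
# sends to the same target instance:
d1a = {"Alice", "Bob"}
d1b = {"Alice", "Carol"}
identity_query = lambda d: d
print(certain_answers(identity_query, [d1a, d1b]))  # {'Alice'}
```

Only "Alice" is certain, since it is the only answer produced on every possible pre-image.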
Theorem 3 [Correctness of Forward Query Translation]: For the fragment of XQuery defined in Section 6.2, the forward query translation algorithm in Figure 6.5 is correct w.r.t. the semantics of query answering defined in Definition 2. Suppose Q1 is a query posed against A1, and let Q1^t denote a translation of Q1 against A2. Then the translation Q1^t is correct provided

    ∀D2 ∈ inst(A2) : Q1^t(D2) = ∩_{D1 : μ(D1) = D2} Q1(D1).

Proof Sketch: [Correctness of Forward Query Translation] Intuitively, forward query translation is similar to query answering using views (known as QAV). The complication here is that μ, the mapping that transforms instances of A1 to those of A2, may not be invertible. Thus, the semantics of query answering is defined based on certain answers over all possible pre-images D1 for which D2 = μ(D1). The set of mapping rules we have is like a set of views available to answer a query. By our rule generation algorithm, we supply bindings for a given leaf element or attribute of the target schema A2 in only one mapping rule;

Algorithm TransBackward.
Input: a join of tree pattern queries Q_i, a mapping rule set from A1 → A2.
Output: one translated query in the form of a join of TPs.
1. For each TP Q_i, translate it: let Q'_i = TransTPB(Q_i).
2. Rename the variables apart in the translated TPs.
3. Add join conditions to the set of translated TPs (with variable renaming).
4. Chase the set of TPs along with the schema key constraints, the join conditions
   and other query conditions, merging nodes as necessary.
5. Obtain contractions of the previous query Q' by recursively removing appropriate
   dummy nodes bottom up. Let Q' be the resulting query obtained by replacing
   Skolem terms with the associated node tags.
6. Translate the final merged TP (or set of TPs) Q' into XQuery.

Procedure TransTPB:
1. For each mapping rule μ_j = h_j ← b_j:
   a.
Find a substitution and expand Q_i to Q_ij^e by matching the rule head h_j
      to the query pattern Q_i.
      For each leaf node in Q_ij^e and not in Q_i, mark it as dummy.
      If it is a distinguished node in Q_i, mark it as distinguished in the
      expanded query Q_ij^e.
   b. Translate each Q_ij^e into Q_ij^t by applying μ_j.
      Mark dummy and distinguished nodes accordingly.
2. Stitch together all Q_ij^t to obtain the resulting query Q_i^t. Nodes with the
   same or unifiable Skolem terms get stitched by looking at the corresponding
   substitutions.

Figure 6.12: Backward Query Translation Algorithm

thus when we translate Q1 against source A1 to target A2, for a given tree pattern there is only a unique set of mapping rules (like views) that need to be joined to rewrite the tree pattern query. For a query Q1 against the source, it may match a set of rules (i.e., views), and we obtain a unique rewriting w.r.t. each rule. For a join of tree pattern queries, there is a unique set of relevant mapping rules that need to be joined to rewrite Q1 against source A1 into Q1^t against A2. Because of the unique correspondences, translating Q1 to Q1^t is like joining the appropriate views. Thus we have ∀D2 ∈ inst(A2) : Q1^t(D2) = ∩_{D1 : μ(D1) = D2} Q1(D1). □

Theorem 4 [Correctness of Backward Query Translation]: For the fragment of XQuery defined in Section 6.2, the backward query translation algorithm in Figure 6.12 is correct w.r.t. the semantics of query answering defined in Definition 2. Suppose Q2 is a query posed against A2, and let Q2^t denote a translation of Q2 against A1. Then Q2^t is correct provided ∀D1 ∈ inst(A1) : Q2^t(D1) = Q2(μ(D1)).

Proof Sketch: [Correctness of Backward Query Translation] Intuitively, backward query translation is also similar to query answering using views (known as QAV). The complication here is that μ, the mapping that transforms instances of A1 to those of A2, may not be invertible.
Thus, we must reason about every source instance D1 and its image μ(D1). The set of mapping rules we have is like a set of views available to answer a query. By our rule generation algorithm, we supply bindings for a given leaf element or attribute of the target schema A2 in only one mapping rule; thus when we translate Q2 against target A2 back to source A1, for a given tree pattern there is only a unique set of mapping rules (like views) that need to be joined to rewrite the tree pattern query. For a query Q2 against the target, it may match a set of rules (i.e., views), and we obtain a unique rewriting w.r.t. each rule. For a join of tree pattern queries, there is a unique set of relevant mapping rules that need to be joined to rewrite Q2 against A2 into Q2^t against A1. Because of the unique bindings for leaf elements or attributes, translating Q2 to Q2^t is like joining the appropriate views. Thus we have ∀D1 ∈ inst(A1) : Q2^t(D1) = Q2(μ(D1)). □

Chapter 7

Query Translation for Generalized Tree Patterns

7.1 Generalized Tree Pattern

In the previous chapter, we discussed our query translation algorithm for XML queries that can be represented as (joins of) tree patterns. While XQuery expression evaluation includes the matching of tree patterns, and hence can include TPQ evaluation as a component, there is much more to XQuery than TPQs. To facilitate efficient evaluation of generic XQuery queries, the notion of a generalized tree pattern (GTP) was introduced in [13]; it extends the TP notation to accommodate arbitrarily nested queries with more complex construction clauses. In particular, GTPs allow the use of optional edge relationships and carry grouping information in the representation. As a preview, consider an example XQuery and its corresponding generalized tree pattern query from [13], shown in Figure 7.1.
This is a sample query against the auction.xml document of the XMark benchmark [31]. As the figure shows, a GTP can have both solid and dotted edges. Solid edges represent mandatory relationships (parent-child or ancestor-descendant), just like the edges of a TPQ. Dotted edges denote optional relationships. For example, $i may optionally be a child of $l, and $w may optionally be a descendant of $p. As defined in [13], each maximal set of nodes in a GTP connected to each other by paths not involving dotted edges is called a group. Groups are disjoint, so each node in a GTP is a member of exactly one group. According to their group numbering convention, the group containing the FOR clause variables (including the GTP root) is marked group 0. For nested queries, a hierarchical group numbering scheme is used to capture the dependence between a block and the corresponding outer block in the query: a new level of hierarchy is added to the group number when entering a new block during GTP construction. A GTP can be informally understood as follows: (1) find matches for all nodes connected to the root by only solid edges; (2) next, find matches for the remaining nodes (those whose paths to the GTP root involve one or more dotted optional edges), if they exist.

FOR $p IN document("auction.xml")//person, $l IN $p/profile
WHERE $l/age > 25 AND $p/state != 'MI'
RETURN <result> {$p//watches/watch} {$l/interest} </result>

[The GTP diagram is not recoverable from the extraction; its constraint box reads: $p.tag = person & $s.tag = state & $l.tag = profile & $i.tag = interest & $w.tag = watches & $t.tag = watch & $g.tag = age & $g.content > 25 & $s.content != 'MI'.]

Figure 7.1: An example XQuery and corresponding generalized tree pattern query. Solid (dotted) edges = compulsory (optional) relationship. Group numbers of nodes are in parentheses. (Reference: [13])
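The group construction just defined, maximal sets of nodes connected by paths free of dotted edges, amounts to computing connected components over the solid edges only. A small sketch (the node and edge lists transcribe the GTP of Figure 7.1; the code itself is only an illustration, not HePToX's implementation):

```python
from collections import defaultdict

def gtp_groups(nodes, solid_edges):
    """Partition GTP nodes into groups: connected components of the
    graph restricted to solid (mandatory) edges."""
    parent = {n: n for n in nodes}
    def find(n):                      # union-find with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for a, b in solid_edges:          # union the endpoints of each solid edge
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

# Figure 7.1: dotted edges ($l,$i) and ($p,$w) are simply omitted here.
nodes = ["$p", "$l", "$s", "$g", "$i", "$w", "$t"]
solid = [("$p", "$l"), ("$p", "$s"), ("$l", "$g"), ("$w", "$t")]
groups = gtp_groups(nodes, solid)
print(len(groups))  # 3 groups: {$p,$l,$s,$g}, {$i}, {$w,$t}
```

The FOR-variable component {$p, $l, $s, $g} is the one marked group 0 in the figure.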
7.2 GTP Query Translation Examples

In this section, we walk through two query translation examples with GTPs to explain how we extend the tree-pattern query translation algorithm to translate GTP queries. The general algorithm is given in Section 7.3. The source and target schemas used in these examples are shown in Figure 7.2, and the mapping rules that specify the mappings between these two schemas are shown in Figure 7.3. We demonstrate how to translate a flat query against the source schema into a nested query against the target schema (Section 7.2.1), and also how to translate a nested query against the target schema into a possibly flat query against the source schema (Section 7.2.2).

7.2.1 XQuery Forward: Translating a Flat XQuery in the Source to a Nested XQuery in the Target

Suppose we have the XQuery shown in Figure 7.4 posed on Figure 7.2(a). This XQuery clearly cannot be expressed as a normal tree pattern, because of the optional node(s) in the RETURN clause. However, by applying algorithm GTP in [13], we can represent this XQuery as a GTP; the result is shown in Figure 7.5. This is step 1 in Figure 7.16. Once we have the GTP, we follow essentially the same steps as in the TP algorithm, i.e., expansion, translation and contraction, but with several differences due to the extra grouping information. Thus, we expand the obtained GTP_src
w.r.t. the mapping rule shown in Figure 7.3.

[Figure 7.2 schema diagrams are not recoverable from the extraction. The source schema (a) is books → book* → (isbn, title, author → (name, hobby)); the target schema (b) is catalog → author* → (name, hobby, book → (isbn, title)), matching the rule body and rule head of Figure 7.3.]

Figure 7.2: Book-Author Schemas for the GTP Query Translation Example

catalog → f_catalog($books)[author → f_author($name, $hobby, $isbn, $title)
    [name → $name, hobby → $hobby, book → $book[isbn → $isbn, title → $title]]]
←
books → $books[book → $book[isbn → $isbn, title → $title,
    author → $author[name → $name, hobby → $hobby]]]

Figure 7.3: HePToX Mapping Rules for the Schemas in Figure 7.2

XQ_src: FOR $b IN //book
        RETURN <book> $b/author/name </book>

Figure 7.4: XQuery XQ_src posed on the source schema

As mentioned in Section 7.1, according to the group numbering convention used in [13], the group containing the FOR clause variables (including the GTP root) is marked group 0.

[Figure 7.5 diagram content is not recoverable from the extraction.]

Figure 7.5: GTP Representation of XQ_src

Here we assume that any non-leaf node can be uniquely identified by all of its single-valued leaf descendants. Thus, when we expand GTP_src, in order to capture the grouping information (i.e., which node is the group-by base, and which nodes are grouped by it), for the group 0 root node (denoted Root(0)) we find all of its single-valued leaf descendants (which are the nodes grouped by the group-by base Root(0)) and mark them as group 0 as well. The reason we use all the leaf descendants is that in the current HePToX framework, correspondences between schemas usually occur at leaf nodes (the only exceptions are mapping a leaf to a boxed non-leaf node, a boxed non-leaf node to a leaf node, or mappings between two boxed non-leaf nodes).
Note that we also mark each of the single-valued leaf descendants non-dummy if it was originally marked as dummy during expansion. This also explains why the TP algorithm alone is not sufficient for GTPs: single-valued leaf descendants are used to propagate important grouping information to the target schema in the subsequent translation, so it would not be correct to leave them marked as dummy nodes. For example, in GTP_src, $isbn and $title would be dummy nodes under the TP query translation algorithm, but here they are marked non-dummy and given the same group number as $b', since they are single-valued leaf descendants of the group-by base $b'. The expanded query GTP^e_src with the corresponding grouping information is shown in Figure 7.6. The expansion phase corresponds to step 2 of the algorithm shown in Figure 7.16.

Having obtained the expanded GTP GTP^e_src, the next step is to translate it into the translated GTP GTP^t_tgt against the target schema, following the correspondences specified in the mapping rule of Figure 7.3. This resembles the translation phase of the TP algorithm, with differences due to optional edges and grouping information. We first build a basic translated GTP GTP^t_tgt according to the rule head of the mapping rule, and then mark dashed edges in GTP^t_tgt accordingly: if there is a dashed edge between the node corresponding to N in the source schema and the root node in the source, we mark as dashed the edge between N and its parent node in the target schema, to propagate the optionality information.
[Figure 7.6 diagram content is not recoverable from the extraction.]

Figure 7.6: Expanded GTP GTP^e_src for XQ_src

For each leaf node of GTP^t_tgt, we inherit the GTP grouping information from GTP^e_src following the correspondences specified in the mapping rule. An important difference w.r.t. the TP algorithm is made here by identifying the Least Common Ancestor (LCA) of leaf nodes; there is an LCA for each group of leaves. Indeed, for the leaf nodes marked as group 0 in GTP^t_tgt, we find their LCA (denoted LCA(0-leaves)), which becomes the group-by base in the target schema. We then turn to group 1 and find the LCA of all leaf nodes with group number 1 (denoted 1-leaves) together with LCA(0-leaves); we denote this LCA(1-leaves, LCA(0-leaves)). The translated GTP GTP^t_tgt is shown in Figure 7.7.

[Figure 7.7 diagram content is not recoverable from the extraction.]

Figure 7.7: Translated GTP GTP^t_tgt

We also mark dummy nodes as in the tree-pattern query translation algorithm, and remove dummy leaves in GTP^t_tgt for optimization purposes. The contracted GTP GTP^t_tgt(c) is shown in Figure 7.8. This is the final translated generalized tree pattern query for the target schema. What remains is to translate this GTP into an XQuery XQ^t_tgt. The steps for this translation are as follows: (1) First, we initialize XQ^t_tgt to be an empty string. (2) Let 0-leaves denote all leaf nodes with group number 0, and let LCA(i-leaves)
[Figure 7.8 diagram content is not recoverable from the extraction.]

Figure 7.8: Contracted Translated GTP GTP^t_tgt(c)

denote the LCA of each combination of two leaf nodes with group number i. Then we bind each LCA(0-leaves) to a variable, and put all these variables and their corresponding XPaths in the outer FOR clause of XQ^t_tgt. The XPath used for LCA(0-leaves) is the path expression from the root of the target schema to the node LCA(0-leaves), and the XPaths used for the 0-leaves are the ones from the LCA(0-leaves) node to each of them. (3) After group 0 is taken care of, we bind LCA(1-leaves, LCA(0-leaves)) to $LCA-1 and put it with its corresponding XPath expression in the inner FOR clause of XQ^t_tgt. Then we find the paths between each of the 0-leaves and $LCA-1, bind them to variables, and put them in the inner FOR clause of XQ^t_tgt; we denote these variables 0-leaves-2. (4) We then equate the variables 0-leaves specified in (2) and 0-leaves-2 in the inner WHERE clause of XQ^t_tgt. (5) Finally, we put the distinguished nodes with group number 1 (denoted 1-distinguished nodes) with their paths in the inner RETURN clause of XQ^t_tgt. The obtained XQ^t_tgt is the final translated XQuery against the target schema, shown in Figure 7.9. The general algorithm comprising the above steps is shown in Figure 7.16; readers may verify each step against the walkthrough above.

7.2.2 XQuery Backward: Translating a Nested XQuery in the Target to a Flat XQuery in the Source

In the previous subsection, we saw how a flat query is translated to a nested query by the GTP query translation algorithm. Now let us take the obtained nested XQuery XQ_tgt (Figure 7.9) as input and translate it via the other direction of the mapping rule.
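The LCA computations used in steps (2) and (3) above can be sketched over a child-to-parent map of the target schema of Figure 7.2(b); the traversal code here is an illustration only, not HePToX's implementation:

```python
from functools import reduce

def lca(parent, a, b):
    """Least common ancestor of a and b, given a child -> parent map."""
    ancestors = set()
    while a is not None:      # collect all ancestors of a up to the root
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:  # climb from b until we hit one of them
        b = parent[b]
    return b

def lca_of(parent, nodes):
    """Fold pairwise LCA over a set of nodes."""
    return reduce(lambda x, y: lca(parent, x, y), nodes)

# Target schema: catalog -> author -> {name, hobby, book -> {isbn, title}}
parent = {"author": "catalog", "name": "author", "hobby": "author",
          "book": "author", "isbn": "book", "title": "book"}
print(lca_of(parent, ["isbn", "title"]))  # 'book'   = LCA(0-leaves)
print(lca_of(parent, ["name", "book"]))   # 'author' = LCA(1-leaves, LCA(0-leaves))
```

Consistently with this, the translated query in Figure 7.9 iterates over //book in its outer FOR clause and over //author in its inner FOR clause.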
(Step 1) Just as before, we apply algorithm GTP in [13] to translate XQ_tgt to a GTP GTP_tgt, as shown in Figure 7.10. Note that we obtain two generalized tree patterns for this nested XQuery: Figure 7.10(A) and (B). Next we use the mapping rules to translate each of them.

FOR $b IN //book, $i IN $b/isbn, $t IN $b/title
RETURN <book> {
    FOR $a IN //author, $i1 IN $a/book/isbn, $t1 IN $a/book/title
    WHERE $i = $i1 AND $t = $t1
    RETURN $a/name }
</book>

Figure 7.9: Final Translated Query XQ^t_tgt

[The Figure 7.10 diagrams, (A) group 0 and (B) group 1.0, are not recoverable from the extraction; the join constraints between them are $i = $i1 and $t = $t1.]

Figure 7.10: GTP representation of XQ_tgt

(Step 2) Translate GTP_tgt[A] to GTP^t_src[A], and translate GTP_tgt[B] to GTP^t_src[B].

(2A) Translate GTP_tgt[A] (Figure 7.10(A)):

2A.1 Expand to GTP^e_tgt[A] w.r.t. the rule head (Figure 7.11). Since this is a translation from the rule head to the rule body of the mapping rule, we only need to mark dummy nodes at the leaves, because non-leaf nodes in the rule head are assigned Skolem functions rather than variables and have no counterparts in the rule body. For the same reason, we only need to mark group information for leaf nodes.

2A.2 Translate GTP^e_tgt[A] to GTP^t_src[A] (Figure 7.12). Note that a non-leaf node in the translated GTP against the source schema (which corresponds to the rule body of the mapping rule) is dummy if all
[Figure 7.11 diagram content is not recoverable from the extraction.]

Figure 7.11: Expanded GTP [A] against the rule head, GTP^e_tgt[A]

its children are dummy ($author in this example). Note that here this is not only an optimization issue but also a correctness issue, because we do not want to check for the existence of $author in the translated GTP in this case.

2A.3 Remove the dummy leaf nodes in GTP^t_src[A] and obtain GTP^t_src[A](c).

[Figure 7.12 diagram content is not recoverable from the extraction.]

Figure 7.12: Translated GTP [A] against the source schema, GTP^t_src[A]

(2B) Translate GTP_tgt[B] (Figure 7.10(B)):

2B.1 Expand to GTP^e_tgt[B] w.r.t. the rule head (Figure 7.13).

[Figure 7.13 diagram content is not recoverable from the extraction.]

Figure 7.13: Expanded GTP [B] against the target schema, GTP^e_tgt[B]

2B.2 Translate GTP^e_tgt[B] to GTP^t_src[B] (Figure 7.14).

[Figure 7.14 diagram content is not recoverable from the extraction.]

Figure 7.14: Translated GTP [B] against the source schema, GTP^t_src[B]

2B.3 Remove the dummy leaf nodes in GTP^t_src[B] and obtain GTP^t_src[B](c).

(Step 3) Translate GTP^t_src[A](c) and GTP^t_src[B](c) to XQ^t_src (shown in Figure 7.15); this is the final translated XQuery.
XQ^t_src:
FOR $b IN //book, $i IN $b/isbn, $t IN $b/title
RETURN <book> {
    FOR $b1 IN //book, $i1 IN $b1/isbn, $t1 IN $b1/title
    WHERE $i = $i1 AND $t = $t1
    RETURN $b1/author/name }
</book>

Figure 7.15: Final Translated Query XQ^t_src

After we obtain the nested XQuery XQ^t_src, a little reasoning shows that in Figure 7.15 the inner FOR clause is essentially the same as the outer FOR clause; the nesting of the two FOR clauses and the equalities in the inner WHERE clause are therefore unnecessary, and the query can be rewritten as the flat query shown in Figure 7.4. Note that we may not always have equality join constraints in the translated nested XQuery, in which case we will not be able to rewrite the translated nested XQuery as a flat one. The reasoning used here will be explored further in HePToX's future research.

7.3 GTP Query Translation Algorithm

As illustrated by the examples above, we extend the tree-pattern query translation algorithm to translate more complicated XML queries (for example, nested XML queries), making use of the additional properties that a generalized tree pattern captures. The formal GTP query translation algorithm is given in Figure 7.16.

Input: an XQuery XQ_i, a mapping rule set from A1 → A2
Output: one translated nested XQuery XQ_i^t
1. Translate XQ_i to a GTP_i set (applying algorithm GTP in [13])
2. For each GTP_i, apply the 3 translation phases:
   2.1 Expand GTP_i w.r.t. each mapping rule MR as done currently
       FOR group a.0 root node Root(a.0)   -- a can be empty, e.g.
group 0
           Find all of its single-valued leaf descendants
           Mark them with the same group number (i.e., a.0)
           Mark them non-dummy if necessary
   2.2 Translate GTP_i to GTP_i^t according to MR
       2.2.1 FOR each leaf node N of GTP_i^t
             If there is a dashed edge between N's corresponding node in the
                other schema and the root node in the other schema
                Mark the edge between N and its parent node dashed
             Inherit the GTP grouping information from GTP_i according to MR
       2.2.2 FOR the group a.0 leaf nodes in GTP_i^t (denoted a.0-leaves)
             Find LCA(a.0-leaves) of all group a.0 leaf nodes
       2.2.3 Turn to group a.1 and find LCA(a.1-leaves, LCA(a.0-leaves))
   2.3 Contract GTP_i^t by removing dummy leaf nodes and obtain GTP_i^t(c)
3. Translate the set of GTP_i^t(c) to XQ_i^t
   3.1 Initialize XQ_i^t to be an empty string
   3.2 Bind each LCA(a.0-leaves) and all a.0-leaves to variables
       Put them and their corresponding XPaths in XQ_i^t's outer FOR clause
       (the XPath used for LCA(a.0-leaves) is the path expression from the root;
        the XPaths for the a.0-leaves are the ones from LCA(a.0-leaves) to them)
   3.3 Bind LCA(a.1-leaves, LCA(a.0-leaves)) to $LCA-a.1
       Put $LCA-a.1 with its corresponding XPath in XQ_i^t's inner FOR clause
       Find Path(a.0-leaves, $LCA-a.1)
       Bind each path to its corresponding a.0-leaf
       Put them in XQ_i^t's inner FOR clause
       Denote these a.0-leaf variables a.0-leaves-2
   3.4 Equate a.0-leaves and a.0-leaves-2 (with the same path in the schema)
       in XQ_i^t's inner WHERE clause
   3.5 Put the a.1-distinguished nodes with their paths in XQ_i^t's inner
       RETURN clause

Figure 7.16: GTP Query Translation Algorithm

7.4 Conclusions and Future Work

We have proposed an algorithm to translate generalized tree patterns. For a restricted yet significant subset of XQuery, as modeled by GTPs, we have devised a translation strategy both along and against the direction of the mapping rules. Note that our algorithm and our mapping rules do not expect any "key" information whatsoever.
For an arbitrary XML document, one cannot assume the presence of keys as one can for relational tuples. HePToX's future work will study the possible implications of adding key information to the schemas and to the mapping rules, and how this can aid the translation process.

Chapter 8

Exploring Many-to-One and One-to-Many Mappings in HePToX

8.1 Introduction

Much of the existing literature on data integration employs element mappings between different schemas that store similar or related information. However, as surveyed in [16], most existing approaches focus on 1:1 mappings, which map one element of one schema to one element of the other schema. Some works mention N:1 mappings ([23], [36], [27], [33]), but more work is needed to explore more sophisticated criteria for generating non-1:1 mappings, which are currently hardly treated at all. In this chapter, we study why N:1 and 1:N mappings are hard to deal with. (Note that we do not deal with N:M mappings here.) We propose a way to capture N:1 and 1:N mappings using visual annotations in HePToX, and we explore how to represent these mappings in HePToX mapping rules, which are later used to extend the current query translation algorithm in HePToX to handle N:1 and 1:N mappings. For query translation involving N:1 and 1:N mappings, we also study the soundness and completeness of the query translation results. This is an extension of the infrastructure of HePToX explained in the previous chapters. In Section 8.1.1 we discuss in more detail why dealing with N:1 and 1:N mappings remains a very challenging problem in data integration, and in Section 8.2 we discuss different categorizations of N:1 and 1:N mappings. Section 8.3 then discusses how these mappings are represented in HePToX mapping rules, and Section 8.4 walks through the query translation algorithm with N:1 and 1:N mappings, with examples.
We study the soundness and completeness of the query translation answers in Section 8.5, and we conclude in Section 8.6 by pointing out limitations of the studies so far and interesting future work.

8.1.1 Challenges of Dealing with N:1 and 1:N Mappings

Dealing with N:1 and 1:N mappings in data integration is a challenging task for the following main reasons:

• It is hard to identify N:1 and 1:N mappings.

• Even assuming that we have correctly identified N:1 and 1:N mappings, it is also hard to represent them, because the mappings admit many different variations and possibilities (discussed in Section 8.2) and are thus very hard to generalize.

• Even assuming that we have correctly identified N:1 and 1:N mappings and represented them in a sufficiently general and abstract way, translating queries using these mappings is still a very hard problem. For example, how do we guarantee correctness, completeness and soundness of the translated query? Since even query translation using 1:1 mappings is a complicated process, it is natural to expect that query translation with N:1 and/or 1:N mappings can involve very heavyweight operations and thus be hard to scale in a network requiring many translations.

8.2 Different Categorizations of N:1 and 1:N Mappings

We started studying the problem of N:1 and 1:N mappings by first considering different categorizations of these mappings, in order to better understand their characteristics. The examples in this section use N:1 mappings; the categorizations of 1:N mappings can be understood similarly. We consider three different axes of categorization: (1) mapping at the schema level or the data level (Section 8.2.1); (2) mapping from multiple tuples or a single tuple (Section 8.2.2); and (3) mappings related to string or arithmetic manipulation (Section 8.2.3).
8.2.1 Mapping at the Schema Level or the Data Level

One way to categorize N:1 and 1:N mappings is to look at whether the N elements in schema A are related to the 1 element in the other schema B at the schema level, at the data level, or both. In XML, schema-level mappings are mappings between node tag names, and data-level mappings are mappings between instance values. There can also be more complicated mappings that involve both the schema and the data level. It is important to recognize these differences, since they differ considerably in terms of data exchange semantics. For example, here are some possible variations:

• Map the tag names of N nodes in schema A to the tag name or the instance value of 1 node in schema B (examples shown in Figure 8.1).

• Map the instance values of N nodes in schema A to the tag name or the instance value of 1 node in schema B (examples shown in Figure 8.2).

• Map some combination of tag names and instance values of the N nodes in schema A to 1 node in schema B (examples shown in Figure 8.3).

[Figure 8.1 diagram content is not recoverable from the extraction.]

Figure 8.1: N:1 Mapping Variation 1 - Mapping tag names to a tag name (a) or an instance value (b)

8.2.2 Mapping from Multiple Tuples or a Single Tuple

This is another way to categorize N:1 and 1:N mappings. Even if we only deal with data-level N:1 and 1:N mappings, different kinds of variations still exist.
For example, one possible N:1 mapping could relate N instance values, stored in multiple tuples of one particular node in one schema, to one node in the other schema. For instance, we can relate 12 monthly expenses (each month may have a different expense), stored in 12 tuples of the node mExpense in one schema, to a node AnnualExpense in another schema that stores the yearly expense for a particular company. The annual expense in the latter schema is the sum of the monthly expenses stored in the first schema (Figure 8.4). This is an aggregation mapping, which uses an aggregation function that takes the N elements in one schema as arguments to obtain the 1 element in the other schema. This is a very interesting case, and will be explored further in HePToX's future research.

Compared to the above aggregation mapping example, another possible N:1 or 1:N mapping could relate the instance values of N different nodes, stored in a single tuple in one schema, to one node in the other schema. For example, we can map price and tax in one single tuple (corresponding to the same product) in one schema to the cost node in the other schema. Note that this is a very different mapping from the expense example above in terms of how many tuples the mapping relates. It is important for
Exploring 1:N and N:l Mappings in HePToX 60 conf2 VLDB / 'VLDB\ SIGMOD 2004 \2Q05s' 2004 Conflnfo CqnfName publication ,YB3C_ author title (example of some instance values) • Map'N instance values in A to 1 tag nanie;iri'B publication author title B) Target Schema SIGMOD 2005 (a) A) Source Schema conf4 DBConf author ConfName Year Cj2005-~>' title (example of some instance values) conf3 I DBConf ConfID publication B) Target Schema Map N instance values in A to 1 instance value in B - - * - , " V L D E r - • / t 2005" ) author (example of one instance value) title (b) Figure 8.2: N:l Mapping Variation 2 - Mapping instance values to a tag name (a) or an instance value (b) us to recognize these differences, which could have impact in query translation in the later process. 8.2.3 String vs Arithmetic Manipulation Other than the above two categorizations, we also consider another way of categorizing the mappings - depending on the characteristics of the mapping. Assume that we don't care whether the value(s) to be mapped are tag name(s), instance value(s), or even a combination of both, we abstract an N:l and 1:N mappings to the idea of relating N values in one peer to 1 value in another peer and we classify how these values relate in the two main categories: • String Manipulation (example shown in Figure 8.5) • Arithmetic Manipulation(example shown in Figure 8.6) This is the categorization that the following sections are focusing on. How to deal with the other 2 categorizations of N:l and 1:N mappings (mappings at schema-level or data-level, and mappings from multiple tuples or a single tuple) will be explored further in HePToX's future research. Some intuitions of these future work are given in Section 8.6. Chapter 8. 
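As a concrete illustration, the two retained categories (and the aggregation variant of Section 8.2.2) can be sketched as simple Python mapping functions; the dict-based records below are hypothetical stand-ins for the XML tuples in Figures 8.4-8.6, not HePToX data structures:

```python
# Illustrative sketches of the mapping categories discussed above.

# String manipulation (cf. Figure 8.5): concatenate FName and LName
# in the source into a single name2 value in the target.
def map_name(src):
    return {"name2": src["FName"] + " " + src["LName"]}

# Arithmetic manipulation (cf. Figure 8.6): price * (1 + tax) = cost.
def map_cost(src):
    return {"cost": src["price"] * (1 + src["tax"])}

# Aggregation mapping (cf. Figure 8.4): N mExpense tuples are summed
# into one AnnualExpense value.
def map_annual_expense(monthly_tuples):
    return {"AnnualExpense": sum(t["mExpense"] for t in monthly_tuples)}

print(map_name({"FName": "Elaine", "LName": "Chang"}))
print(map_cost({"price": 80.0, "tax": 0.07}))
print(map_annual_expense([{"mExpense": 100} for _ in range(12)]))
```

The first two functions relate values within a single tuple, while the third relates values drawn from N tuples, mirroring the distinction of Section 8.2.2.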
[Figure 8.3: N:1 Mapping Variation 3 - Mapping combinations of tag names and instance values to a tag name (a) or an instance value (b).]

[Figure 8.4: Example of mapping data from N multiple tuples to one tuple (aggregation mapping): the mExpense tuples under ExpenseReport1 are aggregated (Select year1, Sum(mExpense) Group By year1) into AnnualExpense under ExpenseReport2.]

[Figure 8.5: Book-Author example - 2:1 string manipulation mapping: FName and LName under bib/book1/author1 in the source schema are concatenated into name2 under catalogue/author2 in the target schema.]

[Figure 8.6: Price-Cost example - 2:1 arithmetic manipulation mapping: price and tax in the source schema are related to cost in the target schema by price*(1+tax) = cost.]

8.3 Representation of N:1 and 1:N Mappings in HePToX

8.3.1 Visual Annotations used to Capture N:1 and 1:N Mappings

To study how to extend the current HePToX framework with N:1 and 1:N mappings, the first question we need to consider is how to capture an N:1 mapping using visual annotations in HePToX.
Adapting the arrows and boxes used to specify correspondences in HePToX, we first circle the nodes impacted by the N:1 or 1:N mapping (the N nodes in one schema, and the 1 node in the other schema), then connect the circles from the two schemas with a "mapping box" which specifies the details of the N:1 or 1:N mapping (i.e., the characteristics of the mapping, such as string concatenation, or the arithmetic expression relating the N nodes to the 1 node), and finally we complete the arrow from the "mapping box" to the node(s) in the target schema that the mapping maps to.

Let us reconsider the example in Figure 8.5. There we concatenate the data in FName and the data in LName in the source to obtain name2 in the target; thus we have a 2:1 mapping. The circled nodes are FName and LName in the source and name2 in the target, and the "mapping box" specifies that this mapping is a string concatenation.

8.3.2 Representation of N:1 and 1:N Mappings in Mapping Rules

Having captured an N:1 or 1:N correspondence in HePToX, how do we represent this kind of mapping in the mapping rules? In the straight 1:1 case, we can easily represent the mapping relationship between the source and the target by referring to the same node variable: a correspondence between the two schemas is established by having the same node variable appear in both the source and the target. In the case of N:1 or 1:N mappings, however, we need more complex and powerful representations which can also capture the type of the mapping - string manipulation or arithmetic manipulation (see Section 8.2.3).
To capture this generically, we propose to use a mapping expression, which we call an "Nto1 Expression" for N:1 mappings (or a "1toN Expression" for 1:N mappings), that takes the binding variables of the N elements in one schema as arguments and specifies how the N elements in one schema are related to the 1 element in the other schema (examples explained below). Note that for now we assume that the N+1 nodes impacted by every N:1 or 1:N mapping appear in the same mapping rule.

Taking Figure 8.5 as an example, the mapping rule for this schema pair with the "Nto1 Expression" is shown in Figure 8.7. In the mapping rule, we assign name2 in the rule head a node variable $name2 and then assign $name2 the Nto1 Expression concat($fn, $ln), to indicate that the name2 node in the target schema is the string concatenation of FName ($fn) and LName ($ln) in the source. This captures the 2:1 correspondence in Figure 8.5.

Now let us look at the "Nto1 Expression" and "1toN Expression" in more detail. Recall that when we only had 1:1 correspondences in our mapping rules, all the nodes in the source schema were initially assigned to variables, and then every node in the target schema was bound either to exactly one variable from the source schema or to a Skolem function taking as arguments the corresponding source variables, each associated with a single-valued leaf descendant of that non-leaf target node.

When we specify an Nto1 mapping, we assign the 1 node in the target schema that is involved in the N:1 mapping a variable that will be specified later by an "Nto1 Expression". Let us call this variable in the rule head the "Nto1 Variable" from now on.
Similarly, when we specify a 1toN mapping, we first assign the N nodes in the target schema to variables, then assign the 1 node in the source schema that is involved in the 1:N mapping a "1toN Variable", and then associate this variable with a "1toN Expression". We put both the "Nto1 Expression" and the "1toN Expression" in the rule body. In this way, all the leaf nodes (in both target and source) are still bound to variables, which is consistent with how mapping rules with only 1:1 correspondences are written. Keep in mind that mapping rules only specify the correspondences of nodes between the source and target schemas; they are not intended to specify how to construct instances of one schema given instances of the other.

catalogue -> f_catalogue($bib)
  [author2 -> f_author2($fn, $ln)
    [name2 -> $name2,
     book2 -> f_book2($title1) [title2 -> $title1]]]
<--
bib -> $bib
  [book1 -> $book1
    [author1 -> $author1 [FName -> $fn, LName -> $ln],
     title1 -> $title1]],
$name2 = concat($fn, $ln).

Figure 8.7: Mapping rule with an Nto1 Expression for the Book-Author example in Figure 8.5.

8.3.3 Formal Algorithm for Mapping Rule Generation with N:1 and 1:N Mappings

Incorporating what was explained in Section 8.3.2 into the previous HePToX mapping rule generation algorithm (which handled only 1:1 correspondences), we obtain a new mapping rule generation algorithm that deals not only with 1:1 mappings but also with N:1 and 1:N mappings. As before, the algorithm consists of three parts: detecting groups (unchanged from the 1:1 algorithm), generating tree expressions, and constructing rules (as shown in Figure 8.8). The group formation algorithm can stay the same because the reason we form different groups for a particular DTD is to deal with the multiplicity problem (explained in Chapter 4), which is the same for 1:1, N:1 and 1:N mappings. The differences between N:1/1:N mappings and 1:1 mappings in the mapping rules arise in the rule construction algorithm (Figure 8.8).

8.3.4 Variable Binding in Mapping Rules

In a mapping rule with only 1:1 correspondences, we have

  rule-head(V) <-- rule-body(V, U)

where V are the variables mentioned in the rule head (which also appear in the rule body), and U are the variables bound in the rule body that have no correspondences in the rule head (e.g., non-leaf nodes). Thus, all variables in the rule head also exist in the rule body, but not all variables in the rule body exist in the rule head. For example, suppose we have the source and target schemas shown in Figure 8.9, and the mapping rule inferred from the correspondences shown in Figure 8.10. For this mapping rule, V includes the node variables $bib, $name1 and $title1, and U includes the node variables $book1 and $author1.

In a mapping rule with Nto1 and 1toN mappings, as explained in Section 8.3.2, the variable bindings are slightly different. For an Nto1 mapping, we assign the N nodes in the rule body to variables, assign an Nto1 Variable to the 1 node in the rule head, and specify the Nto1 Expression for this node in the rule body. For a 1toN mapping, we assign variables to the N nodes in the rule head (which participate in the 1toN mapping), assign a 1toN Variable to the 1 node in the rule body, and specify the 1toN Expression for this node in the rule body. Note that both 1toN Variables and Nto1 Variables are variables that are not mentioned in the other side of the rule, even though they are related to variables of the other side indirectly via the "1toN Expression" or "Nto1 Expression". For example, in the mapping rule shown in Figure 8.7, $name2 in the rule head is an Nto1 Variable.
It does not have a counterpart in the rule body ($name2 is not associated with any node in the source schema), but it is related to the node variables $fn and $ln in the source via the Nto1 Expression concat($fn, $ln).

Thus, for mapping rules involving only Nto1 Expressions, we have

  rule-head(W, V') <-- rule-body(W, V, U),
  V' -> (via Nto1 Expression assignment) -> V

and for mapping rules involving only 1toN Expressions, we have

  rule-head(W, V) <-- rule-body(W, V', U),
  V' -> (via 1toN Expression assignment) -> V

where W are the variables in the rule head that have counterparts in the rule body; V are the N variables on one side of the rule (the rule body for an Nto1 mapping, or the rule head for a 1toN mapping) that have no counterparts on the other side of the rule but relate to certain variable(s) in the other schema via some function (an "Nto1 Expression" or "1toN Expression"); U are the variables bound in the rule body that have no correspondences in the rule head (e.g., non-leaf nodes); and V' refers to the Nto1 Variable or 1toN Variable that participates in the Nto1 or 1toN mapping.

For mapping rules that have both Nto1 and 1toN Expressions, we have

  rule-head(W, P', R) <-- rule-body(W, P, R', U),
  P' -> (via Nto1 Expression assignment) -> P,
  R' -> (via 1toN Expression assignment) -> R

where W are the variables in the rule head that have counterparts in the rule body; P are the N variables in the rule body that have no counterparts in the rule head but relate to certain variable(s) in the rule head via an "Nto1 Expression"; P' refers to the Nto1 Variable in the rule head that participates in the Nto1 mapping; R are the N variables in the rule head that have no counterparts in the rule body but relate to certain variable(s) in the rule body via a "1toN Expression"; R' refers to the 1toN Variable in the rule body that participates in the 1toN mapping; and U are the variables bound in the rule body that have no correspondences in the rule head (e.g., non-leaf nodes).

For example, the rule in Figure 8.16 has both Nto1 and 1toN Expressions. W includes the variables $bib and $isbn; P' refers to the Nto1 Variable $cost; R refers to $base and $benefits; P refers to $price and $tax; R' refers to the 1toN Variable $salary; and U refers to $book, $author and $priceinfo. In the rule body, the 1toN Expression specifies that the 1toN Variable R' (i.e., $salary) is related to $base and $benefits by adding the two together, while the Nto1 Expression captures the Nto1 mapping from price and tax to cost.

8.4 Query Translation with N:1 and 1:N Mappings

8.4.1 Query Translation Algorithm

We obtained the mapping rules in Section 8.3; now let us look at how the "Nto1 Expression" and "1toN Expression" are integrated into HePToX's query translation algorithm. Here is the intuition. Recall that dummy nodes are nodes that appear in a mapping rule but are not mentioned in a particular query, and distinguished nodes are the nodes a query is asked to "return" in an XQuery. As before, the query translation algorithm for a tree pattern with a particular mapping rule consists of the two main phases of expansion and translation. In the expansion phase, we mark dummy and distinguished nodes and obtain the substitution that relates variables in the two schemas. This phase is unchanged from the previous algorithm.
Then, in the translation phase, consider a node with an "Nto1 Variable". Unlike the straight 1:1 case, where every leaf node corresponds to a single variable, such a leaf node has an Nto1 Expression relating it to more than one variable. For optimization purposes, we still need to mark each node dummy or non-dummy as before. Just as with the Skolem functions used earlier, if at least one of the N nodes associated with the "Nto1 Variable" is non-dummy, then this node is non-dummy; otherwise, we mark the "Nto1 Variable" as dummy. If one of the associated N nodes is distinguished, then this node is distinguished.

For a node with a "1toN Variable", if the 1 node in the source is non-dummy, the N nodes in the target are all non-dummy, and they propagate the constraints from the 1 node in the source according to the 1toN Expression specified in the mapping rule (e.g., the arithmetic expression). If the 1 node in the source is distinguished, the N nodes in the target are all distinguished.

The formal algorithm implementing this intuition is shown in Figure 8.11 and Figure 8.12. The algorithm varies depending on whether we are translating queries from source to target or from target to source, and on whether we are dealing with a 1toN Expression or an Nto1 Expression.

8.4.2 Query Translation Example with only an N:1 Mapping

Suppose we pose the following XQuery against the source schema in Figure 8.5: "Find the last name of the author who wrote the book with the title 'XML'", expressed as:

Q1: FOR $b IN //book1, $a IN $b/author1
    WHERE $b/title1 = "XML"
    RETURN { $a/LName }

This query can be represented as a tree pattern, as shown in Figure 8.13. As in the query translation algorithm with 1:1 correspondences, in the expansion phase we mark dummy and distinguished nodes as before, and we also obtain the substitution that relates variables in the rule body to those in Q1. After the expansion phase, we obtain Figure 8.14(a).
Then, in the translation phase, we use the new algorithm to handle the "Nto1 Variable" name2. As explained in Figure 8.11, just as with Skolem functions, if at least one of the N nodes in the Nto1 Expression for an Nto1 Variable is non-dummy, the Nto1 Variable is non-dummy; otherwise we mark it as dummy. Since one of its associated nodes is a distinguished variable (i.e., $ln'), this node is distinguished. The translated tree pattern query after the translation process is shown in Figure 8.14(b). As a last step, we shrink (b) as in the usual HePToX query translation algorithm and obtain (c), the final translated XQuery.

8.4.3 Query Translation Example with N:1 and 1:N Mappings

Let us consider the correspondences shown in Figure 8.15 between two schemas related by both an N:1 mapping and a 1:N mapping. Following Section 8.3, we obtain the mapping rule shown in Figure 8.16.

Query Translation from Source to Target Schema

Consider the following XQuery (Q2), which involves both the node in the 1toN mapping (i.e., salary) and a node in the Nto1 mapping (i.e., price): "Find the salary of the author whose salary is over 10K and whose book is sold for less than $100.", expressed as:

Q2: FOR $b IN //book, $s IN $b/author/salary/text(), $p IN $b//price/text()
    WHERE $s > 10K and $p < 100
    RETURN { $s }

As in the previous examples, Q2 can be represented as a tree pattern, as shown in Figure 8.17. The translation process of Q2 is shown in Figure 8.18. There are several interesting issues here:

Constraint Propagation: Note that when we propagate the constraints for a 1toN mapping, we replace the 1 element in the original source-side constraint with the mapping formula. For example, the original constraint in Q2 was "salary>10K". In Q2translated, we propagate the constraint and obtain "base+benefits>10K".
This can make it difficult to prune early when evaluating the query, although the pruning effectiveness also depends on the characteristics of the constraint: if the constraint were "base+benefits<10K", it would be easier to prune.

"?" in Constraints: Suppose the cardinality between benefits and author in the target schema were actually "?"; then some instances may lack a value. For example, if the 1toN mapping were "salary=wages*hours" and we got a null value for wages, we might not be able to calculate salary.

Semantics of the mapping: We might want to represent something like

  salary = (base if one exists) + (benefits if one exists)

Query Translation from Target to Source Schema

With our translation algorithm, just as before, we can translate queries not only in the same direction as the mapping rules (from the source to the target schema) but also in the opposite direction (from the target to the source schema). Here is an example based on the Price-Salary example (Figure 8.15) with the mapping rule of Figure 8.16. Consider the following XQuery (Q3tgt) posed on the target schema of Figure 8.15(B): "Find the base salary of the author whose base salary is over 10K and whose book costs less than $100.", expressed as:

Q3tgt: FOR $b IN //book, $ba IN $b/author/base/text(), $c IN $b//cost/text()
       WHERE $ba > 10K and $c < 100
       RETURN { $ba }

Q3 is represented as a tree pattern, as shown in Figure 8.19. The translation process of Q3 is shown in Figure 8.20.
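The constraint-propagation step used in these translations (e.g., rewriting salary > 10K into base + benefits > 10K in Q2) can be sketched as a textual substitution; the constraint format and the MAPPINGS table below are simplifying assumptions for illustration:

```python
# A sketch of constraint propagation for a 1toN mapping: the 1 element in a
# source-side constraint is replaced by the mapping formula over the N target
# elements. Constraints are assumed to be "<variable> <operator> <constant>".

MAPPINGS = {
    "salary": "(base + benefits)",   # 1toN: salary = base + benefits
}

def propagate(constraint):
    """Rewrite e.g. 'salary > 10000' into '(base + benefits) > 10000'."""
    var, op, const = constraint.split()
    return f"{MAPPINGS.get(var, var)} {op} {const}"

print(propagate("salary > 10000"))   # (base + benefits) > 10000
print(propagate("price < 100"))      # price < 100 (not 1toN-mapped, untouched)
```

Variables not involved in a 1toN mapping pass through unchanged; those are handled by the Nto1/Skolem cases of the algorithm instead.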
8.5 Analysis of Query Translation involving N:1 and 1:N Mappings

8.5.1 Definitions of Soundness and Completeness of Query Translation

Intuitively, soundness and completeness adhere to their usual definitions: if all the answers returned by the translated query are correct, the translation is sound; if the translated query is guaranteed to return all answers that the original query would, it is complete. Formally, suppose we have a database Dsrc (Dtgt) for the source (target) schema. For Dsrc, we construct/enumerate n hypothetical databases D^i_src_hyp. We take the intersection of the answers of Qsrc posed on each D^i_src_hyp, i.e.,

  ∩_{i=1..n} Qsrc(D^i_src_hyp)

and we also obtain the answer of the translated query posed on Dtgt, i.e., Q^trans_tgt(Dtgt).

(a) If the translated query returns not only all the answers that the original query would, but also some noise, then the answers returned by the translated query are complete but not sound. Formally: if Q^trans_tgt(Dtgt) ⊇ ∩_{i=1..n} Qsrc(D^i_src_hyp), then Q^trans_tgt(Dtgt) is complete but not sound.

(b) If the translated query returns only correct answers, but not all the answers that the original query would, then the answers returned by the translated query are sound but not complete. Formally: if Q^trans_tgt(Dtgt) ⊆ ∩_{i=1..n} Qsrc(D^i_src_hyp), then Q^trans_tgt(Dtgt) is sound but not complete.

(c) If the translated query returns exactly the answers that the original query would, all of which are correct, then the answers returned by the translated query are both sound and complete. Formally: if Q^trans_tgt(Dtgt) = ∩_{i=1..n} Qsrc(D^i_src_hyp), then Q^trans_tgt(Dtgt) is both sound and complete.

8.5.2 Price-Cost Example

Let us reconsider the example of the 2:1 mapping between price, tax in the source and cost in the target in Figure 8.6.
Similar to the book-author string concatenation example discussed above, we use an "Nto1 Variable" with an "Nto1 Expression" to represent the 2:1 mapping in the mapping rule (see Figure 8.21). Consider the following query asked on the source schema: "Find the titles of the books that are sold in 'Canada' with price less than $100 excluding tax." (assume for now that country names are stored in the node region, and each country name uniquely determines the tax rate of the book). This query can be expressed as the following XQuery:

Q4: FOR $b IN //book
    WHERE $b/region = "Canada" AND $b/price < 100
    RETURN { $b/title }

Applying the translation algorithm discussed above, we obtain the following translated XQuery:

Q4translated: FOR $b IN //book
              WHERE $b/region = "Canada" AND $b/cost < 100
              RETURN { $b/title }

8.5.3 Sound but not Complete Translated Query Results

Are the answers returned by the translated query sound? Yes. By the arithmetic expression relating price, tax and cost (price*(1+tax)=cost), and assuming tax is always greater than 0, price is always less than cost. Thus, every book with cost less than 100 must also have price less than 100. That is to say, if we return all the book titles with cost less than 100 sold in Canada from the target schema, they are sound answers.

Are the answers returned by the translated query complete? No. Under the same assumption that tax is always greater than 0, if we return all the book titles with cost less than 100, there may still be books with price less than 100 but cost greater than 100 that have not been returned. Thus, the returned answers are not complete. Note that price and tax are in the source schema, and cost is in the target schema.
When we query the source schema, we translate Qsrc to Qtgt to get the data stored in the target schema. In some situations, if we can obtain the tax data from the source schema, we can do some arithmetic post-processing after receiving the returned cost data from the target to recover the expected price information. Most of the time, however, the necessary tax data is not available at the source, so we cannot guarantee completeness in this case.

In general, for sound but not complete query answering, we want to get minimally complete answers. We define minimal completeness as follows: the translated query Qtgt is minimally complete provided that for every query Q on Stgt that is complete w.r.t. Qsrc, Qtgt ⊆ Q.

8.5.4 Complete but not Sound Translated Query Results

On the other hand, if we change the constraint "price<100" to "price>100", we obtain "cost>100" after translation. As explained above, under our assumption that tax is greater than 0, cost is always greater than price. Books with cost greater than 100 may have price less than 100, so some garbage answers may be returned. However, all the books with price greater than 100 must have cost greater than 100, so the answer set returned by the translated query is a superset of the expected answer set. In this situation, our query translation results are complete but not sound.

In general, for complete but not sound query answering, we want to get maximally sound answers. We define maximal soundness as follows: the translated query Qtgt is maximally sound provided that for every query Q on Stgt that is sound w.r.t. Qsrc, Q ⊆ Qtgt.

8.5.5 Summary of Soundness and Completeness in Query Translation

In general, then, for query translation results with N:1 mappings in HePToX using the above technique, when can we guarantee soundness, when can we guarantee completeness, and is it possible to guarantee both?
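Before stating the general rule, the behaviour of Sections 8.5.3 and 8.5.4 can be checked mechanically; this is a minimal sketch on made-up data, assuming the price*(1+tax) = cost mapping:

```python
# With cost = price*(1+tax) and tax > 0, translating "price < 100" into
# "cost < 100" yields sound but possibly incomplete answers, while
# "price > 100" -> "cost > 100" yields complete but possibly unsound answers.

books = [
    {"title": "A", "price": 80.0,  "tax": 0.10},   # cost 88.0
    {"title": "B", "price": 95.0,  "tax": 0.10},   # cost 104.5
    {"title": "C", "price": 120.0, "tax": 0.10},   # cost 132.0
]
for b in books:
    b["cost"] = b["price"] * (1 + b["tax"])        # target-side value

orig_lt  = {b["title"] for b in books if b["price"] < 100}   # expected answers
trans_lt = {b["title"] for b in books if b["cost"] < 100}    # translated query
assert trans_lt <= orig_lt      # sound: no wrong titles are returned
assert trans_lt != orig_lt      # ...but incomplete: 'B' is missed

orig_gt  = {b["title"] for b in books if b["price"] > 100}
trans_gt = {b["title"] for b in books if b["cost"] > 100}
assert trans_gt >= orig_gt      # complete: every correct title is returned
assert trans_gt != orig_gt      # ...but unsound: 'B' is noise
```

The direction of the inequality relative to the inflating mapping function is exactly what flips the guarantee from soundness to completeness.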
For now, assume that we consider only arithmetic N:1 mappings involving =, >, >=, <, and <=, and that the constraints users may pose on the source schema are linear, one-variable constraints of the form <variable><operator><constant>, where <operator> is one of =, >, >=, <, or <=. In this case, if the direction of the <operator> is not in conflict with the direction of the mapping function, then we can guarantee sound but not necessarily complete answers (as in Section 8.5.3). Otherwise, if the direction of the <operator> is in conflict with the direction of the mapping function, then we can only guarantee complete but not necessarily sound answers (as in Section 8.5.4).

8.6 Conclusion and Future Work

In this chapter, we have presented how to extend the current HePToX P2P XML database system to deal with N:1 and 1:N mappings, focusing on the following key conceptual contributions:

1. a general discussion of possible categorizations of N:1 mappings (Section 8.2)

2. a visual annotation to capture N:1 mappings between schemas in HePToX (Section 8.3.1)

3. a way to represent N:1 and 1:N mappings in HePToX mapping rules using the "Nto1 Expression" and "1toN Expression" (Section 8.3.2)

4. an extension of the current HePToX query translation algorithm to employ the "Nto1 Expression" and "1toN Expression" in mapping rules (Section 8.4)

5. an analysis of the soundness and completeness of the translated queries obtained by applying 4 for query translation (Section 8.5)

The analysis in 5 could be explored further, and there is also a list of other interesting future work that HePToX could pursue:

• In this chapter, we discussed how to use the "Nto1 Expression" and "1toN Expression" to translate a query under an N:1 or 1:N mapping expressed in a mapping rule.
However, here we implicitly assumed that the elements to be mapped all exist in the same mapping rule. What if they actually exist in different mapping rules? Then how represent the N:l mapping with multiple mapping rules and how to translate queries in that case? • After N:l and 1:N mapping is fully explored in HePToX, a natural exten-sion is to deal with N:M mapping which is even more complicated. Chapter 8. Exploring 1:N and N:l Mappings in HePToX 73 Input: source groups, and target groups Output: a set of mapping rules For every target group tg Let {sgi,sgj} be the set of source groups connected to tg by arrows Let TE(tg) be the tree expression for group tg 1.Start with the rule skeleton TE{tg)^-TE(sgi),...,TE{sgj) For each leaf node leaf .node in TE(tg) 1.1 If it has an arrow incident from only 1 node in source, Assign its id the incident variable in source 1.2 If it participates in an Ntol Mapping, Assign its id to an Ntol Variable Assign this Ntol Variable to an Ntol Expression which specifies how it is related to the N nodes in the rule body Write this assignment to with the Ntol expression in rule body 1.3 If it participates in a ltoN Mapping, Assign its id to a ltoN Variable Assign this ltoN Variable to a ltoN Expression which specifies how it is related to the N nodes in the rule head Write this assignment with the ltoN expression in rule body 2. For root and each of its descendants via only single-valued edges Assign their ids as distinct Skolem functions of the root variable in the source 3. For each internal node inode 3.1 Assign its id as a distinct Skolem function with variables of all mode's single-valued leaf descendants. 
Let sc denote mode's single-valued leaf descendant(s) If a sc has an arrow incident from only 1 node in source, Use the variable associated with sc via the arrow If a sc is in an Ntol Mapping, Use the Ntol Variable created from 1.2 above If a sc is in an a ltoN Mapping, Use ltoN Variable created from 1.3 above 3.2 If any of sc does not belong to tg Trace the source group sgv with an arrow pointing to sc Add TE(sgp) in the rule body: TE(tg)<—TE(sgi),TE(sgj),TE{sgv) ' Let scref denote the corresponding element of sc in the source schema If scref points to a node in the source schema scid and the source group sgq that scid belongs to is not in {sgi,sgj} Add TE(sgq) in the rule body: TE(tg)^TE(sgi),TE(sgj),TE(sgp),TE(sgq) Equate the variable scref binds to ($scref) with the variable scid binds to ($scid): $scref=$scid/text() (if scid is an attribute, then use $scref=$scid instead) Add %scref as an argument to the Skolem function on the RHS of inode Figure 8.8: Rule Construction Algorithm for 1:1, N:l and 1:N Mappings Chapter 8. Exploring 1:N and N:l Mappings in HePToX 74 catalogue author2 name2 book2 (a) Source Schema title2 ^ji (b) Target Schema Figure 8.9: Book-Author example schema in HePToX catalogue —> f .catalogue($bib) [author2 —> f .author2(%namel) [name2 —> %namel, book2 -> /J>oofc2($tztfel).[title2--»StitZel]]] bib-»$6i6 [bookl —> %book\ [authorl —* $authorl[namel —> $namel] t i t l e l ->$tifZel]] Figure 8.10: Mapping Rules for Book-Author example in-Figure 8.9. Chapter 8. Exploring 1 :N and N:l Mappings in HePToX 75 Algorithm TransForward. Input: a join of tree pattern queries Qi, mapping rule set from A i —> A 2 Output: one translated query in the form of join of TPs. 1. For each TP Qi, translate it: let Q't = TransTPF(Qi). 2. Rename the variables apart in the translated TPs. 3. Add join conditions to the set of translated TPs (with variable renaming). 4. 
Chase the set of TPs along with schema key constraints, the join conditions and other query conditions, merging nodes as necessary. 5. Obtain contractions of previous query Q' by recursively removing appropriate dummy nodes bottom up. Let Ql be the resulting query obtained by replacing Skolem terms with associated node tags. 6. Translate the final merged TP (or set of TPs) Ql into XQuery. Procedure TransTPp. 1. For each mapping rule fij = hj *— bj a. Find a substitution and expand Qi to Q\j by matching the rule body rbj to the query pattern Qi For each node in Q|- and not in Qi, mark it as dummy If it is a distinguished node in Qi, Mark it as distinguished in the expanded query Q\j b. Translate each Q\j into Q\* by applying fj.j If a leaf node in rule head %rhleaf is associated with an JVtol Variable If %rhdeaf contains at least one non-dummy node in its JVtol Expression, then it is non-dummy; otherwise it is marked dummy. If one of its associated variables in the JVtol Expression is distinguished, then it is marked distinguished. If %rhdeaf is associated with a 1 toN Variable If its associated node in source in the ltoJV Expression is non-dummy, then %rh.leaf are non-dummy and they propagate the constraints from the 1 node from source according to the ltoJV Expression If its associated node in source in the ltoJV Expression was marked distinguished, then %rh.leaf is marked distinguished 2. Stitch together all Q\j to obtain resulting query Q\. Nodes with same or unifiable Skolem terms get stitched by looking at the corresponding substitutions. Figure 8.11: Forward Query Translation Algorithm with 1:1, N:l and 1:N Map-pings Chapter 8. Exploring 1:N and N:l Mappings in HePToX 76 Algorithm TransBackward. Input: a join of tree pattern queries Qi, mapping rule set from Ai —* A2 Output: one translated query in the form of join of TPs. 1. For each TP Qi, translate it: let Q\ = TransTPB(<2i)-2. Rename the variables apart in the translated TPs. 3. 
Add join conditions to the set of translated TPs (with variable renaming).
4. Chase the set of TPs along with the schema key constraints, the join conditions and other query conditions, merging nodes as necessary.
5. Obtain a contraction of the previous query Q' by recursively removing appropriate dummy nodes bottom up. Let Q'' be the resulting query obtained by replacing Skolem terms with associated node tags.
6. Translate the final merged TP (or set of TPs) Q'' into XQuery.

Procedure TransTPB.
1. For each mapping rule μj = hj <- bj:
a. Find a substitution and expand Qi to Qij by matching the rule head rhj to the query pattern Qi.
For each leaf node in Qij and not in Qi, mark it as dummy.
If it is a distinguished node in Qi, mark it as distinguished in the expanded query Qij.
b. Translate each Qij into Q'ij by applying μj.
If a leaf node $rb_leaf in the rule body is associated with an N-to-1 variable:
If $rb_leaf contains at least one non-dummy node in its N-to-1 expression, then it is non-dummy; otherwise it is marked dummy.
If one of its associated variables in the N-to-1 expression is a distinguished variable, then it is marked distinguished.
If $rb_leaf is associated with a 1-to-N variable:
If its associated node in the source in the 1-to-N expression is non-dummy, then the $rb_leaf nodes are non-dummy and they propagate the constraints from the single source node according to the 1-to-N expression.
If its associated node in the source in the 1-to-N expression was marked distinguished, then $rb_leaf is marked distinguished.
2. Stitch together all Q'ij to obtain the resulting query Q'i. Nodes with the same or unifiable Skolem terms get stitched by looking at the corresponding substitutions.

Figure 8.12: Backward Query Translation Algorithm with 1:1, N:1 and 1:N Mappings
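The Skolem terms in these rules (e.g. f_author2($name1) in Figure 8.10) double as grouping keys: two generated target nodes coincide exactly when their Skolem function and arguments do, which is also what the stitching step of TransTPF/TransTPB exploits. The following Python sketch is ours, not part of HePToX; element names come from the Book-Author example, and the sample data is invented. A dictionary plays the role of the Skolem-function cache:

```python
import xml.etree.ElementTree as ET

SOURCE = """
<bib>
  <book1><author1><name1>Knuth</name1></author1><title1>TAOCP</title1></book1>
  <book1><author1><name1>Knuth</name1></author1><title1>Concrete Math</title1></book1>
  <book1><author1><name1>Aho</name1></author1><title1>Dragon Book</title1></book1>
</bib>
"""

def apply_book_author_rule(src: ET.Element) -> ET.Element:
    """Materialize the target of Figure 8.10: one author2 node per distinct
    Skolem term f_author2(name1), with that author's books nested below it."""
    catalogue = ET.Element("catalogue")
    skolem_cache = {}                      # Skolem term -> created element
    for book1 in src.findall("book1"):
        name = book1.findtext("author1/name1")
        title = book1.findtext("title1")
        key = ("f_author2", name)          # Skolem function name + argument
        author2 = skolem_cache.get(key)
        if author2 is None:                # first occurrence creates the node
            author2 = ET.SubElement(catalogue, "author2")
            ET.SubElement(author2, "name2").text = name
            skolem_cache[key] = author2
        book2 = ET.SubElement(author2, "book2")
        ET.SubElement(book2, "title2").text = title
    return catalogue

cat = apply_book_author_rule(ET.fromstring(SOURCE))
print(len(cat.findall("author2")))
print([b.findtext("title2") for b in cat[0].findall("book2")])
```

The same identity test on Skolem terms that merges the two Knuth books here is what lets the translation algorithms stitch partial query translations on unifiable Skolem terms.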
Figure 8.13: Tree Pattern Query representation of Q1 [tree diagram omitted]. The pattern binds $book', $author', $ln' and $title', with name($book') = book1, name($author') = author1, name($ln') = LName, name($title') = title1, and $title'/text() = "XML".

Figure 8.14: Query Translation of Q1 in HePToX with N:1 Mapping [diagram omitted]. (a) Q1 is expanded over the source schema, with bib marked dummy; (b) it is translated using the mapping rule, with $name = concat($fn, $ln) and substitutions $book' -> $book, $title' -> $title, $author' -> $author, $ln' -> $ln; (c) the translated query for the target schema is:
FOR $a in /catalogue/author2, $b in $a/book2
WHERE $b/title2 = "XML"
RETURN {$a/name2}

Figure 8.15: Price-Salary Example, with both a 1:N Mapping and an N:1 Mapping [schema diagrams omitted]. (A) Source schema: bib/book, with author/salary, priceinfo (price, tax) and isbn. (B) Target schema: cat/book, with author (base, benefits), priceinfo/cost and isbn, where cost = price * (1 + tax) and salary = base + benefits.

cat -> f_cat($bib)
  [book -> f_book($price, $tax, $isbn)
    [author -> f_author($base, $benefits)
       [base -> $base, benefits -> $benefits],
     priceinfo -> f_priceinfo($price, $tax) [cost -> $cost],
     isbn -> $isbn]]
<-
bib -> $bib
  [book -> $book
    [author -> $author [salary -> $salary],
     priceinfo -> $priceinfo [price -> $price, tax -> $tax],
     isbn -> $isbn]],
$salary = $base + $benefits,
$cost = $price * (1 + $tax).

Figure 8.16: Mapping Rule with both N-to-1 and 1-to-N Expressions for the Price-Salary example in Figure 8.15.

Figure 8.17: Tree Pattern Query representation of Q2 [tree diagram omitted]: //book with children author/salary (> 10K) and price (< 100).
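The N-to-1 and 1-to-N expressions in Figure 8.16 are what allow query predicates to cross the mapping: a predicate over a mapped variable is rewritten by substituting that variable's defining expression from the rule. A minimal string-level sketch of this substitution (Python; variable names from Figure 8.16; a real implementation would rewrite the tree pattern's constraint annotations rather than raw strings, and the naive replace ignores variables that are prefixes of one another):

```python
# Mapping expressions of Figure 8.16, written as Python dicts from the
# defined variable to its defining arithmetic over the other side's variables.
NTO1_EXPRS = {"$cost": "($price * (1 + $tax))"}        # N source vars -> 1 target var
ONE_TO_N_EXPRS = {"$salary": "($base + $benefits)"}    # 1 source var -> N target vars

def propagate(predicate: str, exprs: dict) -> str:
    """Replace each mapped variable occurring in a query predicate by its
    defining expression, so the predicate can be evaluated on the other peer."""
    for var, expr in exprs.items():
        predicate = predicate.replace(var, expr)
    return predicate

# A target-side predicate on cost, translated against the rule direction:
print(propagate("$cost < 100", NTO1_EXPRS))
# A source-side predicate on salary, translated along the rule direction:
print(propagate("$salary > 10000", ONE_TO_N_EXPRS))
```

This produces ($price * (1 + $tax)) < 100 and ($base + $benefits) > 10000, the kind of propagated constraints that appear in the Q2 and Q3 translations below.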
Figure 8.18 [diagram omitted]: (a) Q2 is expanded over the source schema, with bib, priceinfo and isbn marked dummy; (b) it is translated using the mapping rule, with substitutions $b' -> $book, $a' -> $author, $s' -> $salary, $p' -> $price and constraints $base + $benefits > 10K and $cost < 100; (c) the translated query for the target schema is:
FOR $b in /cat/book, $a in $b/author, $base in $a/base/text(), $benefits in $a/benefits/text(), $cost in $b/priceinfo/cost/text()
WHERE $base + $benefits > 10K AND $cost < 100
RETURN {$base + $benefits}

Figure 8.18: Query Translation of Q2 - translation in the mapping rule direction

Figure 8.19: Tree Pattern Query representation of Q3 [tree diagram omitted]: //book with children author/base (> 10K) and cost (< 100).

In Figure 8.20, Q3 is expanded from the rule head ((a), with benefits and isbn marked dummy) and translated ((b)) with substitutions $book' -> f_book($price, $tax, $isbn), $author' -> f_author($b, $benefits), $b -> $base, $pi' -> f_priceinfo($price, $tax), $c -> Nto1($price, $tax).
The propagated constraints are $salary > 10K and $price * (1 + $tax) < 100, and the translated query for the source schema is:
FOR $b in /bib/book, $s in $b/salary/text(), $p in $b/priceinfo, $price in $p/price/text(), $tax in $p/tax/text()
WHERE $s > 10K AND $price * (1 + $tax) < 100
RETURN {$s}

Figure 8.20: Query Translation of Q3 - translation against the mapping rule [diagram omitted]

info2 -> f_info2($info1)
  [book2 -> f_book2($title1, $region1, $price, $tax)
    [cost -> $cost, title2 -> $title1, region2 -> $region1]]
<-
info1 -> $info1
  [book1 -> $book1
    [title1 -> $title1, region1 -> $region1, price -> $price, tax -> $tax]],
$cost = $price * (1 + $tax).

Figure 8.21: Mapping Rule with N-to-1 Mapping for the Price-Cost Example in Figure 8.6.

Chapter 9

Experimental Study

In the previous chapters, we discussed the various algorithms used in the HePToX system; in this chapter, we study the performance of HePToX along the processes of mapping generation, query translation and query processing. We also investigate the scalability of HePToX w.r.t. the number of peers and the data heterogeneity of the network.

9.1 Experimental Guidelines

In all the experiments, we consider a P2P network consisting of several peers, with schema mappings established between some pairs of peers. Each peer joining the network usually maps its schema to that of a few acquaintances. Any peer in the network may similarly map its schema to this new peer's schema. Any user can transparently ask a query conforming to her local schema. Thus, she does not have to be aware of the machinery underlying the mappings between schemas in order to pose queries. Using the mappings inferred and created by the system and the query translation algorithm, the query is propagated and translated to the peer's acquaintances. These translated queries are consistent with the schemas of the acquaintances, so they can be run against their local data, and the answers are sent to the originating peer.
In turn, these acquaintances propagate and translate the query to their own acquaintances, and the query eventually reaches all the peers. We define a semantic path as a path that leads from an originating peer to a destination peer, such that there is a mapping for each pair of adjacent peers in the path. Semantic paths are non-directional, as mappings can be crossed in either direction, and each peer is transitively connected to every other peer. Notice that there may exist multiple paths connecting a given pair of peers. In the current setting, we select the shortest path. Other criteria may be used here, such as the size of intermediate answers or the coverage of the query by the schema mappings, which are beyond the scope of this thesis. Cycles may also occur in a semantic path; these are handled by marking each query with a globally unique id. Peers do not further process queries with ids already seen, thus avoiding redundant work. Thanks to P2P load balancing, HePToX peers execute queries on their local data and retrieve query results that are relatively concise, thus reducing the network load. In this section, we demonstrate that HePToX query evaluation is quite efficient and that local resource consumption is reduced to the minimum, as each peer only needs to remember the addresses and mappings of its acquaintances. This localization means that a user asking a query does not need to know exactly where the answers are coming from, nor how many peers actually respond to her query, making the P2P paradigm a natural setting for HePToX.

9.2 Implementation and Setup

We use Emulab [15], a network emulation testbed, to get a realistic P2P database. Emulab consists of a collection of PCs whose network delay and bandwidth can be set at will. Emulab allows the full allocation of a real machine's resources.
Indeed, we observed that XML query answering on each peer is computationally demanding and cannot rely on partially allocated machines, as happens in, e.g., PlanetLab, justifying our choice of Emulab. We could get 50 real machines in total from the Emulab network. Moreover, we chose a 70ms delay and a 50MB bandwidth to simulate the behavior of geographical networks as closely as possible. As the network protocol among Emulab machines, we mounted FreePastry [26], which offers a scalable and efficient P2P routing algorithm. Its O(log N) routing complexity and O(log N) routing table size are at least as efficient as those of CAN, Tapestry, Chord, etc. The current implementation of HePToX exhibits high modularity, and it can run with any XML query engine by implementing a simple API. As a query engine for XML data, we chose QIZX [30], the fastest open-source XQuery engine we could find. We used FreePastry v1.3.2 and QIZX v0.4p1, respectively. HePToX is written in Java, making it cross-platform.

Q    Query Description                                       QMC
Q1   Selection with 1 filter, Mich. QR3, etc.                11.4%
Q2   Selection with 2 filters, Mich. QS5, etc.               13.7%
Q3   Selection with 3 filters, Mich. QS16, etc.              20.7%
Q4   Selection with 2 filters (1 nested), Mich. QS18, etc.   11.5%
Q5   Selection with 3 filters (2 nested), Mich. QS34, etc.   11.8%
Q6   1 Join, Selection with 2 filters, Mich. QJ1, etc.       28.6%
Q7   1 Join, Selection with 1 filter, Mich. QJ3, etc.        28%
Q8   3 Joins, Selection with 1 filter, no corresp. Mich.     62.5%
Q9   6 Joins, Selection with 1 filter, no corresp. Mich.     100%
Q10  9 Joins, Selection with 1 filter, no corresp. Mich.     100%

Table 9.1: Query #, Query Description and QMC (% of Query Mapping Coverage).

9.3 Datasets and Queries used for Experiments

In order to probe the efficiency of the query translation algorithm (Figure 6.5 and Figure 6.12), we considered both synthetic and real XML datasets.
As the synthetic dataset, we chose XMark [31], for which we derived 9 restructured variations, covering a variety of structural transformations, to produce 10 different XMark schemas randomly scattered across the network. A detailed description of the schemas can be found in Appendix A and Appendix B, and also on the HePToX home page [17]. We modified the XMark xmlgen code accordingly to generate data sets conforming to the schemas above, with an average size of 50MB. On the XMark schemas we ran a comprehensive set of 10 queries: 7 queries are borrowed from the UMichigan XML benchmark queries (adapted to the XMark schema) and 3 queries are multiple-join queries of increasing complexity. A summary description of these queries is given in Table 9.1; complete query specifications can be found in Appendix C and also at [17]. The UMichigan XML benchmark queries were more suitable than the XMark queries themselves for probing the effectiveness of our algorithms, since they build around the XML data structures and let us leverage the heterogeneity of our schema variations. Since the UMichigan XML benchmark queries are written on a less realistic dataset than XMark itself, we decided to keep XMark as the synthetic dataset for the experiments. Similarly, we also wanted to probe the effectiveness of our translation algorithm on real data. For this purpose, we chose the DBResearch collection of 19 XML schemas used in Piazza [32]. For this dataset, we ran the same queries used in the Piazza experiments, which amount to multiple-join queries of increasing complexity. These queries can be found in Appendix D. We measured the average number of rules across any pair of schemas, which is equal to 8.67 for the synthetic dataset. Table 9.1 reports for each query the query mapping coverage (QMC), i.e., the number of rules traversed by that query divided by the average number of rules on any schema pair, as a percentage.
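The QMC computation just described can be sketched in a few lines (Python; the 8.67 average is the figure measured above, and since the reported percentages are averaged over schema pairs they need not match this formula exactly):

```python
AVG_RULES_PER_PAIR = 8.67   # measured average number of rules per schema pair

def qmc(rules_traversed: float, avg_rules: float = AVG_RULES_PER_PAIR) -> float:
    """Query Mapping Coverage: share of the mapping a query exercises, in %."""
    return 100.0 * rules_traversed / avg_rules

# A selection query traversing a single mapping rule covers about 11.5% of
# the mapping, in line with the simple selection queries of Table 9.1:
print(round(qmc(1), 1))
```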
It can be observed from Table 9.1 that the QMC of all the benchmark queries is up to 25% of the total coverage, whereas the ad-hoc join queries are close to 100% of the total coverage. This shows that the queries in Table 9.1 satisfactorily span the QMC range.

9.4 Rule Inference Algorithm in HePToX

The HePToX GUI [10] allows the user to draw a few arrows/boxes and generates the corresponding rules accordingly. To give an idea of the performance of the algorithms of Chapter 4, we measured the time to infer the rules for the XMark schema variations (DBResearch schemas, respectively), having an average of 65 (7) arrows between them. Our algorithm generates an average number of 9 (3) rules in about 45 (25) ms (tested on a P4 machine with a 3.0GHz CPU and 2GB of memory).

9.5 Query Translation in HePToX

In this experiment, we probe the effectiveness of the query translation algorithm under different network configurations. In particular, different parameters may affect query translation time: (a) the average number of acquaintances across all sets; (b) the degree of 'heterogeneity' (ranging from all acquaintances having the same schema to all acquaintances having different schemas); and (c) the overall number of distinct heterogeneous schemas scattered across the network. In this experiment, we investigate how (a)-(c) impact query translation time.

Figure 9.1: HePToX Query Translation and Query Performance for the XMark Dataset [plots omitted]. Panels: (a) avg QT time w.r.t. avg number of acquaintances; (b) avg number of semantic hops w.r.t. degree of heterogeneity; (c) avg QT time w.r.t. total number of distinct schemas; (d) time composition for querying; (e) query completion w.r.t. timeout; (f) scalability of the QE algorithm w.r.t. number of peers.
In experiment (a), we set the total number of distinct schemas in the network to 10, and measure the QT time as the average number of acquaintances per peer varies from 2 to 10. The result reported in Figure 9.1(a) shows that, in order to keep the number of translations reasonably low, we have to choose an average number of acquaintances of (at least) 4. This number of acquaintances for a 50-peer network is indeed sufficient to get an acceptable average length of semantic paths for that network, as also confirmed by the next experiment (b). Indeed, Figure 9.1(b) plots the average number of hops (or length of semantic paths) achieved when the average number of acquaintances is equal to 4 (few queries are reported to avoid clutter). The x axis in this case represents the degree of 'heterogeneity' increasing from 1 to 10, meaning that 1 to 10 acquaintances have a distinct schema. It can be noted that the average length of semantic paths (reaching 3) stabilizes with (at least) a degree of heterogeneity of 3. These two experiments led us to choose an average number of acquaintances equal to 4 in the remainder. Figure 9.1(c) shows how the total number of distinct schemas present in the network affects the average QT time. This experiment shows that the average QT time grows almost linearly with the number of distinct schemas up to 5 and stabilizes after 5. It lets us conclude that variations in the QT time can be appreciated when the number of schemas is less than or equal to the number of acquaintances (i.e., 4) and are blurred otherwise.

9.6 Query Performance in HePToX

The next experiment aims at showing the minimal overhead introduced by our translation algorithm along the query answering process. In Figure 9.1(d), we highlight the various time components (%) taken by query translation, network delay, and local query answering respectively, while the actual times (in ms) are reported on top of each bar.
It can be noted that query translation takes a negligible time compared to network delay and local query answering. The latter, essentially due to QIZX, was the bottleneck for all queries and caused the most complex ones to crash. Query answers to Q8, Q9 and Q10 are omitted from this plot, since they neither completed nor yielded any answer within the given timeout. We tried different query engines before choosing QIZX, which is considered the fastest one, so this behavior was definitely outside our control. The next experiment complements the previous one by measuring the % of queries completed within a specified timeout with no heterogeneity and with the maximum heterogeneity in the network, respectively. Figure 9.1(e) compares the % of queries completed when the peers all have different schemas (maximum heterogeneity) with the baseline case when all peers exhibit the same schema (no heterogeneity). We can see that, for instance, the difference between the two curves for queries Q3 and Q7 is at most 1000 ms, showing the overall negligible impact of translation on query processing.

9.7 Scalability of HePToX

The last experiment, in Figure 9.1(f), shows the scalability of HePToX P2P databases. We report the query completion time (including the query translation time, query evaluation and network delay) when varying the number of peers in the network up to 5000 peers. This experiment was executed on the Pastry simulator alone, as the Emulab real machines would not be enough. It can be observed that query completion follows a quite regular logarithmic curve.

Figure 9.2: Time Composition for Querying in HePToX for the DBResearch Dataset [plot omitted]

9.8 Real Dataset

We next considered the real dataset, DBResearch. We ran the previous experiments on all the Piazza queries. The time composition graph is shown in Figure 9.2. We obtained in our syntax the same 29 mappings employed in Piazza.
It can be noted that the behavior of our query translation algorithm is similar for both synthetic and real data. The times reported in Figure 9.2 are reduced compared with the XMark datasets due to the smaller size of the DBResearch data (up to only 27KB per peer). In summary, the extensive set of experiments on both synthetic and real data sets shown above has demonstrated HePToX's utility and scalability.

Chapter 10

Conclusions and Future Work

In this thesis, we have presented the HePToX P2P XML database system, focusing on the following key conceptual contributions:
• An informal mechanism for specifying correspondences using arrows and boxes (Chapter 3);
• A novel algorithm for inferring precise mapping rules, expressed in a declarative language, between schemas, based on input correspondences expressed as arrows and boxes (Chapter 4);
• A study of the significance of the class of transformations captured by the mapping rules and their equivalence to algebraic transformations (Chapter 5);
• Semantics of peer queries, and an efficient and novel query translation algorithm that handles a simple but significant fragment of XQuery (Chapter 6);
• Utility and scalability of HePToX's approach, demonstrated with a detailed set of experiments.
A key novelty of HePToX's contribution is the handling of mappings between heterogeneous peer schemas that involve an interplay between data and meta-data, which, to the best of our knowledge, is not captured in prior work on schema mappings and query translation. HePToX's way of handling such data/meta-data conflicts also yields a nice mechanism for mappings between internal nodes (i.e., complex elements) across schemas. We have also started extending the framework to handle N:1 and 1:N mappings and larger query classes (e.g., GTP), as shown in Chapter 7 and Chapter 8. These are HePToX's ongoing work, and more will be explored in these areas in future research.
Handling union mappings and other types of N:1, 1:N and M:N mappings, as well as larger query classes, are HePToX's major future work directions.

Bibliography

[1] S. Al-Khalifa and H.V. Jagadish. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE, 2002.
[2] S. Amer-Yahia, S. Cho, L.V.S. Lakshmanan, and D. Srivastava. Minimization of Tree Pattern Queries. In SIGMOD, 2001.
[3] M. Arenas, P. Barcelo, R. Fagin, and L. Libkin. Locally Consistent Transformations and Query Answering in Data Exchange. In PODS, 2004.
[4] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R.J. Miller, and J. Mylopoulos. The Hyperion project: From data integration to data coordination. SIGMOD Record, 32(3), 2003.
[5] M. Arenas and L. Libkin. XML Data Exchange: Consistency and Query Answering. In Proc. of PODS, 2005.
[6] A.Y. Halevy, Z.G. Ives, D. Suciu, and I. Tatarinov. Schema Mediation in Peer Data Management Systems. In ICDE, 2003.
[7] A.Y. Halevy, Z.G. Ives, P. Mork, and I. Tatarinov. Piazza: Data management infrastructure for semantic web applications. In WWW, 2003.
[8] M. Benedikt, C. Chan, W. Fan, J. Freire, and R. Rastogi. Capturing both Types and Constraints in Data Integration. In VLDB, 2003.
[9] P. Bohannon, W. Fan, M. Flaster, and P.P.S. Narayan. Information Preserving XML Schema Embedding. In Proc. of VLDB, 2005.
[10] A. Bonifati, E.Q. Chang, T. Ho, L.V.S. Lakshmanan, and R. Pottinger. HePToX: Marrying XML and Heterogeneity in Your P2P Databases (demo). In Proc. of VLDB, 2005.
[11] N. Bruno, N. Koudas, and D. Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. In SIGMOD, 2002.
[12] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Logical Foundations of Peer-To-Peer Data Integration. In Proc. of ACM PODS, 2004.
[13] Z. Chen, H.V. Jagadish, L.V.S. Lakshmanan, and S. Paparizos. From Tree Patterns to Generalized Tree Patterns: On Efficient Evaluation of XQuery. In VLDB, 2003.
[14] Alin Deutsch and Val Tannen. Reformulation of XML Queries and Constraints. In Proc. of ICDT, 2003.
[15] Emulab home page. http://www.emulab.net.
[16] E. Rahm and P.A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334-350, 2001.
[17] HePToX. http://www.cs.ubc.ca/labs/db/heptox.
[18] J. Madhavan, P.A. Bernstein, and E. Rahm. Generic Schema Matching with Cupid. In Proc. of VLDB, 2001.
[19] Anastasios Kementsietsidis, Marcelo Arenas, and Renee J. Miller. Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues. In Proc. of SIGMOD, 2003.
[20] L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian. Logic and algebraic languages for interoperability in multi-database systems. Journal of Logic Programming, 33(2):101-149, 1997.
[21] A.Y. Levy, A.O. Mendelzon, Y. Sagiv, and D. Srivastava. Answering Queries Using Views. In PODS, 1995.
[22] R.J. Miller, L.M. Haas, and M.A. Hernandez. Schema Mapping as Query Discovery. In VLDB, 2000.
[23] R.J. Miller, L.M. Haas, and M.A. Hernandez. Schema Mapping as Query Discovery. In VLDB, 2000.
[24] W.S. Ng, B.C. Ooi, K.L. Tan, and A.Y. Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. In Proc. of ICDE, 2003.
[25] P.A. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu. Data Management for Peer-to-Peer Computing: A Vision. In Proc. of WebDB, 2002.
[26] Pastry home page. http://research.microsoft.com/~antr/Pastry/.
[27] L. Popa, Y. Velegrakis, M.A. Hernandez, R.J. Miller, and R. Fagin. Translating Web Data. In VLDB, 2002.
[28] Lucian Popa, Yannis Velegrakis, Renee J. Miller, Mauricio A. Hernandez, and Ronald Fagin. Translating Web Data. In Proc. of VLDB, 2002.
[29] R. Pottinger and P.A. Bernstein. Merging Models Based on Given Correspondences. In Proc. of VLDB, 2003.
[30] Qizx. http://www.xfra.net/qizxopen/.
[31] A. Schmidt, F. Waas, M. Kersten, M. Carey, I. Manolescu, and R. Busse. XMark: A benchmark for XML data management. In VLDB, 2002.
[32] I. Tatarinov and A.Y. Halevy.
Efficient Query Reformulation in Peer Data Management Systems. In Proc. of SIGMOD, 2004.
[33] Y. Velegrakis, R.J. Miller, and L. Popa. Mapping Adaptation under Evolving Schemas. In VLDB, 2003.
[34] XML Homepage. http://www.xml.com.
[35] XQuery Homepage. http://www.w3.org/TR/xquery/.
[36] L.L. Yan, R.J. Miller, L.M. Haas, and R. Fagin. Data-Driven Understanding and Refinement of Schema Mappings. In SIGMOD, 2001.
[37] Cong Yu and Lucian Popa. Constraint-Based XML Query Rewriting For Data Integration. In Proc. of SIGMOD, 2004.
[38] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems. In SIGMOD, 2001.

Appendix A

Schema Graphs of the 10 XMark Schemas in Experiments

Figure A.1: Original Auction Schema: auction0.dtd [schema graph omitted]
Figure A.2: Schema Variation: auction1.dtd [schema graph omitted]
Figure A.3: Schema Variation: auction2.dtd [schema graph omitted]
Figure A.4: Schema Variation: auction3.dtd [schema graph omitted]
Figure A.5: Schema Variation: auction4.dtd [schema graph omitted]
Figure A.6: Schema Variation: auction5.dtd [schema graph omitted]
Figure A.7: Schema Variation: auction6.dtd [schema graph omitted]
Figure A.8: Schema Variation: auction7.dtd [schema graph omitted]
Figure A.9: Schema Variation: auction8.dtd [schema graph omitted]
Figure A.10: Schema Variation: auction9.dtd [schema graph omitted]

Appendix B

XMark Schema Variation Descriptions

Variations from auction0.dtd to auction1.dtd:
• The element <name/> is split (not the same split as in Chapter 5) into 3 separate elements: <category-name/>, <person-name/>, <item-name/>.
• <item/>'s are split on the edge between <region/> and the different continents instead of the edges between each continent and <item/>.
(It is hard to demonstrate a merge from the XMark DTD, but the reverse of the above is a merge.)
• A new element <auction/> which groups all the common SINGLE elements under <open_auction/> and <closed_auction/> is created.
• Both <open_auctions/> and <closed_auctions/> are grouped by the value of <type/> (i.e., flip), and the values of <type/> are "Featured" and "Regular". (This can be a problem with the IBM XML generator because the #PCDATA under the original <type/> may be random text, so it is not restricted to "Featured" and "Regular".)

Variations from auction2.dtd to auction0.dtd:
• The continents are grouped under the region attribute under item (flop), and <item/> is grouped under a new element <items/> which is under <site/>.
• <edge/> under <catgraph/> is nested under <category/>. The "from" attribute of <edge/> is implied by its new parent, so it no longer exists.
• <mailbox/> under <item/> is unnested, and each <mailbox/> references the <item/> by IDREF.
• <from/> under <mail/> is renamed to <sender/>.
• <to/> under <mail/> is renamed to <receiver/>.
• <emailaddress/> is renamed to <email/>.

Variations from auction3.dtd to auction0.dtd:
• <business/> (yes/no values) is flopped on top of person; its tag name is still "business" or "nobusiness".
• <interest/> is no longer under <profile/>, which is in turn under <person/>. It is now a sibling of <person/>, and uses an IDREF under <personref/> to reference the <person/> (unnest).
• <catgraph/> is split and its child <edge/> becomes single (split).
• <edge/> is renamed to <link/>.

Variations from auction4.dtd to auction0.dtd:
• <item/> is under category, and has an extra text element <region/> to specify its region.
• A new element <auction/> which groups all the common SINGLE elements under <open_auction/> and <closed_auction/> is created.
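The "flip" used in several of these variations turns the value of <type/> into a tag name, an exchange between data and meta-data. A small Python sketch of this transformation (ours, for illustration; element names follow XMark, and the sample data is invented):

```python
import xml.etree.ElementTree as ET

SRC = """
<open_auctions>
  <open_auction><type>Featured</type><reserve>yes</reserve></open_auction>
  <open_auction><type>Regular</type><reserve>no</reserve></open_auction>
  <open_auction><type>Featured</type><reserve>no</reserve></open_auction>
</open_auctions>
"""

def flip_by_type(root: ET.Element) -> ET.Element:
    """Group <open_auction> elements under new elements named after the value
    of their <type/> child (the 'flip' of a data value into a tag name)."""
    out = ET.Element(root.tag)
    groups = {}                         # type value -> grouping element
    for auction in root.findall("open_auction"):
        type_el = auction.find("type")
        label = type_el.text            # e.g. "Featured" or "Regular"
        group = groups.get(label)
        if group is None:
            group = ET.SubElement(out, label)
            groups[label] = group
        auction.remove(type_el)         # the value now lives in the tag name
        group.append(auction)
    return out

flipped = flip_by_type(ET.fromstring(SRC))
print([child.tag for child in flipped])
print(len(flipped.find("Featured").findall("open_auction")))
```

The reverse transformation (flop) would read the grouping tag name back into a <type/> child, as in the auction2 and auction3 variations.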
Variations from auction5.dtd to auction0.dtd:
• <item/> is nested under a new element <items/> under the root, and has an extra text element <region/> to specify its region.
• <edge/> under <catgraph/> is nested under <category/>. The "from" attribute of <edge/> is implied by its new parent, so it no longer exists.

Variations from auction6.dtd to auction0.dtd:
• <item/> is under category, and has an extra text element <region/> to specify its region.
• <catgraph/> is split and its child <edge/> becomes single.

Variations from auction7.dtd to auction0.dtd:
• <open_auctions/> and <closed_auctions/> are merged into <auctions/>. Common elements of <open_auction/> and <closed_auction/> are under <auction/>.
• <open_auction/> and <closed_auction/> are renamed to <open/> and <closed/>.
• <mailbox/> under <item/> is unnested, and each <mail/> references the <item/> by IDREF.

Variations from auction8.dtd to auction0.dtd:
• <open_auction/> and <closed_auction/> are renamed to <open/> and <closed/>. Common elements of <open_auction/> and <closed_auction/> are under <auction/>. <auction/> has either <open/> or <closed/>, and each of them contains the respective elements originally under <open_auction/> and <closed_auction/>.
• Both <open_auctions/> and <closed_auctions/> are grouped by the value of <type/> (i.e., flip), and the values of <type/> are "Featured" and "Regular". (This can be a problem with the IBM XML generator because the #PCDATA under the original <type/> may be random text, so it is not restricted to "Featured" and "Regular".)
• <mailbox/> under <item/> is unnested, and each <mail/> references the <item/> by IDREF.

Variations from auction9.dtd to auction0.dtd:
• <open_auctions/> and <closed_auctions/> are renamed to <open/> and <closed/>. Common elements of <open_auction/> and <closed_auction/> are under <auction/>.
<auction/> has either <open/> or <closed/>, and each of them contains the respective elements originally under <open_auction/> and <closed_auction/>.
• Both <open_auctions/> and <closed_auctions/> are grouped by the value of <type/> (i.e., flip), and the values of <type/> are "Featured" and "Regular". (This can be a problem with the IBM XML generator because the #PCDATA under the original <type/> may be random text, so it is not restricted to "Featured" and "Regular".)
• <item/> is nested under a new element <items/> under the root, and has an extra text element <region/> to specify its region.

Appendix C

XMark Queries in Experiments

Q1. //item[./quantity = 200]/name
Michigan QR1: //eNest[@aSixtyFour = 2]/@aUnique1
Similar to Michigan QR1, QR3, QS1, QS2, QS4
Description: selection with 1 filter
XQuery: for $n in input()//item[./quantity = 200]/name return $n

Q2. //closed_auction[./price >= 50][./quantity <= 200]/type
Michigan QS5: //eNest[@aSixtyFour >= 5][@aSixtyFour <= 8]/@aSixtyFour
Similar to Michigan QS5, QS7
Description: selection with 2 filters (same level)
XQuery: for $t in input()//closed_auction[./price >= 50][./quantity <= 200]/type return $t

Q3. //europe/item[./payment = "Creditcard"]/mailbox/mail[./from = "Arie"][./to = "Joanne"]/date
Michigan QS16: //eNest[@aSixtyFour = 1]/eNest[position() = 2][@aFour = 1]/@aUnique1
Similar to Michigan QS16, QS17
Description: selection with 3 filters
XQuery: for $d in input()//europe/item[./payment = "Creditcard"]/mailbox/mail[./from = "Arie"][./to = "Joanne"]/date return $d

Q4. //person[./emailaddress = "Rosiles@sfu.ca"][./address[.
/ country = "Canada"]] j name Michigan QS18: / / eN est]©aLevel = 13] [./eNest[@aSixteen = 3})/©aUniquel Similar to Michigan QS18, QS19, QS20 Description: selection with 2 filters (one of which is a nested filter with /) XQuery: for $n in inputQ//person[./emailaddress = "Rosiles@sfu.ca"]]./'address [./country = " Canada"]]/name return $n Q5. 11 openjiuction]. I reserve — " yes"]]./itemref[@item = "itemO"]]].//personref [©person = " personO"]]/current Michigan QS34: //eNest]@aFour = l]]./eNest]©aLevel = !!]][.//eNest[@aSixtyFour - 3]] /©aUniquel Similar to Michigan QS34 Description: selection with 3 filters (2 of which are nested filter with / and //) XQuery: for $c in inputQ//openjiuction[./reserve = "yes"][./itemref[./@item = "itemO"]] [.//personref [./©person = "personO"]}/'current return $c Q6. //edge[@from = " categoryO"][@to = //item]./location = "Canada"] /incategory/©category}/©to Michigan QJ1: //eN est[@aSixtyFour = 2][@aUniquel = //eNest[@aSixtyFour = 2] /©aUniquel]/©aUniquel Similar to Michigan QJ1, QJ2 Description: 1 join, selection with 2 filters XQuery: for $cl in inputQ//edge[./@from = "categoryO"]/©to, $c2 in inputQ//item]./location = "Canada")/incategory/©category where $cl = $c2 return $cl Q7. //openjiuction]. / seller/©person — //closed jiuction[. / price = 50]/buyer/©person] / quantity Michigan QJ3: / / eOccasional[@aRef = j/eNest[©aSixtyFour = 3]/@aUniquel]/©aRef Similar to Michigan QJ3, QJ3 Description: 1 join, selection with 1 filter XQuery: Appendix C. XMark Queries in Experiments 102 for $oa in input()//open.auction, $pl in %oa/seller/©person, $q in %oa/quantity, $p2 in mput()//closed.auction[./price — 50]/buyer/©person where $pl = $p2 return $q Q8. 
Description: 3 joins
XQuery:
for $p1 in input()/site//person,
    $pid in $p1/@id,
    $pname in $p1/name,
    $p2 in input()/site//open_auction/seller/@person,
    $b in input()/site//open_auction/bidder,
    $bdate in $b/date,
    $bperson in $b/personref/@person,
    $d in input()/site//closed_auction/date
where $pid = $p2 and $bdate = $d and $bperson = $p2
return $pname

Q9.
Description: 6 joins
XQuery:
for $p1 in input()/site//person,
    $pid in $p1/@id,
    $pname in $p1/name,
    $p2 in input()/site//open_auction/seller/@person,
    $b in input()/site//open_auction/bidder,
    $bdate in $b/date,
    $bperson in $b/personref/@person,
    $d in input()/site//closed_auction/date,
    $n in input()/site//closed_auction/buyer/@person,
    $p in input()/site//open_auction/annotation/author/@person
where $pid = $p2 and $bdate = $d and $bperson = $p2 and $p2 = $n and $pid = $p and $pid = $n
return $pname

Q10.
Description: 9 joins
XQuery:
for $p1 in input()/site//person,
    $pid in $p1/@id,
    $pname in $p1/name,
    $p2 in input()/site//open_auction/seller/@person,
    $b in input()/site//open_auction/bidder,
    $bdate in $b/date,
    $bperson in $b/personref/@person,
    $d in input()/site//closed_auction/date,
    $n in input()/site//closed_auction/buyer/@person,
    $p in input()/site//open_auction/annotation/author/@person
where $pid = $p2 and $bdate = $d and $bperson = $p2 and $p2 = $n and $pid = $p and $p = $bperson and $n = $p and $bperson = $n and $pid = $n
return $pname

Appendix D. DBResearch Queries in Experiments

QR1. Find all co-authors of a given researcher
XQuery (posed on schema "uw"):
for $paper in input()/uw/area//pubs/paper,
    $author1 in $paper/author/text(),
    $author2 in $paper/author/text()
where $author1 = "Halevy" and $author2 != $author1
return <coauthor> { $author2 } </coauthor>

QR2. Find all project members' papers
XQuery (posed on schema "dbprojects"):
for $dbp in input()/dbprojects,
    $project in $dbp/project,
    $projName in $project/name/text(),
    $member in $project/members/*/person/name/text(),
    $paper in input()/dbprojects/project/pubs/paper,
    $author in $paper/author/text(),
    $title in $paper/title/text()
where $author = $member
return <paper> { $title } </paper>

QR3. Find conflicting authors
XQuery (posed on schema "pods"):
for $paper in input()/pods/conf/program/paper,
    $author1 in $paper/author/name/text(),
    $author2 in $paper/author/name/text(),
    $conf in input()/pods/conf,
    $pcMember in $conf/PC/member/name/text(),
    $confPaper in $conf/program/paper,
    $confTitle in $confPaper/title/text(),
    $confAuthor in $confPaper/author/name/text()
where $author1 = $confAuthor and $author2 = $pcMember
return <conflict>
         <pc-member> {$pcMember} </pc-member>
         <paper> {$confTitle} </paper>
       </conflict>

QR4. Find researchers who had a paper at a conference where they have been a PC member (perhaps in a different year).
XQuery (posed on schema "sigmod"):
for $conf in input()/sigmod/conf,
    $pcMember in $conf/committees/program/member/name/text(),
    $paper in $conf/session/paper,
    $title in $paper/title/text(),
    $author in $paper/author/text()
where $pcMember = $author
return <pc-author>
         <name> {$author} </name>
         <title> {$title} </title>
       </pc-author>

QR5. Find PC chairs of recent conferences and the projects they lead.
XQuery (posed on schemas "pods" and "uw"):
for $chair in schema("pods")/pods/conf/PC/program-chair/name/text(),
    $proj in schema("uw")/uw/area/project,
    $name in $proj/name/text(),
    $member in $proj/member/name/text()
where $member = $chair
return <project>
         <name> {$name} </name>
         <leader> {$member} </leader>
       </project>
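The selection semantics of the simplest queries above can be sanity-checked outside an XQuery engine. The following sketch mimics Q1 (//item[./quantity = 200]/name) using the XPath subset in Python's standard-library ElementTree; the toy document is invented for illustration and is not XMark-generated data.

```python
# Minimal sketch: emulating Q1's filter with ElementTree's XPath subset.
# The sample XML below is made up for illustration only.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<site>
  <items>
    <item><name>Plate</name><quantity>200</quantity></item>
    <item><name>Bowl</name><quantity>100</quantity></item>
  </items>
</site>
""")

# ElementTree's [tag='text'] predicate compares a child element's text,
# playing the role of [./quantity = 200] in Q1.
names = [n.text for n in doc.findall(".//item[quantity='200']/name")]
print(names)  # ['Plate']
```

Note that ElementTree compares text lexically, so the predicate value is the string '200'; a full XQuery engine would apply numeric comparison as in the experiments.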
