On Detecting and Repairing Inconsistent Schema Mappings by Terence Cheung-Fai Ho B.Sc., The University of British Columbia, 2002 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University of British Columbia (Vancouver) October 2008 © Terence Cheung-Fai Ho, 2008 Abstract Huge amount of data flows around the Internet every second, but for the data to be useful at its destination, it must be presented in a way such that the target has lit tle problem interpreting it. Current data exchange technologies may rearrange the structure of data to suit expectations at the target. However, there may be semantics behind data (e.g. knowing the title of a book can determine its #pages) that may be violated after data translation. These semantics are expressed as integrity con straints (IC) in a database. Currently, there is no guarantee that the exchanged data conforms to the target’s ICs. As a result, existing applications (e.g. user queries) that assume such semantics will no longer function correctly. Current constraint repair techniques deal with data after it has been translated; thus take no consid eration of the integrity constraints at the source. Moreover, such constraint repair methods usually involve addition/deletion/modification of data, which may yield incomplete or false data. We consider the constraints of both source and target schemas; together with the mapping, we can efficiently detect which constraint is violated and suggest ways to correct the mappings. 11 Table of Contents Abstract H Table of Contents . H’ List of Tables vi List of Figures vi’ 1 Introduction 2 Definitions 10 2.1 DTD definition 10 2.2 Simple DTD definition 11 2.2.1 Choice-free 12 2.2.2 Acyclic 12 2.2.3 Duplicate-free 13 2.2.4 Unorderness 13 2.3 2.4 2.5 2.6 . 14 DTD Group 2.3.1 Single Path 14 2.3.2 DTD Group. 15 XML Database . 16 . 2.4.1 Simplifying text nodes to attributes 16 2.4.2 Notations 16 Missing Values and NULL Values 17 XML constraint definitions 17 2.6.1 Functional dependency 111 17 2.6 2.7 2.8 2.9 17 XML constraint definitions 2.6.1 Functional dependency 17 2.6.2 Inclusion dependency 18 18 Schema Mapping from HepToX Relational Constraint Definitions 20 2.8.1 Relational Functional Dependency 20 2.8.2 Relational Inclusion Dependency 20 2.8.3 Typed Relational IND 21 21 Typed XML Inclusion Dependency 2.10 Guaranteeing No Interaction of relational FD and IND Implication 2.10.1 Proper Circular 22 2.10.2 Split-freeness 23 2.10.3 Intersection 23 2.10.4 Reduced set of FDs and INDs 2.10.5 Theorem that Guarantees Non-Interaction between FD and 24 24 IND 3 25 Efficient XML Constraint Implication 3.1 Reduction of XML constraint implication to relational constraint implication 25 3.1.1 Assumptions 27 3.1.2 XML Database Equivalence 3.1.3 Schema Transformation Function 3.1.4 Constraint Transformation Function 3.1.5 Data Transformation Functions —u(D), Ic’(DR) Correctness of XML Constraint Propagation 3.1.6 3.1.7 3.1.8 3.2 22 27 () — u(t) 29 30 — . 35 XML to Relational Constraint Implication Reduction Algorithm to Test XML Constraint implication by Rela . tional Constraint Implication Guaranteeing No Interaction Between XML FDs and INDs 32 38 39 . . 40 40 3.2.2 Retaining Proper Circular Property Retaining Reduced Property 3.2.3 Inability to Retain Intersection Property 44 3.2.1 iv 43 3.3 4 4.2 4.3 6 3.2.5 Summary of non-interaction between FDs and INDs 44 . 45 Efficient implication of XIVIL constraints 45 3.3.1 Retaining Typed property 45 3.3.2 Implication for typed relational INDs 46 3.3.3 Implication for relational FDs 46 3.3.4 Summary of XML Constraints Implication 47 48 Constraint Translation 48 4.1.1 FD Translation 49 4.1.2 IND Translation 50 Testing Constraint Propagation 52 4.2.1 Testing FD Propagation 53 4.2.2 Testing IND Propagation 56 Constraint Propagation Summary 58 Detecting Constraint Violations 59 5.1 Checking for FD violation 59 5.2 Checking for IND violation 61 64 65 65 66 67 Resolving Violations 6.1 6.2 Repairing FD Violations 6.1.1 Scaling Back 6.1.2 Source Concatenation Repairing IND Violations 6.2.1 Scaling Back 67 6.2.2 Split Correspondence 68 Summary of Repairs 69 Conclusion and Future Work 70 6.3 7 Retaining Split-freeness Property Constraint Propagation 4.1 5 3.2.4 72 Bibliography V List of Tables 1.1 UBCBookstore 1 2 1.2 UBCLibra,y 3 3.1 ‘I’(LUBCBookstore) 31 3.2 I1(uBcBooksiore) 33 3.3 U(DUBCBookstore) 34 vi List of Figures 2 to AuflCLjbrary Example 1.1 XUBCBookstore 1.2 UBCBookstore Instance Example 5 1.3 Translated UBCLibrary Instance Example 5 1.4 Mapping Rules M for UBCBookstore—UBCLibrary Example 2.1 AUBCBookstore 15 3.1 Equivalent UBCBookstore Instance Example 28 3.2 Schema Transformation Algorithm 29 3.3 Creating Transformation Generated Constraints Algorithm 3.4 FD transformation Algorithm — . i (A) — P(A) 6 30 3.5 u (FD) u (IND) IND transformation Algorithm 3.6 Constraints Transformation Algorithm 3.7 Data Transformation Algorithm 33 3.8 Reverse Data Transformation Algorithm 35 — — — p. () Algorithm Testing XML Constraint Implication 3.10 Example Showing Circular INDs after Transformation 3.9 31 31 32 39 41 50 4.1 FD Translation Algorithm 4.2 IND Translation Algorithm 4.3 Testing FD Propagation Algorithm 55 4.4 Testing IND Propagation Algorithm 57 5.1 FD violation Checking Algorithm 60 5.2 IND violation Checking Algorithm 62 — 6(Fs,M) ö(Is,M) — vii 51 65 66 67 6.1 FD Violating Example 6.2 6.3 Fixing FD Violation by Concatenation Example Translated UBCLibrary After Repair By Concatenation Mapping 6.4 IND Violation Example 68 6.5 IND Violation Example Fixed by Split Correspondence 69 . viii . Glossary DR a relational database, 20 DuBcBookstore The XML Database for UBCBookstore, 4 A a DTD definition, 1 4) a function that returns child DTD elements of a DTD element, 11 fl a function that returns DTD attributes of a DTD element, 11 A set of Constraints, 2 operator that states two XML databases contain the same collection of data, 27 function to transform XML to relational, 25 function to reverse transform relational to XML, 25 a DTD element, 11 A a set of DTD attributes in a DTD, 11 acyclic , att a function that returns a attribute of an element, 12 16 C A Set of Correspondences, 7 choice-free , 11 ix D an XML database, 16 DTD Document Type Definition, 1 DTD group , 15 duplicate-free , 13 E a set of DTD elements in a DTD, 11 FD Functional Dependency, 1 IC Integrity Constraint, 1 IND Inclusion Dependency, I intersection , lab a function that returns the DTD element of an el 23 ement, 16 M a set of HepToX mapping rules, 19 proper circular , r the root DTD element of a DTD, 11 reduced , root the root element of an XML database, 16 split-freeness , 23 , 20 typed relational IND typed XML IND 22 24 ,21 13 unorderness , V a finite set of elements in an XML database, 16 x XML Extensible Markup Language, 1 xi Chapter 1 Introduction Extensible Markup Language (XML) has been the dominating data representation format on the Internet. It is self-describing and semi-structured; thus giving appli cations great flexibility to implement and interpret data. These properties make it ideal for data exchange, and it has been the dominating format for data exchange. However, we believe exchanging of data on the Internet does not utilize the full potential of XML. One way to further expand the power of data exchange is to examine the semantics of data in order to create better quality mappings which lead to better data exchange. As with most relational databases, there are integrity constraints (ICs) such as functional dependencies (FD5) and inclusion dependencies (INDs) that give the data more semantics than just tuples of values. There are already studies of ICs in XML such as [6, 10]. Therefore, the next step is to understand how these semantics can benefit data exchange. Although there are many different types of ICs, FDs and INDs are the most used ones. The two assist in storage, update, querying, and transformation of data. As a result, they are very heavily integrated into everyday applications nowadays. If applications are running on data that violates these constraints, they may behave erratically and return unsound and unexpected results. We demonstrate some of the challenges we face using the example in Fig ure 1.1. In Figure 1.1, there are two Document Type Definitions (DTDs), AUBCBookstore and lXUBCLibrary. A DTD is a schema which defines the syntax of XML databases. 1 / Figure 1.1: BookFD#1: BookFD#2: BookFD#3: BookFD#4: BookFD#5: BookFD#6: BookFD#7: BookFD#8: BookFD#9: 1UBCBookstore to tXUBCLibrary Example UBCBookstore.person:name —* DOB UBCBookstore.subject .book.author:name —> DOB UBCBookstore.subject.book:isbn —* title UBCBookstore.subject .book:isbn —* year UBCBookstore.subject .book:isbn — price UBCBookstore.subject .book:isbn —> #pages UBCBookstore.subject .book:isbn —* pub UBCBookstore.subject .book:isbn —* author.name UBCBookstore.subject .book:isbn —* author.DOB BookIND#1: UBCBookstore.subject .book.author[name, DOB] UBCBookstore.person [name, DOB] Table 1.1: c ZUBCBookstore In our example, the two DTDs are expressed graphically as DTD graphs (DTD trees in our example). Each graph contains nodes as DTD elements, and, for sim plicity, all leaf nodes (nodes without children) are DTD attributes. We have established a few correspondences, shown in arrows, from the at tributes in the source UBCBookstore to the attributes in the target UBCLibrary. These simple 1-1 correspondences We can also define ICs on a DTD. The FDs and INDs for UBCBookstore are listed in Table 1.1, and the ones for UBCLibrary are in Table 1.2. AUBCBooksrore keeps books under different subjects. BookFD#3-9 fl state that there is an isbn that implies all the other attributes within every book. In 2) UBCLibrary.person: name —* DOB UBCLibrary.person: name —* addr UBCLibrary.publication: call#.author.name —p call#.author.DOB UBCLibrary.publication: call#.subject, call#.author.name, call#.year —* title LibFD#5: UBCLibrary.publicat ion: call#.subject, call#.author.name , call#.year —* #pages LibFD#6: UBCLibrary.publicat ion: call#.subject, call# .author.name, call#.year —* pub LibFD#7: UBCLibrary.publisher: name —* addr LibFD#1: LibFD#2: LibFD#3: LibFD#4: LibIND#1: UBCLibrary.publication.call#.author[name, DOB] ç UBCLibrary.person[name, DOB] LibIND#2: UBCLibrary.publication [pub] UBCLibrary.publisher[name] c Table 1.2: EUBCLibrary addition, there is a main author listed under every book, and by BookIND#1 these authors are also listed independently as persons under the root node UBCBook , store. has a slightly flatter structure. Each Publication is listed under the root node UBCLibrary, and all its information is listed as attributes under its subtree. In a typical library, subject, author name, and year of a publica Meanwhile, the ZUBCLibra,y tion together determines the call number of a publication, which is a unique number that libraries use to identify publications. Similar to AUBCBooksjo,. above, there is a main author listed under each publication, and the same author is also listed as person under root. Similarly, each publisher listed as pub under publications is also listed under publisher under root. The two heterogeneous schemas have different hierarchies. For example, books in UBCBookstore are grouped by subjects, while subjects in UBCLibrary are listed under publications. There are also attributes that exist in one schema but not the other (e.g. publisher.addr in UBCLibrary). The two schemas also represent heterogeneous content. UBCBookstore keeps a record of books, but UBCLibrary records publications which subsume books. In addition, the retailer background of UBCBookstore is shown by the presence of 3 price which is absent in UBCLibrary for obvious reasons. Heterogeneity of the two schemas is not limited to the representation of the data. The two sets of constraints also show variations. In UBCBookstore, book.title and book.#pages are uniquely identified by book.isbn as specified by the FDs BookFD#3 and BookFD#6. However, UBCLibrary uses publication.subject, publication.authoiname, and publication.year to imply the values of publication.title and publication.#pages as specified in LibFD#4 and LibFD#5. Thus, problems arise if two books having the same subject, author name, and year but different #pages or titles in UBCBookstore are translated to instances of publication in UBCLibrary. The two translated publication instances will vio late the FDs LibFD#4 and LibFD#5. Suppose that UBCBookstore has stacks of unsold books, and decides to donate them to UBCLibrary. Along with the books, UBCBookstore also gives UBCLi brary an XML database DUBCBoo1ore containing information about the books it is donating. DUBCBOOkStOTe conforms to DTD UBCBookstore and its constraints. Upon receiving DUBCBookSjore, librarians at UBCLibrary want to integrate the data into its database which conforms to DTD UBCLibrary and its constraints. The librarians understand that the UBCBookstore data does not conform to its DTD, and it will be tedious work and error-prone to translate the data manually. In addition, there will probably be frequent donations from the UBCBookstore in the future. Therefore, an automated translation framework from data in UBCBookstore to UBCLibraiy will be appreciated. The librarians can create high-level correspondences from the UBCBookstore DTD to the UBCLibrary DTD. Moreover, using methods in [51 and [14], they can generate mapping rules from these correspondences. The mapping rules translate data in UBCBookstore to data that conforms to UBCLibrary DTD but not neces sary its constraints. The XML instance in Figure 1.2 conforms to the DTD ‘UBCBooksjore and its constraints. With the mapping rules in Figure 1.4 which are generated by algo rithms in [5], the librarians can translate data in Figure 1.2 to the data in Figure 1.3. The translated document in Figure 1.3 conforms to the DTD UBCLibrary, but it does not conform to the UBCLibrary constraints. It is obvious that pub of each 4 UBCBookstore person subect book book name /\ name /\ DOB 19490101 isbn year title pt-ice #pages author pub isbn year title price #pages author pub 320 ,i/ \\I9mPmt 550 ,,/ \\FtmPm 002 2001 DB 101 70 (701 2(01 Sorviv, 08 20 name DOB name Sm,th 19400101 Smith DOB 19490101 Figure 1.2: UBCBookstore Instance Example UBCLibrary publication publication call# ___ flrvtvD8 subject author 08 name Smith \ call# title #pages pub 540 FtPmm year 2001 title #pages pub 320 person /\ name DOB Smith 19490101 ___._._rS...DBI1 subject author 08 // DOB name 19490101 Smith \ year 2001 DOE 19490101 Figure 1.3: Translated UBCLibrary Instance Example instance of publication is not listed under publisher name as LibIND#2 of UB CLibraiy requires. Moreover, the two publications have the same subject, au thor name, and year, but different title and #pages. This violates the FDs LibFD#4 and LibFD#5 of UBCLibrary. Applications in the library use subject, author name, and year to identify pub lication. title, but given data in Figure 1.3 these applications will no longer behave correctly. Clearly, something needs to be done to resolve this problem. Fixing constraint violation after data is translated using existing constraint violation repair technique such as [3, 4] can be undesirable because UBCLibrary may not want to alter data. Correspondences are usually created by data administrators who know both the source and target schemas relatively well. If we can detect constraint violations at the time of correspondence establishment, and provide suggestions on how to repair the correspondences, such administrators are given more information and arsenals to create better quality correspondences. This thesis aims to detect such violations and suggests ways to repair them. 5 1. T.JBCLibrary—*fl($UBCBookstore)[ person—+f2($name,$DOB)[ name—*$name, DOB—*$DOB]] UBCBookstore—$UBCBookstore[ person—*$person[ name—*$name, DOB—>$DOB]] 2. UBCLibrary—f1 ($UBCBookstore)[ publication—*f3($sname,$aname,$DOB,$title,$year,$#pages,$pub) [ call#.subject—*$sname, call#.author.name—÷$aname, call#.author.DOB—*$DOB, call#.year—*$year, title—*$title, #pages—.$#pages 1] UBCBookstore—*$UBCBookstore[ subject—$subject[ name—*$sname, book—*$book[ title—*$title, #pages—$#pages, author.name—*$aname, author.DOB—*$DOB, pub—*$pub]]] Figure 1.4: Mapping Rules M for UBCBookstore--UBCLibrary Example 6 Meanwhile, every algorithm has to be efficient in order for our framework to be practical in the real world. Problem Statement Given a set of constraints Ys over source DTD As, a set of constraints -T over target DTD AT, and a set of user-defined correspondences C, determine if V XML 5 databases D = , after translation via C, the translated instances satisfy 5 s over A In addition, if violations exist, how do we fix them? Related Work and Challenges In order to implement an XML constraint translation system, an essential build ing block is solving constraint implication efficiently. However, as shown in [13] and [8], implication of relational FD and IND considered together is undecidable. Therefore, it is challenging to find an efficient XML constraint implication system. [101 shows interaction of XML constraints with DTDs, it finds keys and foreign keys consistency with DTDs undecidable. Thus taking DTD interactions out of the equation is essential to a tractable reasoning with XML constraints. We simplify DTDs in this thesis aiming to avoid such interactions. [9] propagates XML keys to FDs on a predefined relational database via a transformation. It shows XML key and foreign key propagation is undecidable; therefore, it only focuses on XML keys. It further emphasizes the difficulties in the problem of XML constraint propagation. [14] generates join satisfying mapping rules from user defined correspondences if the two schemas have identical representation of entities. It considers join condi tions on both the source and target schemas such that the mapping rules generated satisfy target join constraints. However, both keys and joins in the target can be violated if suspicious correspondences are present. They can be mistakes of the user who defines the correspondences, or they may be unavoidable due to different representations of entities in of the two schemas. The absence of correspondences creates ‘holes’ in the target. These missing values or NULL values further complicate the propagation problem. In addition, 7 inability to propagate does not necessary imply constraint violations, and these missing or NULL values play a big part in determining violations. Finally, once violations are detected, there are different ways to fix them. Each has its advantages and disadvantages; the combinations of these repairs give users choices to fix violations under different circumstances. Existing constraint viola tion repair techniques such as [3],[4] treat the translated data as one inconsistent database, and either delete, add modify tuples to make data adhere to the target constraints. In trying to make data conform to the target constraints, modification of translated data without knowing anything about the originating data may sac rifice the correctness. In contrast, our solution of correcting the correspondences offers a different approach to preserve the coherence of data with the intelligence of the originating data structure together with its semantics. Contributions 1. There currently does not exist an implication testing system for XML FDs and INDs considered together; therefore, we reduce XML constraint impli cation to relational constraint implication so that we can employ tools in well studied relational constraint reasoning. 2. We develop a polynomial algorithm to test XML constraint implication. 3. We develop a polynomial algorithm to test XML constraint propagation. 4. We develop a polynomial algorithm to detect XML constraint violation. 5. We suggest ways to repair mappings should a violation occur. Organization We present the rest of the document as follows. Chapter 2 lays the foundation for the later chapters by specifying definitions. Chapter 3 describes the reduc tion of XML constraint implication to relation constraint implication, plus how to make the implication problem tractable. We develop the constraint propaga tion algorithm in Chapter 4, and show that it runs in polynomial time. Chapter 5 builds constraint violation detection algorithm on top of the propagation algorithm. 8 Chapter 6 suggests ways of repairing mappings such that constraints will not be vi olated. Chapter 7 concludes our work and offers a glimpse of how this work can be extended. 9 Chapter 2 Definitions DTD definition 2.1 Majority of this definition is adopted from [2]. Definition 2.1.1 A DTD (Document Type Definition) is defined to be: A = (E,A, 4), fl, r), where: • E is a finite set of element types. • A is afinite set of (non-set) attributes. • 4) is a mapping from E to element type definitions: Given r E E, 4)Qr) = where S is a string type, or 4)Qr) is a regular expression a defined asfollows: Ia+ where e is the empty sequence, t’ E, ‘‘ represents union, ‘,‘ represents represents represents zero or one occurrence of cc, represents one or zero or more occurrences of a (Kleene closure), and concatenation, ‘‘ ‘?‘ ‘‘ more occurrences of a. • 11 is a mapping from E to the powerset ofA. If @1 definedfor t 10 E flQr), we say that @1 is e • r E and is called the element type of the root. Without loss of generality we assume that r does not occur in 4(t) for any t e E. For example, we can present ZUBCBoojore in Figure 1.1 in the form (E ,A, , H, r), where: • E = {UBCBookstore, subject, name, book, author, person} • A = {name, isbn, title, year, price, #pages, DOB, pub} • = {UBCBookstore —* (subject*,person*),subject —* (book*),book (author) } • H = {subject —* (name),book —f (isbn, title,year, price, #pages, pub) ,author (name,DOB),person —+ (name,DOB)} • r = UBCBookstore 2.2 Simple DTD definition In order for the algorithms in the rest of this thesis to be efficient, we simplify our DTD definition. As shown in [101, a DTD interacts with XML constraints to produce inconsis tent specifications. That is there does not exist an XML database that satisfies an inconsistent set of DTD and XML constraints. On the contrary, a consistent set of DTD and XML constraints is one that has at least one XML database that satisfies it. To make matters worse, [121 shows that a DTD by itself may be inconsistent too. We wish to guarantee no interaction between DTDs and XML constraints. However, a full-blown study of this topic is beyond the scope of our research. Therefore, we specify choice-free and acyclic DTDs in the following section; they together guarantee DTDs are by themselves consistent as proven in [12]. In addition, [10] shows that cardinality constraints of DTDs interact with XML keys and foreign keys. Therefore, simplifying cardinality constraints should elim inate or reduce interactions between DTDs and sets of XML FDs and 1NDs. We achieve this by imposing the properties duplicate-free and unorderness on DTDs. 11 2.2.1 Choice-free [12] proves choice-free is one of the properties required for a consistent DTD A (E,A, 4), fl, r). We redefine this property under our context. = Choice-free is the elimination of choices (options) of child DTD elements. To eliminate the ‘choice’ of child nodes, we redefine the regular expression allowed for 4)Qr) in our DTD definition A = (E,A,4),fl,r), where t E E. The original regular expression a of 4)Qr) is: a::=eIr!IaIata,aIa?Ia*Ia+S The ‘‘ (union) operator gives 4)Qr) the ‘choice’ property. Eliminating (2.1) ‘‘ from a: a::=cItIIa,aIcx?Ia*Ia+ (2.2) makes our new DTD definition satisfy the choice-free property. 2.2.2 Acyclic [12] shows that a DTD without cycles is another requirement for a consistent DTD. A DTD A = (E,A, 4), fl, r) is cyclic if A contains a DTD Element r e E such that t appears as a descendent of itself. The DTD graph of A would contain a cycle with r being one of the nodes of the cycle. Another benefit of having acyclic DTDs is the elimination of infinite paths. With a DTD containing cycles, it is possible to have a path of elements starting from the root that loops a cycle forever. The elimination of these infinite paths is important because we will later use fully specified paths to define our constraints. Having cycles in DTDs may allow constraints with infinite depth, and infinite num ber of constraints. Both are very undesirable properties, and will make reasoning with such constraints intractable. To incorporate the acyclic property into our DTD language definition, we need to add an extra statement to the definition of the regular expression language for 4)Qr) in (2.2), where ‘r e E. The new definition is: 12 a::=eIt!Ia,ala?Ia*Ia+ where 2.2.3 t’ ‘ (2.3) ancestor(t), and r’ Duplicate-free We define a new property duplicate-free, that aims to simplify cardinality con straints that can be implied by DTDs. In a DTD definition z = (E,A,4,fl,r), 4’Qr) allows a child z’ e E to appear multiple times; therefore, z’ can have minimum and maximum occurrence con straints of any integers. For example, if CP(t) = ‘U’, ‘r’, ‘r’, ‘r’?, t’?, ‘U”?, then ‘r’ has to appear at least three times under ‘r, and can appear a maximum of six times. In [10], it shows these cardinality constraints can interact with XML constraints to produce inconsistent specifications; therefore, we introduce duplicate-free to simplify cardinality constraints. The idea is simple; we do not allow a child to appear multiple times in a c(’) regular expression. Together with the unorderness property in Section 2.2.4, we will define a new regular expression language for c1 in our DTD definition. The duplicate-free property only allows child DTD elements to have minimum occurrences of either 0 or 1, and maximum occurrences of either 1 or oo• It greatly simplifies the complexity of constraint reasoning. 2.2.4 Unorderness Subelements in XML are ordered. For example: , and c1 (Rooti) = BC, 1 for an root element Root 1 =<A><B/>.<C/></A> T ) 2 , and ‘1’(Root 2 for another root element Root = CB, <A ><C/><B/></A> T = 2 . However, current 2 T 1 2 differ in their order of elements; thus T 1 and T T constraints view data as collections of data in which order does not matter. We aim to employ current constraint reasoning techniques; hence, for the rest of this thesis, we will ignore the order of elements. Thus, we achieve unorderness. 13 Together with the duplicate-free property, we redefine the regular expression language for (r) in our DTD definition z = (E,A,4,H,r), where z e B. The regular expression a of (‘r) in (2.3) is: a::=eI.rIa,aIa?Ia*Ia+ where r’ ancestor(r), and (2.4) ‘ Now that we can ignore order, and have to honour the duplicate-free property, we can simplify the regular expression language to a list of unique DTD elements, each is followed by a quantifier: a ::= c i [ii? 1*1+1 . . . , ;[1 whereVi,je [1. ..n}, andi j, r I? I*+] t t, (2.5) 1 gancestorQr). andt,r The quantifier ‘1’ behaves exactly as if there were no quantifiers. It states that the child DTD element has both minimum and maximum occurrence of 1. In Section 3.1.2, we define XML database equivalence that further elaborates on how we ignore order. 2.3 DTD Group We divide the set of DTD elements into groups to give the XJvIL constraints a plat form to apply their logic. This is the same as ‘group’ in [5], and analogous to logical relations in [14]. We are going to define single path in the next section, then use it as a building block to define DTD group in the section that follows. 2.3.1 Single Path {E,A, 1, fl, r} be a DTD defined in Definition 2.1.1. An edge is a relation from a parent DTD element e E to a child DTD element defined in 4), or it is a relation from ‘r to an attribute defined in fl(t). Such edges are single if it is from Let = an element to an attribute (because we don’t allow set attributes), or if it is between 14 two elements with the cardinality constraint of ‘?‘ or ‘1’. On the contrary, edges or ‘+‘ are multiple edges. between elements with the cardinality constraint of ‘‘ Edges link together to form paths, and a single path (SP) is a path that consists of single edges only. In addition, a single path always ends in a DTD leaf text node or a DTD attribute. SP is expressed as a fully specified relative XPath between an ancestor and a descendant. 2.3.2 DTD Group Definition 2.3.1 A DTD group is a group of DTD elements and attributes that are all connected by single paths. A DTD group guarantees one-to-one relations within a branch of elements and attributes in an XML database that have their DTD elements and DTD attributes within the same group. Such a branch of elements and attributes is analogous to a relational tuple. UBCBookstore 7* person subject name , book /\ name DOB isbn title year price #pages author pub /\ name Figure 2.1: DOB ‘UBCBookstore For the rest of this thesis, we will use G to denote a DTD Group. And we label each group with the path name of the highest node t, where cP(parentQr)) contains 15 either r* or t÷. We also conveniently call Figure 2.1 shows the 4 DTD groups in the DTD Group Node of G. UBCBoofore For example, subject and book do not belong to the same DTD group because the edge between subject and book is not a single path; it has one of the quantifiers that allow multiple occur rences of books in each subject. On the contrary, there is a single path between book and DOB; therefore, they are in the same DTD group. 2.4 XML Database We adopt the following definition of XML databases from [2]. An XML database D conforming to a DTD z (E,A, c, H, r) is defined to be a tree (V,lab,ele,att,root), where • V is a finite set of vertices (nodes). • lab: V — E. • ele: V_*StringUV*. • att is a partial function V x A —* String. For each v att(v,l) is defined} is required to be finite. • root 2.4.1 e V, the set {l e A I e V is called the root of d. Simplifying text nodes to attributes Since all the leaf nodes are either DTD elements containing text or DTD attributes, for simplicity without loss of generality, we assume all leaf nodes are DTD at tributes for the rest of this document. 2.4.2 Notations For any element x E V in D labelled by E E, lab(x) = t, and any path P origi nating from ‘r, x.P returns all elements that are descendents of x and are reachable from x via the path P if P ends in a DTD element, it returns string values that are values of attributes reachable from x via the path P if P ends in a DTD attribute. For example, let’s call the only subject element in Figure 1.2 x. Then, x.book.price returns “120”,”70”. 16 Missing Values and NULL Values 2.5 For the rest of this thesis we treat missing values as if they are NULL values. NULL values are known to be incomparable; that is they do not equate to any values, not even to other NULL values. NULL values are going to play a big role in how we treat FDs and INDs. XML constraint definitions 2.6 Functional dependency 2.6.1 An XML functional dependency, where t e FD has the form: ,. 1 E, and ‘r is a DTD group node. In addition, SI’ paths originating from t . . are single and ending in attributes. An XML database D=(V, lab, ele, att, root) satisfies denoted: DI=opDiff Vx,y e V,lab(x) A iEl = lab(y) (x.SF =y.SP) —*x.SP = y.SP Note that we cannot equate NULL values to any other values; thus NULL existence on the L.H.S. guarantees FD satisfaction. In [9], FDs are defined in participation of them turning into keys. Such keys are then used to normalize relational schemas. Since keys must be able to identify tuples, imcomplete keys are unacceptable. Therefore, [9] requires an instance to have NULL value for the R.H.S. of an FD if there exists NULL value among the values of the L.H.S. attributes. However, unlike [9], we want to reason only with the most fundamental form of ICs only, FDs and INDs. Therefore, we can concentrate on these two types of ICs only when we consider NULL values. FDs and 1NDs reason with leaf values only, but not node identities as in keys. Consequently, an instance with NULL values on the L.H.S. of an FD guarantees satisfaction. On the other hand, if the 17 valuation of the R.H.S. in an instance is NULL, we treat it as a unique value, and only require its valuation of the L.H.S. to differ from all other instances. Inclusion dependency 2.6.2 An XML inclusion dependency, aIND has the form: [SP , i,. .,SP ,, ç 2 1 ..,SP ,nl 2 ,i,. 2 1 ri[SP . where 2 ,t 1 ‘r e , and 1 ,.. .,SP 11 E, and 2 ,’r are DTD group nodes. In addition, SI’ 1 ‘r 2 respectively. . ,SP SP , 1 , 2 ,, are single paths in rj and ‘r 2 . . An XML database D=(V, lab, ele, att, root) satisfies D 1 aIND aIND, denoted: if Vx e V,lab(x) = ti,y A , x.SP , 1 e V,lab(y) = = ,s.t. 2 ‘r 1 , 2 y.SP iEl,...,n Note that missing values or NULL values are not allowed on either side of an IND because such values are incomparable; thus would violate the IND. 2.7 Schema Mapping from HepToX In order to translate data from a source DTD s t to a target DTD ‘IT, we need a mapping language to define how to translate. Both HepToX [5] and Clio [14] can derive schema mappings from simple correspondences drawn as arrows from attributes of source schema to those in the target. From correspondences, [141 generates inter-schema referential constraints in a form similar to SQL statements. Although, such constraints definite how data translate, they are too technical and thus are difficult to analyse. On the other hand, users in [5] create correspondences between source and target leaf nodes. These correspondences are used to generate free-log mapping rules that define how data from the source can be translated to the target. These mapping rules are in the form of tree expressions, as such they are more suitable for analysis for the rest of this document. 18 However, our mapping model differs slightly from [5] and [14] in that we do not employ the ID/IDREF function of DTD. In other words, we do not have key and foreign key constraints in our model. histead we deal with the more fundamental constraints: FD and ND. Thus there are no joins among DTD groups, and our mapping rules only translate instances along a DTD tree path (attributes along a path from the root to the group node in question). We give a brief outline here on how simple correspondences in our running example in Figure 1.1 become a set of mapping rules M in Figure 1.4 following algorithms in [5]. The target DTD UBCLibrary is broken into 3 DTD groups, the groups can be seen as logical relations in which attributes (plus leaf text nodes) together describe an entity. The 3 DTD groups are publication, person, and publisher, and they each become the target of a mapping rule except for publisher because none of its attributes are mapped. On the other hand, the source schema is broken into 3 groups: subject, book, and person. In each mapping rule, if the target group has any correspondences that originates from a particular source group, then the source group is included in the rule body of the rule along with its ancestors. Each mapped attribute in the rule body is bound to a unique variable, and its correspondence in the target is assigned that variable. This has the same function as in SQL where tables are bound to some variables in the FROM clause, then some of the variables appear in the SELECT clause to create view instances by retrieving data from the tables in which the variables are bound. In mapping rule #1, every person’s name and DOB from the source is put into name and DOB in person in the target. The rule is very straight forward. However, notice the target root is assigned a Skolem function fl($UBCBookstore), and $UBCBookstore is the binding variable of the root element in the source; thus translated instances will always be stitched together and have only one root. The function of our Skolem function is to assign non-leaf nodes an identity. Thus two non-leaf nodes with the same identity (same Skolem function with the same arguments), will be merged into one node. Mapping rule #2 takes book instances from the source, and put them into pub lication at the target. The rule also acquires the name of subject which is in the 19 parent group of publication, and puts that into attribute subject under publication in the target instance. 2.8 Relational Constraint Definitions In this section, we quickly define the relational constraints: FD and IND; [1] pro vides a more comprehensive explanation of FD and IND. We will be using rela tional FD and IND implication to solve the XML FD and IND implication problem; therefore, we will introduce them in the following sections. In addition, in order to solve our XML constraint implication problem effi ciently, we use typed relational IND which can be solved in polynomial time. We will also define typed relational IND in this section. 2.8.1 Relational Functional Dependency ,.. 1 Let R be a relational schema, and A FD be n+l attributes in R. A relational . has the form: R.Ai...A—*A A relational database DR conforming to R satisfies crR if V tuples A (ti .A = A) .t 2 —* 1 t .A = t 1,2 t in table R of A t . 2 iEl ,...,n 2.8.2 Relational Inclusion Dependency ,... ,A be n at 1 R be two relation schemas in a database schema R, A 2 , 1 Let R . A relational IND 0 2 ,... ,B be n attributes in R 1 , and B 1 R has the tributes in R form: B A ..B ..ACR . R . 1 . 2 . A relational database DR conforming to R satisfies 0 R if V tuple 2 of DR such that: 2 in table R DR there exists a tuple t A B .A=t 1 (t . 2 ) iEl,...,n 20 1 t 1 of in table R Typed Relational IND 2.8.3 A more comprehensive study of Typed Relational IND can be found in [7]. We give a quick definition and description of Typed Relational IND here. ,.. ,A, 1 2 be two relation schemas in a database schema R, A 1 ,R Definition 2.8.1 Let R . A typed relational IND 2 ,... ,B be n attributes in R 1 , and B 1 be n attributes in R . R has the same form as a typical relation IND: B A ..B ..ACR . R . 1 . 2 . The “typed” property requires Vi E [1, . . . , 1 n] A = . 1 B Moreover, any two DTD attributes A ,A within each relation schema R’ eR must not be of the same attribute type. A relational database DR conforming to R satisfies typical relational 1ND. aR just as it would for a In more details, there exists a pool of unique attributes U from which each 2 gets a subset of attribute. As a result, attributes of a relation relation schema R 1,R 1 is an attribute , where A 1 schema are unique within the schema. Moreover, if A=B , then A 2 1 and B are the same attribute in U. 1 and B of R 1 is an attribute of R 2.9 Typed XML Inclusion Dependency We model each DTD group as an entity, same as each relation schema represents an entity. In order to integrate the typed property, each DTD attribute appears at most once in a DTD group. Moreover, pairs of coffesponding attributes on the L.H.S. and R.H.S. of an XML IND must be of the same DTD attribute type. We now give a formal definition on how to integrate the typed property into a DTD A. Definition 2.9.1 Let A = (E,A,4,fl,r), Vtr € E, if DTD group G, then fl(t) flfl(t) = t,tj 0. Definition 2.9.2 Let o be an XML IND in the form: i, [SP , 1 . . . ,.. 1 , 2 ,SPi,] ç t[SP 21 . ] ,SP , 2 belong to the same where t 2 e E, r 1 2 ir ,...,SP are 1 SP , are DTD group nodes, SPi,l,...,SPl, and 2 1 and T2 respectively. single paths in r A is typed if 0• , 1 (‘ri.sP Guaranteeing No Interaction of relational FD and IND Implication 2.10 In order to find an efficient XML constraint implication system, we first look at the implication of the simpler relational constraints. As mentioned earlier, implication of FDs and INDs together is undecidable [8] [13]. The complication of the implication comes from the interaction between the two kinds of constraints. Thus if we can avoid the interaction of FDs and INDs, it will be a big step towards making the implication problem tractable. [11] proves it is possible to prevent such interaction albeit imposing some re strictions on the FDs and INDs. We quote each condition as defined in [11] in the following sections, and give each a short description. Finally, in Section 2.10.5, we present the theorem from [11] that guarantees non-interaction. 2.10.1 Proper Circular Definition 2.10.1 A set of INDs I over relational schemas R is circular if either there exists a nontrivial IND R[X] C R[YJ e I, where X,Y are arrays of attributes . .,R, e R, with m> 1, such ,R 1 R , in R, or there exists m distinct relation schemas, 2 [Y R ] Y C3 X R ] [ j ,R [X ç 2 R ] Rm[Xm] C Ri[Yi], that I contains the INDs: 1 . . A set of INDs I is noncircular if it is not 1 ,Y 1 1 are sets of attributes in R where X circular Definition 2.10.2 I is proper circular if it is either noncircular or whenever there . .,Rm e R, with m> 1, such that I contains ,R 1 R , exit m distinct relation schemas, 2 Y 3 X R j [ J ,R the INDs: R [XiJ ç 2 1 Y Ri[YiJ, then for all i E [ R 1 ,...,Rm[Xml { 1,2,. .,m} we have X = Yj; such a sequence of INDs is called a proper cycle. . c c . 22 Intuitively, I is circular if a subset of its INDs forms a circle in a graph (l<E), ] R[Yj is an edge 1 where each relation R is a vertex 1/, in V and each IND R,[X . I satisfies proper circular if I is not circular, or if a cycle exists, 1 1 to V from V at each vertex, the sequences of attributes X, Yj connecting the two edges are two identical sequences of attributes. 2.10.2 Split-freeness B, where R is a relation Two nontrivial FDs of the forms R:XB —* A and R:YA schema, X, Y are sets of attributes of R, and A,B are two attributes of R, are said to — be cyclic. A closure of a set of FDs F, denoted F+, is a set of FDs that contains every FD a where F = a. Definition 2.10.3 A set of FDs F over R satisfies the split-freeness property if VFj :Y 3 e I whenever there exist cyclic FDs, R:XB —* A, YA — B e F, then either R —* B e or R:(XflY)A — B e Fj. Roughly speaking, a set of FDs satisfies split-freeness if it does not imply two FDs where R.H.S. of each of these two FDs is contained in the L.H.S. of the other FD. 2.10.3 Intersection Two nontrivial FDs of the forms R:X — A and R:Y —* A, where R is a relation schema, X, Y are sets of attributes of R, and A,B are two attributes of R, are said to be incomparable if X and Y are incomparable. Two sets are incomparable if either is a subset of the other. 1E Definition 2.10.4 A set of FDs F over R satisfies the intersection property if VF E VA e R, whenever there exist incomparable FDs, R:X—*A,R: Y—*A e P7, then R:XflY—*A E P7. Intuitively, if a set of FDs satisfies the intersection property if it does not imply two non-trivial FDs that imply the same attribute. 23 2.10.4 Reduced set of FDs and INDs Definition 2.10.5 The projection of a set of FDs I over relation schema R onto a :W 1 :W —* Z I R 1 , denoted by P[YJ, is given by Fj[Y] = {R 1 set of attributes Y C R —ZFandWZC Y}. 1 Definition 2.10.6 A set of attributes Y ç R is said to be reduced with respect to R 1 is understood and a set of FDs I over R (or simply reduced with respect to Fj ifR from context) if i9[Y] contains only trivial FDs. A set of FDs and INDs = F U I [X] 1 is said to be reduced if VR . 1 [Y] E L Y is reduced with respect to F 3 R We illustrate the reduced property with an example. Our example consists of 1= two relation schemas author[name,DOB] and person[name,DOB], an ND a author[name,DOB] C person[name,DOBI, and an FD aF person : name —* DOB. guarantees that any tuples in author must have an equivalent tuple in person. aF guarantees names imply DOBs in such tuples in person. This in turn guarantees 1 a names imply DOBs in all tuples in author. 2.10.5 Theorem that Guarantees Non-Interaction between FD and IND Taken from [11], the following theorem states the conditions needed to prevent interaction: Theorem 2.10.7 If I (a set of INDs) is proper circulai F (a set of FD5) satisfies either the intersection property or the split-freeness property and reduced then F and I do not interact. 24 = (F UI) is Chapter 3 Efficient XML Constraint Implication In order to achieve efficient XML FD and IND implication, we first reduce XML FD and IND implication to relational FD and IND implication in Section 3.1. How ever, Chandra et al. [8] proved relational FD and IND implication is undecidable. Therefore, in Section 3.2, we introduce conditions on XML FDs and 1NDs that ensure the transformed relational FDs and INDs do not interact. 3.1 Reduction of XML constraint implication to relational constraint implication Relational constraint reasoning is a well researched area. Being able to reduce XML constraints to their relational counterparts can resolve a lot of uncertainties involving XML constraints. In this section, we aim to reduce the XML FD and 1ND implication problem to the relational FD and IND implication problem. We specify the problem of XML FD and 1ND implication here: be a set of FDs and INDs on z, and aA Problem Statement Let iX be a DTD, be an FD or ND on . Determine if E = a. We first construct a schema transformation function 25 schema t !- that takes i and outputs a set of relation schemas R, denoted: R = !.Lschema(1) In Section 3.1.3, we will define and explain this schema transformation function. and Next we construct a constraint transformation function Lco,(raj,g takes produces R 0 that conforms to R, denoted: I-Lconstraint (o) We will define and explain this constraint transformation function in more detail in Section 3.1.4. When given a set of XML constraints {ILconstraim(02,) I.1constraints(Y.) cons:raints t I- repeatedly calls on z, we define: consrraint t I- O E on each constraint in , and outputs the set of relational constraints. We also develop a data transformation function I data that takes a DTD z and 1 an XML database D that conforms to t, and output a relational database DR that conforms to R = I schema (s), denoted: 1 DR = !1data(1,DDe1ta) iLdata will be used in our proof that XML constraint implication can be reduced to relational constraint implication. We will construct Udata in Section 3.1.5. For simplicity, we overload the .t operator so that J.L is ILschema when the ar gument is a DTD, Iconstrainis if 1 constraint if the argument is an XML FD or IND, I 1 the argument is a set of XML constraints, and data 1 I if the argument is a DTD and XML database. We then prove that XML constraint implication can be reduced to relational constraint implication in Section 3.1.7. After proving the reduction is correct, we then present an algorithm to solve XML constraint implication with relational con straint implication in Section 3.1.8. 26 3.1.1 Assumptions 1. DTDs are simple DTDs as defined in Section 2.2: choice-free, duplicatefree, and acyclic. 2. For simplicity, we assume for all DTDs z including root r has no attributes. = (E,A, 1, fl, r) the DTD group 3. All the XML ENDs are typed. 4. Since all XML 1NDs are typed, Definition 2.9.1 guarantees all attributes within a DTD group are unique. Therefore, for simplicity, we can assume without loss of generality that each DTD group has only one element the - group node. 5. All the leaf nodes are attributes. 3.1.2 XML Database Equivalence () To simplify the XML model for constraint reasoning, we ignore order of ele ments. Thus we need a database equivalence operator () that determines two Ai,1 ,fli,ri) (E , 1 = (Vi,labi,elei,atti,rooti) on DTD z = 1 XML databases D ,fll , ri) to be equiv 1 1 ,4 ,lab ele2,att2,root2) on DTD A = (Ei,A (V , and 132 = 2 alent if their content is the same but their order of elements may be different. Here we define XML database equivalence (): y if: ;x 2 1 ,y e V Two XML elements x E V 1. node-name(x) = node-name(y) y)),s.t. lab atti(x,l) =att2(y,l) ( 2 fl (labi(x)), l e ( 1 2. Vl e IT (labi(x)),s.t. att2(y,l) 1 lab 1 E fl (y)), 2 11 3. V1 e ( 4. Vx’ E child(x), y’ e child(y) s.t. x’ 5. Vy’ E child(y), SIx’ e child(x) s.t. y’ 27 y’ atti(x,l) UBCBookstore peroon name /\ isbn year title price #pages author pub 902 200)1 06101 70 name book book 320 ,,/ \mPm name Smith \lh DUB 19490101 isbn year title price #paes author pub 55i1 (I 2001 SurviveflB 20 DOB 19490101 ,/ \moPr name DUB Smith 19490)01 Figure 3.1: Equivalent UBCBookstore Instance Example 1 D , if rooti 2 D root2. For example, the XML database in Figure 1.2 and the one in Figure 3.1 are equivalent. They differ by the order of the two books. The two databases would both satisfy or both not satisfy any set of FDs and 1NDs. XML FDs and INDs do not put constraints on the order of data; therefore, we can derive the following Lemma: , if D 2 1 and D 1 Lemma 3.1.1 For any Iwo XML databases D 1 FDsorINDsa,D Proof The , then V XML 2 D r-:ta. iffD f 2 operator is symmetric; therefore, we only need to prove one direction of ‘if and only if’. We will prove by contradiction. 1 For the proof on FDs, let the constraint a be an FD. Assume D a but a. D 1 must contain two elements x,y with the same L.H.S. attributes of a, , D 1 D 2 must contain 2 but they have different R.H.S. attribute of a. Since D two elements x’,y’ s.t. x’ x and y’ y. As a result, .x’ contains the same set of 2 D = attributes as x, and y’ contains the same set of attributes as y. x’ and y’ would cause 2 to violate a. This violates the hypothesis; thus Lemma 3.1.1 holds for FDs. D a but For the proof on INDs, let the constraint a be an IND. Assume D 1 1 must contain an element x whose attributes fit the L.H.S. of a, but D 2 j= a. D D must not contain an element y that satisfies R.H.S. of a with respect to x. Since a, D 2 must contain y’ s.t. y’ 1 1 must contain ,D 2 D satisfies R.H.S. of a with respect to x’. Again because D an element that is equivalent to y’, such an element would satisfy R.H.S. of a with respect to x. This violates the hypothesis; thus Lemma 3.1.1 holds for INDs too. 1 D , 132 must contain x’ s.t. x’ 2 D 2 x. Since D 28 = I 3.1.3 Schema Transformation Function — j.L(A) The XML schema transformation function ji(A) in Figure 3.2 is designed to trans late a DTD A into a set of relation schemas R that retains the DTD group structure of A. We Input: DTDA= (E,A,,fl,r) Output: A set of Relation Schemas R 1. R={} 1 e E, G, r 2. For each group G create a new relation schema R 3. create a new field for each attribute in G, and 4. label it with the SF of the attribute create a new field ‘id’ 5. create a new field ‘pid’ 6. } 1 R=RU{R 7. Figure 3.2: Schema Transformation Algorithm — 11(A) 1 because we see 1 for each DTD group G In Line 3, we create a new relation R each DTD group as a schema describing an entity. All the attributes within a DTD group are the properties of the entity that this group is describing; therefore, we . We then create the fields ‘id’ and 1 create a new field for each attribute within G ‘pid’ to retain the parent-child relation among DTD groups. Such relations will be expressed as 1NDs as transformation generated constraints. Running this algorithm against our example DTD we will generate three relation schemas: AUBCBOOkSjQTe in Figure 1.1, subject[id,pid,name] book[id,pid,isbn,title,year,price,#pages,author.name,author.DOB,pub] person[id,pid,name,DOB] They correspond to the three DTD groups in AUBCBooksjore. Every attribute in a DTD group becomes a field in the corresponding relation; in addition, ‘id’ and ‘pid’ are present in every relation to retain the parent-child relation of DTD groups. 29 Transformation Generated Constraints — During schema transformation, parent-child relations can be retained by generating an JND for each child DTD group whose ‘pid’ is subsumed by its parent’s ‘id’. In addition, ‘id’ is the node identifying attribute; thus it implies all the other attributes within a DTD group. Input: DTDA= (E,A,,fl,r) Output: A set of 1NDs P() 1. ‘I’(A)={} 2. For each DTD Group G e E )& 1 ifparent(G ø 4 3. 4. ‘P(A) = W(z) U {R[pid] parent(R)[id]} ) 1 For each DTD Attribute A e 11(G 5. :id—*A} 1 ‘P(z)=’P(z)U{R 6. c Figure 3.3: Creating Transformation Generated Constraints Algorithm We can apply ‘P to our example DTD UBCBooksjore — in Figure 1.1, and will result in the transformation generated constraints in Table 3.1. The field ‘id’ in each relation implies every other attribute within the relation. In addition, book is a child group of subject, and we have the IND book[pid] c subject [idl that captures the parent-child relation. 3.1.4 Constraint Transformation Function — takes an XML constraint of either FD FA, or IND I, and translates it to relational constraint FR (Figure 3.4) or ‘R (Figure 3.5) respectively: For example. the FD BookFD#1 on DTD group person in Table 1.1 is: Constraint transformation function t (ox) UBCBookstore.person: name 30 —* DOB person: id —* name person: id —* DOB person : id — pid subject : id — name subject : id —p pid book: id —* isbn book: id — title book: id —* year book: id — price book: id —k #pages book: id —* author.name book: id —> author.DOB book: id —* pub book: id —* pid book[pid} C subject [id] Table 3.1: Given an XML FD F: Gk : SPk,1, . . , ‘J.’(IXUBCBoojore) SP,,fl] — SPk,, Eu(F)=Rk:SPk,1,...,SPk,fl—*SPk,U Figure 3.4: FD transformation Algorithm — i (FD) u (BookFD#1) is on the transformed relation person: person : name — DOB For another example, the IND BookIND#1 on DTD groups book and person in Table 1.1 is: UBCBookstore.subject.book.author[name, DOB] Given an XML IND I: Gk[SPk,1,. , SPk,fl] c Ri[SP , SPi,n] , , SPk,fl] ç 1 = Rk[SPk,1, . . . Figure 3.5: 1ND transformation Algorithm 31 , — t (IND) Input: DTD A = (E,A,’F,fl,r), and a set of XML constraints Output: A set of FDs + INDs u (E,) on A 1. i’(){} 2. For each a E ZA 3. (ZL(ZA)UI1(o) Figure 3.6: Constraints Transformation Algorithm—si (Z) c UBCBookstore.person [name, DOB] t (BookIND#1) is on the transformed relations book and person: person [name, DOB] book[author.name, author.DOB} The constraint transformation function u defined in this section takes one XML constraint a at a time. We are going to overload the function i so that it takes a set of XML constraints E as argument. The function i(E) calls i(a) for each a e Z, and returns the results as a set. In Figure 3.6, we also define j.i(Z,) that transforms a set of constraints on A. It transforms each øA e Z by iL(a) in Figure 3.4 and Figure 3.5, then returns the set of resulting constraints. If we apply this constraints transformation function to FDs and INDs in UBC Bookstore Table 1.1, it will result in the set of constraints in Table 3.2. 3.1.5 Data Transformation Functions — u(DA), r’(DR) As shown in Figure 3.7, the data transformation function i takes an XML database D, that conforms to DTD A and produces a relational database DR that conforms to the transformed relational schema L(A). In Lines 2,3, the algorithm creates a new tuple for every group element in D. It creates a unique ID for the tuple and put it in the field ‘id’ in Line 4; and put the group element’s parent ID into ‘pid’ in Line 6. This retains the parent-child relation among the group elements. In Lines 8-11, it puts all attributes in the group in their corresponding fields in the tuple, and puts NULL in the fields for the attributes that do not exist. 32 person : name —* DOB book: author.name —* author.DOB book: isbn —* title book: isbn —* year book: isbn —* price book: isbn —* #pages book: isbn —* pub book: isbn —* author.name book: isbn — author.DOB book[author.name, author.DOB] C person[name,DOBI Table 3.2: ‘ (ZUBCBookstore) Input: DTD A = (E, A, 1, fl, r), XML database DA=(V,lab,ele,att,root) Output: Relational database DR ,. ,R,, for DR according to ji(A), where n = j ji(A) 1 1. create tables R ,and G, r 1 2. for each group node e E V, where lab(e) = G 3. create a new tuple tin table R, 4. create a unique ID UID(e) and put it in t[id] 5. if e has a parent node parent(e) and parent(e) r t[pid] = UID(parent(e)) 6. else t[pid] NULL 7. 8. for each attribute A E fl(lab(e)) if e.A does not exist 9. then t[A] = NULL 10. elset[A]=e.A 11. . . Figure 3.7: Data Transformation Algorithm For simplicity, if A is understood, we shorten the notation from u (A, D) to t(D) for the rest of this document. We demonstrate the data transformation on the XML database DUBCBOOkSIOTe (Figure 1.2). After translation, we get the three tables in Table 3.3. In Figure 1.2, there are two book instances with isbn ‘001’ and ‘002’. u (DUBCBookstore) put these two books into two different tuples in the table book. Notice each tuple has a unique 33 ID in its ‘id’ field. The two books in Figure 1.2 are under a subject node, which has a name attribute ‘DB’. The two corresponding book tuples have the unique ID of the subject tuple in their pid fields; therefore, it keeps record of the parent-child relation among groups. H id IDII (a) Translated Table ‘subject’ title id Ipid •sbn year 2 jIDi 001 2001 D 3 jIDi 002 12001 D name I DB pid NULL urviveDB PB 101 Iprice #pages uthor.namc 120 550 Smith 320 Smith 70 uLhor.DOB 19490101 19490101 pub IFreePress IFreePress (b) Translated Table ‘book’ id 4 ID pid NULL name Smith DOB 19490101 (c) Translated Table ‘person’ Table 3.3: ji (DUBCBootore) u takes a On the other hand, in Figure 3.8, data transformation function 4 relational database DR that conforms to relational schema ji (A), and produces an XML database D, that conforms to A. For simplicity, if A is understood, we shorten the notation from u ‘(A, DR) to U (DR) for the rest of this document. The reverse data transformation algorithm jr’ in Figure 3.8 does the reverse of J.L in Figure 3.7. Given output from u, it produces XML databases that are equivalent () to the input of j.t. Studying the two algorithms, we can easily derive — the following propositions: Proposition 3.1.2 V XML databases on DTD D, 34 Input: a DTD A, a Relational Instance DR Output: XML database DA that conforms to A 1. create an XML node root 2. for each table R e jt(A) 1 for each tuple t, E R 3. create an XML element 4. for each attribute A E t 5. ift.ALNULL 6. 7. e1[dj,([1d].A t.A 8. for each ed,Id created above ifpid=NULL 9. then make ejd,pjd a child of root 10. else make eid,pid a child of some 11. Figure 3.8: Reverse Data Transformation Algorithm Proposition 3.1.3 V relational databases DR on a set of relation schenws R DR 3.1.6 Correctness of XML Constraint Propagation With the constraint transformation functions defined in Section 3.1.4, we first show the preservation of FDs and 1NDs of the constraint transformation functions both from XML constraints. Then using this result, we prove the sound and complete ness of using relational constraint implication to imply XML constraints. Preservation of FDs We first claim that any XML FD is preserved by the constraint transformation. We show our claim must be correct by contradiction. If an FD F were violated after propagation, there must exist two tuples in the transformed database .t (DA) with the same set of L.H.S. attributes of the transformed FD (F), but different R.H.S. 35 attributes. Since every tuple in U(D) is traced from a group in the XML Database D, on zS that conforms to F, there. must exist two group instances in D, that have the same L.H.S. attributes of F, but different R.H.S. attributes. This would violate the assumption that D, conforms to F; thus, our claim holds. Lemma 3.1.4 VXlvlLdatabasesDA = {V,lab,ele,att,root} that conforms toDTD z, VXMLFDE fDA 1= 1 then ,i(z,D)I= ji(F). Proof F has the form: tk:SPk,1,.. lab(x) = lab(y) = ‘uk’ and A X.SPk, 1 —* . iEl,...,n Meanwhile, u(F)=Rk[SPk,1,. would exist 2 ,t 1 t e . . = SPk,. If D z F, then V x,yeV, where y.SPk,,, then X.SPk, = Y.SPk,u. ,SPfl]—*SPu. If our claim were wrong, there Rk(l.t(DA)) such that A iEI n tl.SPk,j = .SPk, 1 t2.SPk,, and t # 2 must have their Since 11 (D) is transformed from DA, the tuple ti, and t respective corresponding group nodes x,y such that A t1 .SPk,j = X.SPk,i and 2 .SPic,u. t t2.SPk,i = iEI n,u y.SPk,I. This shows x,yeV, where lab(x)=lab(y)=;, and A iEl Y.SPk,i, but x.SPk,u Y.SPk,u. Thus it contradicts D = n X.SPk,j F, and our claim holds. = • Next we show that the reverse transformation preserves FDs. We prove by contradiction. If an FD F were violated after reverse propagation, there must exist two group nodes x, y in the reverse transformed database U (DR) with the same set of L.H.S. attributes of the F, but different R.H.S. attributes. Since r’ (DR) is traced from a tuple in DR on that conforms to ji(F), there must exist two tuples in DR that have the same L.H.S. attributes of s(F), but different R.H.S. attributes. This would violate the assumption that DR conforms to every group in ji(F); thus our claim holds. Lemma 3.1.5 V relational databases DR conforms to the set of relaion schemas ji(A), where zisaDTD, VXMLFDsE Proof Let F be r,,:SPk,l,. .,SPt —* f DR j=t(F), then (DR) j=F. 1 SPk,. Then ji(F) is Rk[SPk,j,. .,SPk,fl] . —> SPk,U. If DR 1 .SPk, = 1 ,t 2 E Rk(DR), if A ti .SPk, = t2.SPk,i, then t 1= .i(F), then Vt iEl ,...,n t2.SPk,. 36 Let W’(DR) would exist x,y = e (V,lab,ele,att,root). However, if our claim were wrong, there V, where lab(x) = lab(y) = A tk, and X.SPk,j = Y.SPk,i, but Y.SPk,u. Since all group nodes in r’ (DR) are transformed from DR, the 1,t 2 such that 1 group nodes x,y must have their respective tuples t A ti .SPk, X.SPk, iEl ,...,n,u Y.SPk,i. This shows t é R,(D), and 2 , 1 t X.SPk,, and t2.SPk, = 1 .SPk. .SPk, butt 2 t , 1 # t2.SPk,u. Thus it contradicts DR A tI.SPk,í = EEl ,..,n 1= it(F), and our claim holds. I Preservation of INDs We then claim that every XIVIL IND is preserved by the constraint transformation. Data transformation transforms every group in an XML database DA to a tuple in the transformed relational database u (D). In addition, every tuple in 1 (DA) comes from a group in D. Therefore, if D satisfies an XML IND 1, then every group that appears on the L.H.S. of I must have a corresponding group on the R.H.S. of I. All of these groups and only these groups are transformed to L.H.S. and R.H.S. tables of ji(i) respectively. Therefore, i (D) satisfies .t (I). Lemma 3.1.6 V XML databases D = {V, lab, ele, att, root} that conforms to DTD u(I). z, V XML INDs I, fD 1=1, then u(D) j= 4 1 = I, then ,j. If D 1 ,. ,SP , 1 Proof I has the form: rk[SPk,1,. ,SP,fl] ç t[SP . , 1 1 = y.SP Vx e V, where lab(x) = ‘Ck, y e V, of which lab(y) t, and A x.SPk, . . . . jEl ,...,n [SPj, R ,SF]. And t(D) ensures Meanwhile, i(I) = Rk[SPk,1,.. ,SP,,,j 1 every group node z E V of group type t is transformed into a corresponding tuple t in R (ji (D)) with all the attributes of z put into the respective fields in t, and every ,. . t in R(i(D)) is traced from some group node z . . e V of group type t. Therefore, every t 1 e Rk(1(DA)) is traced from an x e V, where lab(x) = Tk. Together with the (ji(D)), i1(D) 1= 1 fact that every y e V, where lab(y) = t, is transformed into R t(I) is satisfied. I Next we show that the reverse transformation preserves INDs. Reverse data transformation transforms every tuple in a relational database DR to a group in the 37 1 (DR). In addition, every group in u’(DR) reverse transformed XML database ,r comes from a tuple in DR. Therefore, if DR satisfies a transformed IND u(I), then every tuple that appears on the L.H.S. of t (I) must have a corresponding tuple on the R.H.S. of u (I) to. All of these tuples and only these tuples are reverse transformed to L.H.S. and R.H.S. groups of I respectively. Therefore, J4_1 (DR) satisfies I. Lemma 3.1.7 V relational databases DR that conforms to relational schema u (s), VXMLINDs1 if DR 1= i(I), then 1r’(DR) =1. Proof Letlbe’rk[SPk,l,. ..,SPkn] ç ç1 [SP, ,. . ,SP R ]. 1 Then i (I) =Rk[SPk,I,...,SPk,fl] . If DR 1 t(I), then Vt e 2 E R1(DR), such that Rk(DR), t A tl.SPk,j = iEI ,...,n .SF, 2 t . 1 1 (DR) has a corre Let ir (DR)=(V,lab,ele,att,root). Every group node z in ji— sponding tuple tin DR with all values of attributes of z coming from their respective 1 e Rk(DR) fields in t. Thus every x e V, where lab(x) = Tk has an corresponding t (I), and every with same attribute/field values. Together with the fact that DR ji (Dj) =1 is satisfied. (Dj) has a corresponding y e V, where lab(y) = 1 2 eR t , I 3.1.7 XML to Relational Constraint Implication Reduction We can use relational constraint implication to test implication of XML constraints: Theorem 3.1.8 Given a set ofXML constraints Z and an XML constraint a, = a zffit() j=ji(a) To prove Theorem 3.1.8, we break it into two lemmas. Lemma 3.1.9 is re sponsible for the ‘if’ half of Theorem 3.1.8, and we prove Lemma 3.1.9 below. Lemma 3.1.10 is responsible for the ‘only if’ half of Theorem 3.1.8, and we prove Lemma 3.1.10 at the end. Lemma 3.1.9 Given a set ofXML constraints a, then ji() = (a) 38 and an XML constraint a, if 1= Algorithm TestXMLImplication(,a) Input: a set of XML FDs and INDs £, an FD or IND a Output: True if = a, false otherwise 1. return ji() 1= Is(a) Figure 3.9: Algorithm Testing XML Constraint Implication Proof For every relational database DR, Lemma 3.1.5 and Lemma 3.1.7 prove r(DR) i=if DR t)• DR)) 1= a. ByLemma3.1.4 andLemma3.1.6, 1 I1(1r ( u(a). By Proposition 3.1.3, DR (DR). By Lemma 3.1.1, DR 1= i.t(a). 1 (r If = a, then Ic 1 (DR) Therefore, i’() = = I ji(a) Lemma 3.1.10 Given a set of XML constraints E and an X?vIL constraint a, if a u() 1= i(a), then Proof For every XML database D, Lemma 3.1.4 and Lemma 3.1.6 prove u (D) 1= u() if D = Z. If t() 1= s(a), then ji(D) 1= (a). By Lemma 3.1.5 and Lemma 3.1.7, (,1(D). By Lemma 3.1.1, 1 ir’Cu(D)) = a. By Proposition 3.1.2, D ii DA = 3.1.8 a. Therefore, : a Algorithm to Test XML Constraint implication by Relational Constraint Implication From the sections above, we have proved that testing XML constraint implication can be handled by testing relational constraint implication. We build an algorithm here showing the steps required to employ testing relational constraint implication to solve the XML implication problem. The algorithm is very simple, it transforms a set of XML constraints and a single XML constraint to relational constraints, then test if the set of transformed relational constraints implies the one transformed relational constraint. 39 Guaranteeing No Interaction Between XML FDs and INDs 3.2 Now that we can use relational implication to test XML constraint implication, we can concentrate on complexity. Detecting XML constraint violation, and translat ing XML constraints via mappings all require an efficient XML constraint impli cation algorithm. XML constraint implication is undecidable because relational FD + IND impli cation is undecidable [13],[8]. As we proved in Section 3.1.6 we can use relational FD + IND implication to solve XML FD + IND implication, we inherit the unde cidability property. The undecidability complexity stems from interaction of FDs and INDs. How ever, according to Theorem 2.10.7 non-interaction of relational FDs and 1NDs is possible. Building on it, this section aims to achieve non-interaction of XML FDs and INDs. In order for XML FDs and INDs to not interact, after transformation of XML constraints to relational constraints, the relational constraints must satisfy condi tions in Theorem 2.10.7. For the rest of this section, we will introduce the necessary restrictions on our XML constraints such that after the FDs and INDs are transformed to their rela tional counterparts, they do not interact. 3.2.1 Retaining Proper Circular Property Proper Circular Property, defined in Definition 2.10.2, is one of the conditions of Theorem 2.10.7 necessary to avoid FD and IND interaction. Being able to retain proper circular property during transformation is necessary for efficient constraint implication. In order to retain the proper circular property of XML INDs, we deal with the slightly simpler retaining of acyclic XML INDs first. Extending on that result, we then explain how to retain the proper circular property. 40 root [Y G ] Xi] c 4 G [ 1 [pid] C G[id] 4 G 33 C G[Y] G ] [pid] G 2 G [id] 1 c (b) Transformed circular Relational INDs (a) DTD with Circular INDs after Transformation Figure 3.10: Example Showing Circular INDs after Transformation Retaining Acyclic Property Intuitively, each IND directionally connects two DTD groups. Two INDs can chain together to form a longer path if the DTD group on the R.H.S. of one IND is the DTD group on the L.H.S. of the other IND. For example, if the first 1ND connects , then there is a path from 3 2 to G 1 to G , and second IND connects G 2 DTD group G . A path becomes a cycle when a DTD group appears in a path more than 3 1 to G G 1 to the 3 to G once. For example, if we add a third ND that connects DTD group G two INDs above, we will form a cycle. Transforming a set of XML INDs with cycles to relational INDs will definitely violate the acyclic property (Definition 2.10.1). Consequently, FDs and INDs will interact, and we cannot efficiently test constraint implication. In addition, transfor mation created INDs involving ‘id’ and ‘pid’ may complete cycles that do not exist before; therefore, simply avoiding cycles in the set of XML INDs is not enough. [Y] is demonstrated by an arrow from G, to G. 1 In Figure 3.lOa, ND G,[X] ç G 1 corresponds The INDs in Figure 3.1 Oa is transformed to Figure 3.1 Ob with each R to G. As shown in Figure 3. lOb, the transformed INDs complete a cycle after the addition of transformation generated INDs. Our digraph has leaf DTD groups, DTD groups with no children, as vertices demonstrated by encircling ellipses. Arrows between vertices if there are INDs among ancestors of the 2 leaf nodes. Cycles in this digraph illustrate possible cy 41 des in the relational INDs, and the cycles are complete if within each, connected vertex, the group pointed to by the arrow head (incoming IND target) is a descen dant of the group that the outgoing arrow (outgoing 1ND source) points away. This is because after transformation, there will be INDs from descendants to ancestors, thereby completing circles by linking arrows within each vertex. Below, we give a set of formal conditions for a set of INDs to retain the acyclic property after transformation. Given a DTD A = (E,A, c1, fl, r), and a set of INDs I. [id] I Gi,G 2 I U {Gi[pid]cG 2 parent(Gi)} has cycles if: e E and G X]çG Y}, 2 [ 1 3aels.t.c:G 1. [ where G 1 , or ancestor_or_self(G ) 2 2. We construct a digraph where leaf DTD groups, DTD groups with no children, make up the vertices. The directed edges are: ,V are leaves, 1 V G e B, 2 V G , 1 [Y] E I, where 2 2 [X] C G 1 { (Vi,V) I G and G 1 2 G e ancestors _or..self(Vi), e ancestors_or_self(V2) } If a cycle C exists in this digraph, we call I circular if for each V 1 in C there is an IND coming in o: _il ç G[Y], and 1 G_i[X ] 1 there is an IND going out o: G[X G ÷i], and 1 1 [Y G ) 1 e descendants_orself(G Retaining Proper Circular Property Given the above requirements for retaining the acyclic property, we want to extend the property to proper circular INDs. _ [X 1 ] c _ 1 Recall, in Definition 2.10.2, that for any two consecutive INDs I: R . Sim 1 ] in a proper circular IND cycle, X = X 1 1 [X :1 [X’J, and 1 R [X C R R ilarly, in a proper circular cycle of XML INDs, any two consecutive ENDs I: . 1 X] C 1 ] C t4Xfl, and 11+1 : t _ 1 [x 1, x, = x 1 x!+ 42 The transformation generated INDs are on newly created attributes ‘id’ and ‘pid’ that do not exist in the original sets of attributes. There is no chance the transformation generated INDs can form proper circular cycles with the original set of INDs. In other words, the transformation generated INDs may complete cycles, but they never complete proper circular cycles. As a result, if there are no proper circular cycles in the original set of XML INDs, it is not possible to have proper circular cycles after transformation. Therefore, retaining proper circular property if the original set of of INDs I has no proper circular cyles, is the same as retaining acyclic properly in the last Section. However, if there are proper circular cycles in the original set of INDs, we want to make sure the transformation generated INDs do not interfere with the existing cycles. To avoid such interference, any IND in the proper circular cycle must not 1 [X C G[XJ where G 1 G be: ] 3.2.2 e descendant(G). Retaining Reduced Property Analogous to the projection definition in Definition 2.10.5 for relational FDs, we define projection for XML FDs below. , denoted 1 1 onto a set of attributes Y C G The projection of a set of FDs F over G byF[Y],isgivenbyF[Y]={G:W—>ZR:W-->ZEFandWZcY}. Again, analogous to the reduced definition in Definition 2.10.6 for relational constraints, we define the property reduced for XML constraints below. Definition 3.2.1 A set of attributes Y ç G is said to be reduced with respect to G 1 is understood 1 (or simply reduced with respect to F if G and a set of FD5 F over G from context) if F [YJ contains only trivial FDs. A set of FDs and INDs £ = F U I is said to be reduced if VG,[X] G[Y] E I, Y is reduced with respect to F. This definition is parallel to its relational counterpart. The transformation gen erated INDs involving attributes ‘id’ and ‘pid’ are all unary INDs. Thus they can not violate the reduced property. Moreover, the transformation generated FDs only have ‘id’s on the L.H.S. of the FDs. Since none of the original INDs have at tributes involving ‘id’s and the transformation generated INDs are all unary, the transformation generated FDs do not cause violation to a set of FDs + INDs that 43 are reduced. Therefore, the above XML reduced property definition is sufficient without the need to consider transformation generated constraints. Inability to Retain Intersection Property 3.2.3 : ‘id’ —* A,VA e R,. Thus it 1 The transformation generated FDs are in the form: R will violate the intersection property (Definition 2.10.4) if there are any non-trivial FDs in the original set of FDs. The violation arises because V FD F = R,: X —* A in the original set of FDs, ‘pid’. On the other hand, a transformation generated FD exists A, and ‘id’ flX = 0. Since we assumed F is not a trivial FD, it is not ‘id’ and A A R,: ‘id’ —> possible for 0 —* A. Therefore, it is impossible to retain intersection property if the original set of FDs is not empty. 3.2.4 Retaining Split-freeness Property In the last section, we have shown the inability to retain intersection property. Luckily, it is not a necessary requirement. Non-interaction can still be achieved if split-freeness property (Definition 2.10.3) can be retained. Two nontrivial XML FDs of the forms G:XB —* A and G:YA —÷ B are said to be cyclic. Definition 3.2.2 A set of FDs F over DTD i satisfies the split-freeness property ifVFj e E whenever there exist cyclic FDs, G:XB —* A, YA —+ B E Fj, then either G:Y —* B E Fj or G:(XflY)A —* BE The above definition is parallel to its relational counterpart. The transformation generated FDs always have ‘id’ on the L.H.S., while ‘id’ is on the R.H.S. of FDs in neither the original set of FDs nor the transformation generated FDs. Thus they never form cyclic FDs with themselves or FDs from the original set of FDs, and split-freeness retention is ensured. 44 3.2.5 Summary of non-interaction between FDs and INDs We have demonstrated the inability to retain the intersection property in Section 3.2.3, but defined properties of proper-circular in Section 3.2.1, split-freeness in Defini tion 3.2.2, and reduced in Definition 3.2.1 in XML context. A set of XML con straints satisfying these properties will have their transformed relational constraints satisfying proper-circular (Definition 2.10.2), split-freeness (Definition 2.10.3), and reduced (Definition 2.10.6) properties too. Therefore, the transformed constraints can enjoy non-interaction between FDs and INDs. This is a big step towards achieving efficient XIVIL constraint implication. 3.3 Efficient implication of XML constraints In Section 3.1, we have established that the problem of testing the implication of XML constraints can be reduced to the problem of testing the implication of relational constraints. We have also defined the necessary conditions in order for the transformed FDs and INDs not to interact. As a result, we can test implication of IND and FD independently. In Section 3.3.1, we will show that typed XML INDs become typed relational INDs after the transformation in Section 3.1.4. After that, we describe a polyno mial implication algorithm for typed relational INDs in Section 3.3.2. At the end, we briefly outline a linear implication solution for relational FDs in Section 3.3.3. 3.3.1 Retaining Typed property [1] shows relational FD implication has a linear solution, but IND implication is PSPACE-complete; therefore, we need an efficient solution. Fortunately, typed relational IND implication is polynomial. Using the typed XML IND defined in Definition 2.9.2, we show the typed property is retained after transformation. It is easy to see that the original set of INDs do not violate the typed property after transformation. However, the transformation introduces extra INDs involving ‘id’ and ‘pid’. We define a new type for ‘id’ from each transformed relational table. Each ‘pid’ can only be involved in one IND, and its R.H.S. is an ‘id’ in another table. Thus this ‘pid’ belongs to the same type as the type of the target ‘id’, and these 45 transformation generated INDs do not cause typed violations among themselves. Since these new INDs and the original set of INDs deal with two mutually exclusive sets of attributes; in addition, FDs and 1NDs don’t interact, there is no interaction of the two sets of INDs. Therefore, the typed property is retained after transformation. 3.3.2 Implication for typed relational INDs [7] shows the implication of typed relational INDs can be solved in polynomial time by solving a digraph reachability problem. We will describe the solution below. 1 [Ui],... ,R[U]}, and a set of typed {R 1 [x]. INDs over R. We want to check if E implies an typed IND a = Rk [X] ç R ,.. .,R} and (Rp,Rq) E E if 1 Construct a digraph G = (VE), where V = {R Given a set of relational schemas R there is R[W} ç Rq[Wj in . 1 R = such that XW. = a if there is a path from Rk to Checking for existence of such a path is a digraph reachability problem, and it can be solved by breadth first search (BFS). Let the number of INDs be m, and the total number of attributes (union of attributes from all relation schemas) be n, the time complexity for BFS is 0(m), and the space complexity is 0(m). However, ) time, but 0(n) 2 testing for attribute subsets during digraph building takes 0(n space (time complexity can be lowered if we accept a trade-off for higher space , and the overall O(m+n ) complexity). Therefore, the overall time complexity is 2 space complexity is 0(m+n). 3.3.3 Implication for relational FDs Relational FD implication is linear time [1]. [1] also describes the algorithm to test FD implication. The algorithm tests for implication by calculating the FD closure (attributes that the input set of attributes implies) on the L.H.S. attributes of the FD being tested. 46 3.3.4 Summary of XML Constraints Implication In this chapter, we have reduced XML constraint implication to a relational con straint implication problem. Trying to find an efficient implication algorithm, we have adopted proper-circular (Definition 2.10.2), split-freeness (Definition 2.10.3), and reduced (Definition 2.10.6) properties from [11]. We modified them to apply to XML FDs and INDs. As a result, we have shown how to prevent interaction be tween XML FDs and INDs. In addition, to make the problem of IND implication tractable, we have introduced the typed property in XML. Finally, we have em ployed existing efficient implication algorithms on relational FDs and typed 1NDs to efficiently test implication of XML constraints. 47 Chapter 4 Constraint Propagation When data is translated in the form of one schema to another, it carries some of the semantics with it. In this section, we introduce algorithms to translate constraints 5 to a target DTD AT via a set of HepToX mapping rules M from a source DTD A (defined in Section 2.7). As we will soon find out the constraint translation algorithms are decidable, but take exponential time because our algorithms compute closures. The expo nential complexity makes this solution inefficient, especially when a large set of constraints are considered. Computing a cover of FDs on the decomposed relations in relational func tional dependency preservation also takes exponential time. Similar to the alter nate method in dependency preservation, we define algorithms to check constraint on As, and a constraint aT on AT, propagation. That is given a set of constraint , if I3 Zs implies M(Ds) 1= 5 5 on A whether V XML database D 4.1 Constraint Translation Data can be translated via mappings; data semantics can also be translated. In this section, we construct algorithms to translate XML constraints via HepToX mappings. on As, 6 (s, M) denotes the translation of M) produces a set of via mapping rules M to a set of target constraints. 6 Given a set of constraints 48 constraints on the target DTD AT such that: 5 on As, if I3 VXML database D Es, then M(Ds) 1 ö(Es,M) If E 5 satisfies the conditions set in Section 3.2, there will be no interaction of FDs and INDs in constraint implication. Since we are translating source con straints, the non-interaction is very important to make the translation efficient be cause dependency preservation requires implication (possibly computation of clo sures). Therefore, we assume E satisfies conditions in Section 3.2. Our problem shares some similarities to the (functional) dependency preserva tion problem in the relational context. In fact, for the problem of FD translation, it employs the same algorithms as dependency preservation. 4.1.1 FD Translation Each mapping rule M 1 e M has a head and body. The rule head is a tree expression of a path from the target root to a group node. On the other hand, the rule body is a tree expression from the source root to multiple source group nodes. Each target attribute is only allowed to have one correspondence. With this re striction, there does not exist the problem of union of values from different sources. Therefore, FDs translated from one source to a target group will not be interfered by data from other sources. With the above established, we introduce the FD translation algorithm in Figure 4.1. As in dependency preservation, this is an algorithm that requires computation of FD closures; thus it requires exponential time. The algorithm is simple and helps to express the intuition behind FD translation. , the algorithm first computes the closure F of 5 Given a set of source FDs F 1 of each mapping rule M, to see if F It then examines the leaf target group TO . 5 . If an FD satisfies these 1 any of the FDs in F have all their attributes mapped to TG 1 and added to the output set of conditions, then it is translated to its form in TG target FDs FT. The actual FD translation at line 7 is very simple. Group instances of SG will have their data in X U Y transferred unchanged and as a whole to attributes group . Therefore our translation is sound. 1 instances U U V of TG 49 Algorithm 8(Fs,M) 5 over source DTD A, Input: a set of FDs F 5 to IXT a set of mapping rules M from z\ Output: a set of FDs FT over target DTD T 1. computeF.={SG:X—*YFs =SG:X—*Y} 2. for each M e M , 1 1 has a rule head which contains a target path TP 3. M 1 is a path of groups root,.. ,TG, TP 1 1 e rule body of M 4. for each group SG for each FD=SG:X—* Y E F 5. , 1 if M maps X U Y to attributes U U V in TG 6. . wherelXl=zIUIandIYI=IVI 7. thenFT=FTU{TGI:U—*V} Figure 4.1: FD Translation Algorithm — ö(Fs,M) This FD translation algorithm runs in exponential time; thus it is not very effi cient. 4.1.2 IND Translation IND Translation Algorithm as shown in Figure 4.2 translate a set of IND5 Is on the source schema Ly, to a set of INDs ‘T on the target schema Line 2,3 finds out all the DTD groups on both source and target schemas, and label each of them by indexes. Line 6-10 finds all the correspondences between DTD groups and store them in a 2-dimensional array W. We then compute I has attributes closure of js (all the INDs that can implied by is). If any IND I’ e 4 on its L.H.S. fully mapped to a DTD target group TGi, and attributes in its R.H.S. fully mapped to another DTD target group TGqI. Then we add this newly translated INDtOIT. Even though typed IND implication is polynomial, (Is, M) computes closure of I; thus it runs in exponential time and is not an efficient algorithm. 50 Algorithm 6(Is,M) Input: a set of INDs 1 s over source DTD Deltas, a set of mapping rules M from Deltas to DeltaT Output: a set of INDs IT over target DTD DeltaT 1. 1T{ } 2. examine M, find out all the source groups SGs and target groups TGs 3. index SGs and TGs 4. n=sizeOf(SGs) 5. m=sizeOf(TGs) 6. create a 2-dimensional array W of size n x m 7. foreach M E M 1 has a rule head which contains a target path TP, M 8. 1 TF is a path of groups root,. ,TG 1 3 e rule body of M foreach group SG 9. 10. W[j] [i] = {(As,p,AT,q) I As,p E R(SG j),AT,q E R(TG), there is a correspondence between As, and Ar,q in M} 11. compute I from Is 12. foreach I E I, where i=SGp[X]cSGq[Y1 13. forr=ltom if (A,A’)EW[pJ[r], VAEX 14. p’=r 15. if (B,B’)eW[q][r], YBeY 16. 17. q’ =r 18. X’={} 19. foreach AEX X’=X’u{A’}, where (A,A’)EW[p][p’] 20. 21. Y’={} 22. foreach BEY Y’=Y’ U{B’ }, where (B,B’ )EW[qJ [q’] 23. 24. ‘T 1TU{TG1[X] TGqi[Y’]} . . Figure 4.2: IND Translation Algorithm 51 — ö(Is,M) 4.2 Testing Constraint Propagation Given the undesirable exponential time complexity of constraint translation in Sec tion 4.1, we want to find an efficient solution to check constraint propagation. In this section, we develop efficient algorithms to test constraint propagation. We give our problem statement below: Given a set of constraints £s on source DTD Ay, a set of mapping rules M, and via M if V XML a constraint GT on target DTD ar is propagated from database D that conforms to S and M(Ds) 1= o and every attribute in T has a correspondence in S. We donate this Z j=M 0 T. Unlike traditional dependency preservation in which all the target attributes are mapped, our users may not create correspondences for all the attributes. Al though a constraint may be satisfied at the target because of missing values (from target attributes without correspondences), we feel a constraint cannot be consid ered ‘propagated’ if some of its attributes are not mapped. Therefore, we require to have a correspondence in L. Even though we have already established polynomial implication of XML con straints, testing constraint propagation is still not a trivial problem. We have to be all attributes in aT very careful to construct our environment to achieve decidability and tractability. Firstly, users are allowed to create high-level correspondences between schemas that greatly reduce complications. The forbidding of multiple source correspon dences to a single target node takes away the union operator. It guarantees data will come from only one part of the source. Secondly, one may suspect that the set of translated constraints will not sat isfy conditions set in Theorem 2.10.7; thus may lead to undecidability in testing propagation. In target constraint implication ZT j= aT, all XML database DT that conforms to and T has to satisfy aT. However, in constraint propagation, we and are only considering a subset of databases, namely VDs that conforms to M(Ds) 1= IT, and we need to check if M(Ds) 1= aT. If we can test the constraint in , in its corresponding form in the source, then we can guarantee trans lated data coming from that part of the source will always satisfy aT. Thirdly, there is the issue of NULL and missing values. NULL values can occur question aT if data coming from the source already contains NULL values, or an attribute in 52 the target has no correspondence (not mapped) in the source. In the prior case, if the NULL values in a source XML database D violates source constraint s’ then D will be violating the assumption of constraint propa gation. Therefore, there is no propagation. On the other hand, if VDs = Ys and D does not contain NULL values, M(D) fr=M UT, then VDs F= Ys and D contains NULL values, M(D) I=M UT. when a D with NULL values satisfies Es, we can substitute unique values in place of these NULL values to produce 4, and it will still satisfy Es. If M(4) 1= Ui’, This is because then M(Ds) 1= oi- because these unique values generated behave exactly as NULL values in this case as they do not equate to any other values. Therefore, the existence of NULL values do not prevent UT from propagating from Es. In the latter case, the target attribute has no correspondences in the source; thus it is impossible to find a corresponding source constraint. It follows if UT being tested involves such attributes, it is not propagated. As with the previous algorithms, we are able to guarantee no interaction of FDs and INDs in Section 3.2; therefore, we can separate the testing of propagation into two algorithms; one will test the propagation of FDs while the other is on INDs. 4.2.1 Testing FD Propagation The intuition behind the FD Propagation Testing algorithm (Figure 4.3) is to trace data from the source that will make up the data involving the FD U {TGk : X —* A} in question. For such data to conform to UT after translation, it must conform to some constraints in the source. Hence, we must first find out what these constraints 5 given in the source can imply them. are, then test if the FDs E Due to the one-to-one nature of our mappings, every mapped target attribute must be traced to one and only one attribute in the source. Thus checking if these traced attributes satisfy a corresponding source FD is sufficient to test propagation of UT. To find this FD in the source, the first attribute we should look at is A on the R.H.S. of UT. In line 8,9, we use the binding variable $A to trace the mapped attribute A’ in the source. In doing so, we also learn the source group SGk in 53 which A’ belongs. Since the definition of FDs confines an FD to attributes within a single group and our mapping rules do not consider joins among source groups, the corresponding FD we need to check must be in SGk where A’ belongs. On a side note, we only need to check attributes in SGk because of no interac tion of FDs and INDs. Therefore, we don’t need to worry about attributes in other groups having an effect on the FDs in SGk. This also explains why the algorithm only accepts FDs for s• We can easily separate INDs and FDs from each other in a set of mixed constraints, and feed this algorithm with just the FDs. Next in line 11-14, we look for all the attributes in X that are mapped from SGk, and put all their corresponding source attributes into X’. Notice that X’ I X I’ X is a superset of mapped attributes from X’. Due to the augmentation rule of FD implication, if a subset of X implies A, then X must imply A. At the end, we check if I implies SGk : X’ = A’, returns true if it does, and false otherwise. The algorithm checks FD implication at the end, while the rest of the algorithm runs in time 0(m), where m is the number of correspondences. Thus the algorithms runs in linear time because testing implication of FDs runs in 0(n), where n is the number of attributes. For example, we can check if LibFD#1: UBCLibrary.person : name in Table 1.2 is propagated from —+ DOB UBCBooksrore via the set of Mapping Rules M in Figure 1.4. At Line 8 of our algorithm, we find that the R.H.S. attribute DOB of LiBFD#1 appears in the mapping rule marked as ‘1’ in Figure 1.4, and it is assigned the variable $DOB in target DTD group person. At Line 9, we find $DOB is bound to the attribute DOB of the source DTD group person in the body of mapping rule ‘1’. At Line 12, name, which appears on the L.H.S. of LiBFD#1, is assigned to $name. At Line 13, $name is bound to the attribute name of the same source DTD group person as DOB. At Line 15, we test if UBCBookstore UBCBookStore.person : name —÷ DOB. This is of course true because the FD is the same as BookFD#1 in ZuBcBookstore; thus our algorithm returns true. 54 Algorithm TestingFDPropagation 5 Input: a set of FDs Es over source DTD A DTD A over target : X —‘ an FD 0 AT, = TGk T maps 5 A rules that of to M a set mapping AT Output: true if ESl=MoT 1. 2. 3. 4. 5. 6. 7. 8. M=null eM 1 forM 1 =root,.. ,TG 1 ) is a target path of groups TP 1 head(M ifTG,=TGk then Mk=Mj if Mk=null then return false in head(Mk) the leaf group assignment has the form: TGk—fk(...)[ . A —* $A, we denote it as headTGk(Mk) 9. in body(Mk), there exists a source group (ignoring its descendant groups): SGk —f $SGk [ A’—$A, we denote it as bodysGk (Mk) 10. X’={} 11. foreachB E X 12. there is B —* $B in headi-Gk(Mk) 13. if there exists B’ —* $B in bodysuk(Mk) then X’=X’u{B’} 14. 5 =X’—*A’ 15. returnE Figure 4.3: Testing FD Propagation Algorithm 55 4.2.2 Testing IND Propagation Testing IND propagation is similar to testing FD propagation in the previous sec tion in that we trace attributes through mappings to identify the one corresponding IND in the source and check if a set of source INDs s implies it. Similarly, because of no interaction of FDs and INDs (Section 3.2), we don’t have to consider FDs in our algorithm. In addition, our mapping rules use Cartesian product of source groups to product instances of target groups; thus there is no equality of attributes among source groups. Therefore, for each left and right hand sides of the traced IND, we only need to consider attributes within only one source group for each side. Our algorithm traces source attributes of both L.H.S. and R.H.S. of the IND in ,...,X] ç TGq[Yi,...,Yn]. It makes sure the traced set of 1 question UT = TG[X attributes X’ of the L.H.S. belongs to only one source group, and the traced set of attributes of the R.H.S. Y’ also belongs to only one source group. Notice that both X’ and Y’ are lists because the L.H.S. and R.H.S. sets of attributes in an IND are sequences, and their order matters. Moreover, unlike in FD propagation, IND implication does not have augmen tation (augmenting attributes to attributes already in the constraint produces satis fying constraints). Hence, I 1=1 X 1=l Y’ have a corresponding attribute in the source. Y , and every attribute in IND must Finally, if we can find X’ and Y’ which both contain attributes within their respective source groups SG and SGq, then S =M T if s = SG [X’] SGq [Y’l. The algorithm checks IND implication at the end, while the rest of the algo rithm runs in time 0(m), where m is the number of correspondences. Thus the ) time, where n is the number of attributes, because testing 2 algorithms runs in 0(n ). 2 implication of typed INDs runs in 0(n For example, we can check if LibIND#1: UBCLibrary.publication.call#.author[name, DOB] ç UBCLibrary.person [name, DOB] in Table 1.2 is propagated from EUBCBookstore via the set of Mapping Rules M in Figure 1.4. At Line 13 of our algorithm, we find that the L.H.S. attributes name and DOB of LiB1ND#1 appear in the mapping rule marked as ‘2’ in Figure 1.4, and 56 Algorithm TestinglNDPropagation Input: a set of INDs s over source DTD As, ,. 1 TGq[Yi,. , Y,] over target DTD AT, an IND aT = TG[X 5 to AT a set of mapping rules M that maps A Output: true if s j=M aT . . . . 1. MpMqfluIl 2. forMEM ) is a target path of groups TP=root,.. ,TG, 1 head(M 3. ifTG=TG 4. 1 then M=M 5. ifTGiTGq 6. then MqMi 7. 8. if M=nulI or Mq=nuIl then return false 9. 10. SGpSGqI)Ul1 11. X’=Y’=() 12. for each X, E (Xj,. , X,) 13. if ,X —’$Xe head(M) 1 then return false 14. 15. there exists X —* $X 1 e a source group SG e body(M) 16. if SG ynul1 and SG SG then return false 17. 18. SG=SG 19. X’.append(X’) (Y ...,Y) 20.foreachYE , 1 21. if / $Y e head(Mq) then return false 22. 23. there exists Y’ —* $1’, e a source group SG E body(Mq) 24. if SGq #null and SGq # SG then return false 25. 26. SGq = SG 27. Y’.append(Y/) 28. return Ys 1= SG[X’j ç SGq[Y’l . . . — Figure 4.4: Testing ND Propagation Algorithm 57 they are assigned the variables $name and $DOB in target DTD group publication respectively. At Line 15, we find $name and $DOB are bound to the attributes au tho, name and authorDOB of the source DTD group book in the body of mapping rule ‘1’. At Line 21, we find that the R.H.S. attributes name and DOB of LiBIND#1 appear in the mapping rule marked as ‘1’ in Figure 1.4, and they are assigned the variables $name and $DOB in target DTD group person respectively. At Line 23, we find $name and $DOB are bound to the attributes name and DOB of the source DTD group person in the body of mapping rule ‘1’. At Line 28, we test if: UBCBookstore = UBCBookStore.subject .book[author.name, author.DOB] c UBCBookStore.person [name, DOB) This is of course true because the IND is the same as BookTND#1 in UBCBookStore thus our algorithm returns true. 4.3 Constraint Propagation Summary To test constraint propagation, we first attempted to translate all source constraints to a cover of target constraints, then one could use implication algorithms in Chap ter 3 to check if the target constraint 0 T is implied by this cover. However, testing constraint involves the computation of closures; thus making the algorithm to run in exponential time. Therefore, we developed constraint propagation testing algo rithms that avoids computing a cover of target constraints by translating oT back to its form in source schema, and testing if s implies it. The propagation testing algorithms run in polynomial time. 58 Chapter 5 Detecting Constraint Violations When a constraint propagates, it guarantees that translated data does not violate the constraint at the target. However, when one does not propagate, it does not necessary imply violation. For example, in Figure 1.1 it is obvious that LibFD#3 in the target DTD UB CLibrary is propagated; thus it will be satisfied by data translated from UBCBook store using the specified mapping. However, LibFD#7 is not propagated. According to the mapping rules, no data will be translated under the group publisher. Since translated instances never contain publisher, LibFD#7 cannot be violated. This example demonstrates that propagated constraint implies satisfaction, but constraint satisfaction does not necessary imply propagation. As in checking constraint propagation, we can check for the violation of a target FD by only a set of source FDs and mapping rules independent of any INDs, and vice versa for checking IND violation. In the following two sections, we are going to introduce algorithms to check for FD and IND violations respectively. 5.1 Checking for FD violation The algorithm Figure 5.1 presented below takes a set of FDs s over source DTD and a set of mapping rules M. It , an FD 0 5 b T = TGk : X —* A over target DTD 59 Algorithm Checking for FD violation Input: a set of FDs s over source DTD As, an FD ar = TGk : X —* A over target DTD AT, 5 to Ar a set of mapping rules M that maps A. over instance I3 Output: true if 3 an XML As that conforms to s, s.t. M(Ds) 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. if TestingFDPropagation(Es, ar , M) then return false M=null forM,EM =root,. 1 ) is a target path of groups TP 1 head(M ifTGj=TGk then Mk=Mj if Mk=null then return false foreachX, eX 1 if ,X 1 e head(Mk) $X then return false return true . . 1 ,TG — Figure 5.1: FD violation Checking Algorithm 5 over A that conforms to s would violate ar determines if any data instances D 5 through M. after the translation of D Propagation implies no violation; therefore, line 1,2 of the algorithm in Figure 5.1 check for propagation, and return false (no violation) if o- is propagated from s through M. For M(Ds) to violate ar, we need all attributes on the LH.S. of 0’j’ to be equal on two instances x,y over group TGk, but x.Ay.A. As our stance on NULL and missing values is that they are not comparable; thus x.X and y.Y cannot be equal if any of them contain NULL values or are missing some values. Hence, in order to create violations, all attributes in X must be mapped. On the other hand, if all attributes in X are mapped, whether A is mapped is 60 irrelevant. If A is mapped, together with the fact that T is not propagated, we can 5 such that there are translated instances x,y in M(Ds) and x.X=y.X always find D 5 such that there but x.Ay.A. If A is not mapped, then we just need to find D are translated instances x,y in M(Ds) and x.X=y.X. The absence of mapping of A guarantees x.Ay.A. This algorithm first checks for propagation which runs in 0(n) time, where n is the number of attributes. The rest of it runs in time 0(m), where m is the number of correspondences; thus, we can check for FD violations in 0(n) time. For example, we can check if LibFD#7: UBCLibrary.publisher: name —* addr will cause violations after translation. It is obvious that LibFD#7 is not propa gated; therefore, testFDPropagation at Line 1 is false. The algorithm go through all the mapping rules M (Figure 1.4) at Lines 4-6, but will not find a mapping rule mapping to the publisher DTD target group node because there are not corre spondences to publisher. Therefore, at Line 9, our algorithm returns false; thus it correctly answers LibFD#7 will not cause violations. 5.2 Checking for IND violation The algorithm in Figure 5.2 takes a set of INDs £s over source DTD i, an 1ND ,...,X] TGq[Yi,. ..,Y,] overtargetDTDAT, and asetof mapping 1 YT = TG[X 5 over S that conforms to rules M. It determines if any XML database D 5 through M. violate ar after the translation of D Similar to Figure 5.1, we check for propagation first. If aT will propagates, then there cannot exist violations. In order for M(D) to violate ar when ar is not propagated, it must contain an 1 y.Y. instance x of group TG, andy of group TGq, such that i e [1,.. , n], x.X . 1 or Y is irrelevant. This is In this case the existence of mappings for any X because, if mappings are missing for any of X, or Yj, the missing values guaran tee non-equality; thus they lead to violation. On the other hand if all K, and l s.t. with Ds are mapped, we can easily find an XML instance I3 over M(Ds) a. Failure to find such D implies s 1 =M aT which contradicts our as sumption that aT is not propagated. 61 Algorithm Checking for IND violation , 5 Input: a set of INDs s over source DTD A an IND 0 TGq[Yi,.. .,Y,J over target DTD AT, T = TG[Xi,.. ,X,] 5 to AT a set of mapping rules M that maps A over 5 A Output: true if an XML instance D that conforms to Es, s.t. M(Ds) c . 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. LHGroupFound=false forM,EM ) is a target path of groups TF=root,. 1 head(M ifTGETF then LHGroupFound=true if LllGroupFound=false then return false if TestinglNDPropagation(Es, T, M) then return false else return true . . 1 ,TG Figure 5.2: IND violation Checking Algorithm Therefore, the checking for IND violation is surprisingly simple; if any at tributes in TG or its descendants are mapped, then it is just the opposite outcome of testing IND propagation. Otherwise, since there can be no translated instances of TG; thus aT is satisfied because there are no instances on the L.H.S. of aT to violate it. Checking existence of mappings in TG or its descendants takes 0(m), where ) time, 2 m is the number of correspondences. Hence, the algorithm runs in 0(n where n is the number of attributes, because the IND propagation testing algorithm ). 2 that we employ runs in 0(n For example, we can check if LibIND#2: UBCLibrary.publication[pub] UBCLibrary.publisher[namej will cause violations after translation. At Line 3, the algorithm finds the L.H.S. tar get group publicatoin is in mapping rule marked as ‘2’ in Figure 1.4. Since there will be instances on the L.H.S. of the IND, the algorithm employs the TestinglND 62 Propagation algorithm in Section 5.1 to test if LibTND#2 propagates. The R.H.S. of LibIND#2 is not mapped; therefore, LibIND#2 does not propagate, and our algorithm returns true (meaning LibIND#2 will cause violations). 63 Chapter 6 Resolving Violations With the ability to detect constraint violations, we are setup to fix violations in this chapter. However, violations may be caused by many factors such as human error and heterogeneous schemas. There is no way to determine the reasons behind the violations automatically; only the person who creates the correspondences can decide how best to rectify the violations. The purpose of this chapter is to give options to the person trying to repair constraint violations. Each solution has its advantages and disadvantages, and we will discuss them in the following sections. Which method is best suitable for a correction differs depending on different situations. Therefore, we will leave the final decision making to administrators creating the correspondences. Unlike the rest of this thesis, the repairs presented here are heuristic solutions. There is no guarantee that one repair won’t cause other violations. They merely provide options for administrators to produce better quality mappings. We categorize the repairs into repairs that fix FD violations and the ones that fix IND violations. 64 6.1 Repairing FD Violations 6.1.1 Scaling Back Studying the algorithms in Detecting Constraint Violations Chapter 5, one can eas ily come to the conclusion that removing some of the correspondences can fix violations. This is indeed a method that is consider the safety net because it works in all situations. However, this has to be considered a pessimistic option as the resulting mappings will translate less information after deleting some correspon dences. In our example in Figure 1.1, we have shown that the FDs LibFD#4 and LibFD#5 are violated. Here, in order to demonstrate better, we simplify our example by re taining only attributes and correspondences that are necessary for our demonstra tion, and it is shown in Figure 6.1. UBCBookstore subject name book \ isbn Figure 6.1: FD Violating Example An easy way to fix the violations is to delete one of the correspondences to subject, authoi name, or year. By doing so, there will be missing values of the L.H.S. of LibFD#4 and LibFD#5. Hence, L.H.S. of LibFD#4 and LibFD#5 will never equate, and the two FDs cannot be violated. , 1 , SP, —* SP is violated, then delete the cor Formally, if FD a = TGk SP . . . respondence for TGk.S19 for any i € [1,... ,n] In our example, if we delete the correspondence between years of the two 65 schema, the translated data will be missing the attribute year. The resulting data will satisfy LibFD#4 and LibFD#5 because of the missing values of year. 6.1.2 Source Concatenation Scaling back deletes correspondences; thus some information is lost. It can be undesirable. As seen in the last section if we delete the correspondence between years of the two schemas in Figure 6.1 the resulting data will satisfy LibFD#4 and LibFD#5. However, the resulting translated database will be missing year; users are going to have a hard time searching for publications. In this section, we introduce a method that let us keep as much information as possible while satisfying target FDs. Here, we introduce a new type of correspon dence, the concatenating correspondence. A concatenating correspondence connects multiple source attributes within one DTD group to a single target attribute. During translation, each source group in stance will concatenate the strings of the source attributes of the correspondence and put the resulting string in the target attribute. UBCLibrary UBCBookstore publication subject name \c book Figure 6.2: Fixing FD Violation by Concatenation Example Figure 6.2 shows a concatenating correspondence connecting isbn and year at the source to year at the target. Translating the instance from Figure 6.2 will produce Figure 6.3. It is clear that the new translated instances no longer violate 66 UBCLibrary publication publication call# Smith title #pages .7 fl.......DB 1 540 subject author ,// DB name call# title #pages ,7’ subject author ,// DB year \DOB 2001001 name Smith 19490101 320 year \DOB 2001002 19490101 Figure 6.3: Translated UBCLibrary After Repair By Concatenation Mapping LibFD#4 and LibFD#5. This is achieved by concatenating attributes that implies the corresponding R.H.S. of target FDs to attributes of L.H.S. of those FDs. In our example, isbn implies title and #pages which are correspondences of R.H.S. of LibFD#5 and LibFD#5. Thus we concatenate isbn to a correspondence that is in the L.H.S. of LibFD#4 and LibFD#5. Administrators establishing the corre spondences need to pick attributes in the source that implies A’, where A’ is the source correspondence of A in the violated target FD uT TGK : X —* A. Which — correspondence to concatenate to in X is totally up to the administrator establish ing the correspondences. Here we choose the correspondence between year, and concatenate values in isbn to every year translated to UBCLibrary. 6.2 6.2.1 Repairing IND Violations Scaling Back When an IND is violated, there are instance on the L.H.S. of the violated IND aT that do not have corresponding instances on the R.H.S.. One way to avoid IND violation is to make sure no instances on the L.H.S. exist. Again, we can achieve this by deleting correspondences such that no instances appear on the L.H.S. of aT. Since we treat missing values in a group instance as NULL values, simply deleting the correspondences that makes up the L.H.S. attributes is not enough. We need to delete all the correspondences related to the 67 name Figure 6.4: IND Violation Example target group so that no group instances will ever appear. Figure 6.4 shows just the part of our running example that violates LibJND#2. Following the correspondences, instances of publications are translated, each pub lication has a pub. LibIND#2 requires each pub in publication to appear in pub lisher name, but this is not satisfied by our correspondences. Using the correspondence deleting method, we have to delete every correspon dence in the publication. There will not be any more publication instances; thus LibIND#2 will no longer be violated. As with the Scaling Back solution of FD violation repairs, deleting correspon dences loses information. It can be considered worse in this case because all cor respondences of the L.H.S. target group have to be deleted. 6.2.2 Split Correspondence sP [i,. TG , Instead of making sure the L.H.S. of an IND aT = TGk[SPk,I,. ,SP,j c 1 has no instances, we can make sure the R.H.S. contains all the instances of the . . L.H.S.. A split correspondence is actually multiple correspondences all originating from the same source attribute. After translation, all target attributes will have the same set of source attributes. This is a constructive method to correct IND violations. For every mapped attribute TGk.SPk, on the L.H.S. of aT, we find its corresponding source attribute 68 .. j 1 ,SP Figure 6.5: IND Violation Example Fixed by Split Correspondence 1 to the corresponding SGk.SPk,,, and create another correspondence from SGk.SPk, 1 .SF in the R.H.S. of the IND. attribute TG For every unmapped attribute TGk.SPk, on the L.H.S. of . .SP TG , any attributes in SGk to map to TGk.SPk, and 1 U, we need to find In Figure 6.4, our example violates Lib1ND#2. Using the split correspondence method, we create another correspondence to publisher name as shown in Figure 6.5. It is obvious that every translated pub appearing under publication will appear in publisher name. Therefore, LibIND#2 is satisfied. 6.3 Summary of Repairs In this chapter, we have suggested 3 heuristic ways to repair violations. Roll back of correspondences can repair all violations, but it can be undesirable due to loss of infonuation. Concatenation and split correspondences are more constructive, but they may modify translated data. Therefore, administrators have to decide which type of repair to apply, and if the sacrifice is worth the gain in constraint satisfaction. 69 Chapter 7 Conclusion and Future Work Conclusion We have proposed ways of repairing correspondences such that translated instances would satisfy target constraints. In order to repair, we first have to know what we are repairing. Thus we have developed algorithms to detect violations, check propagation of constraints, and check implication of constraints. To make sure our problem is tractable, we have investigated simplified DTD and restricted conditions for constraints such that the above algorithms all run in polynomial time. These efficient results are all on the two most fundamental type of constraints: functional dependencies and inclusion dependencies. Our mappings are generated from high-level correspondences that can be eas ily created by anyone who knows the two schemas well. We suggest repairs on these high-level correspondences. In addition, repairing these high-level corre spondences is easier for everyone to understand and to choose the best method of repairing. With consistent mappings translating data that is consistent with the target con straints, application at the target are guaranteed to run soundly. Future Work This thesis opens up several new problems. Data exchange can be more powerful if data from one source can be relayed to 70 other targets through chains of mappings. Instead of creating mappings between every pair of schemas we want data exchange, we can make use of existing map pings to form a network. Ideally, data exchange is possible if two schemas are connected via a path of mappings. Our constraint propagation algorithm will be more powerful if we can test for constraints propagation along a chain of mappings. To reason on the most basics of constraints: functional dependencies and inclu sion dependencies, we have yet to study some of the other popular constraints such as keys and foreign keys. Considering keys and foreign keys involves more studies of non-leaf nodes, and treating groups as entities. This is quite different from our study of only leaf nodes. We only need to deal with sets of values. However, we believe our studies is a very useful building block towards keys and foreign keys violation detections and repairs. Similar to the idea of relative keys in [6], there can be relative FDs and INDs. For a relative FD, the L.H.S. of it can involve ancestor attributes of the DTD group that the FD is based on. Relative IND is similar in that sequences of attributes on either sides of it can span over several DTD groups that are descendants and ancestors of each other. However, with attributes spanning over multiple DTD groups, our constraint implication that based on reduction to relational constraint implication will no longer work. Therefore, we need to find another implication system that is effi cient. 71 Bibliography [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. pages 20, 45, 46 Addison-Wesley, 1995. ISBN 0-201-53771-0. — [2] M. Arenas and L. Libkin. A normal form for xml documents. In PODS ‘02: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 115—125, New York, NY, USA, 2002. ACM Press. ISBN 1-58 1 13-507-6. pages 10, 16 doi :{http:f/doi.acm.org/1 0.11 45/303976.303983}. — [3] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS ‘99: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 68—79, New York, NY USA, 1999. ACM Press. ISBN 1-58 1 13-062-7. pages 5, 8 doi:{http://doLacm.org/10.1 145/303976.303983}. — [4] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD ‘05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 143—154, New York, NY USA, 2005. ACM. ISBN 1-59593-060-4. pages 5, 8 doi:http://doi.acm.org/1 0.1145/1066157.1066175. [5] A. Bonifati, E. Q. Chang, A. V. S. Lakshmanan, T. Ho, and R. Pottinger. Heptox: marrying xml and heterogeneity in your p2p databases. In VLDB ‘05: Proceedings of the 31st international conference on Very large data bases, pages 1267—1270. VLDB Endowment, 2005. ISBN 1-59593-154-6. pages 4, 14, 18, 19 [6] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Keys for pages 1, 71 xml. Computer Networks, 39(5):473—487, 2002. — [7] M. A. Casanova and V. M. P. Vidal. Towards a sound view integration methodology. In PODS ‘83: Proceedings of the 2nd ACM 72 SIGACT-SIGMOD symposium on Principles of database systems, pages 36—47, New York, NY, USA, 1983. ACM. ISBN 0-89791-097-4. pages 21,46 doi:http:lldoi.acm.org/10.1145/588058.588065. — [8] A. K. Chandra and M. Y. Vardi. The implication problem for functional and inclusion dependencies is undecidable. SIAM J. Comput., 1 4(3):67 1—677, pages 7, 22, 25, 40 1985. [9] 5. Davidson, U. 0. Pennsylvania, W. Fan, C. Hara, J. Qin, and T. U. Propagating xml constraints to relations. In In ICDE, pages 543—554, 2003. pages 7, 17 [101 W. Fan and L. Libkin. On xml integrity constraints in the presence of dtds. J. AcM, 49(3):368—406, 2002. ISSN 0004-5411. pages 1, 7, 11, 13 doi:http:lldoi.acm.org/1 0.1145/567112.567117. — [11] M. Levene and G. Loizou. How to prevent interaction of functional and inclusion dependencies.. Information Processing Letters, 71(3—4):1 15—125, 1999. URL citeseer.ist.psu.edu/Ievene95how.html. —* pages 22, 24, 47 [12] S. Lu, Y. Sun, M. Atay, and F. Fotouhi. On the consistency of xml dtds. Data Knowl. Eng., 52(2):231—247, 2005. ISSN 0169-023X. dol :http://dx.doi.org/1 0.101 6/j.datak.2004.05.007. — pages 11, 12 [13] J. C. Mitchell. The implication problem for functional and inclusion dependencies. Inf Control, 56(3):154—173, 1983. ISSN 0019-9958. pages 7, 22,40 doi:http://dx.doi.org/10.1016/S0019-9958(83)80002-3. [14] L. Popa, Y. Velegrakis, M. A. Hernández, R. I. Miller, and R. Fagin. Translating web data. In VLDB ‘02: Proceedings of the 28th international conference on Very Large Data Bases, pages 598—609. VLDB Endowment, pages 4, 7, 14, 18, 19 2002. — 73
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- On detecting and repairing inconsistent schema mappings
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
On detecting and repairing inconsistent schema mappings Ho, Terence Cheung-Fai 2008
pdf
Page Metadata
Item Metadata
Title | On detecting and repairing inconsistent schema mappings |
Creator |
Ho, Terence Cheung-Fai |
Publisher | University of British Columbia |
Date Issued | 2008 |
Description | Huge amount of data flows around the Internet every second, but for the data to be useful at its destination, it must be presented in a way such that the target has little problem interpreting it. Current data exchange technologies may rearrange the structure of data to suit expectations at the target. However, there may be semantics behind data (e.g. knowing the title of a book can determine its #pages) that may be violated after data translation. These semantics are expressed as integrity constraints (IC) in a database. Currently, there is no guarantee that the exchanged data conforms to the target’s ICs. As a result, existing applications (e.g. user queries) that assume such semantics will no longer function correctly. Current constraint repair techniques deal with data after it has been translated; thus take no consideration of the integrity constraints at the source. Moreover, such constraint repair methods usually involve addition/deletion/modification of data, which may yield incomplete or false data. We consider the constraints of both source and target schemas; together with the mapping, we can efficiently detect which constraint is violated and suggest ways to correct the mappings. |
Extent | 1428416 bytes |
Subject |
Semantics Schema Mappings XML DTD |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-02-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0051351 |
URI | http://hdl.handle.net/2429/4126 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2008-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2008_fall_ho_terence_cheung-fai.pdf [ 1.36MB ]
- Metadata
- JSON: 24-1.0051351.json
- JSON-LD: 24-1.0051351-ld.json
- RDF/XML (Pretty): 24-1.0051351-rdf.xml
- RDF/JSON: 24-1.0051351-rdf.json
- Turtle: 24-1.0051351-turtle.txt
- N-Triples: 24-1.0051351-rdf-ntriples.txt
- Original Record: 24-1.0051351-source.json
- Full Text
- 24-1.0051351-fulltext.txt
- Citation
- 24-1.0051351.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0051351/manifest