On Detecting and Repairing Inconsistent Schema Mappings by Terence Cheung-Fai Ho B.Sc., The University of British Columbia, 2002 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University of British Columbia (Vancouver) October 2008 Β© Terence Cheung-Fai Ho, 2008 Abstract Huge amount of data flows around the Internet every second, but for the data to be useful at its destination, it must be presented in a way such that the target has lit tle problem interpreting it. Current data exchange technologies may rearrange the structure of data to suit expectations at the target. However, there may be semantics behind data (e.g. knowing the title of a book can determine its #pages) that may be violated after data translation. These semantics are expressed as integrity con straints (IC) in a database. Currently, there is no guarantee that the exchanged data conforms to the targetβs ICs. As a result, existing applications (e.g. user queries) that assume such semantics will no longer function correctly. Current constraint repair techniques deal with data after it has been translated; thus take no consid eration of the integrity constraints at the source. Moreover, such constraint repair methods usually involve addition/deletion/modification of data, which may yield incomplete or false data. We consider the constraints of both source and target schemas; together with the mapping, we can efficiently detect which constraint is violated and suggest ways to correct the mappings. 11 Table of Contents Abstract Table of Contents. List of Tables List of Figures H Hβ vi viβ 1 Introduction 1 2 Definitions 2.1 DTD definition 2.2 Simple DTD definition 2.2.1 Choice-free 2.2.2 Acyclic . 2.2.3 Duplicate-free 2.2.4 Unorderness 2.3 DTD Group 2.3.1 Single Path 2.3.2 DTD Group. 2.4 XML Database . . 10 10 11 12 12 13 13 14 14 15 16 16 16 17 17 17 2.4.1 Simplifying text nodes to attributes 2.4.2 Notations 2.5 Missing Values and NULL Values 2.6 XML constraint definitions 2.6.1 Functional dependency 111 2.6 XML constraint definitions 17 2.6.1 Functional dependency 17 2.6.2 Inclusion dependency 18 2.7 Schema Mapping from HepToX 18 2.8 Relational Constraint Definitions 20 2.8.1 Relational Functional Dependency 20 2.8.2 Relational Inclusion Dependency 20 2.8.3 Typed Relational IND 21 2.9 Typed XML Inclusion Dependency 21 2.10 Guaranteeing No Interaction of relational FD and IND Implication 22 2.10.1 Proper Circular 22 2.10.2 Split-freeness 23 2.10.3 Intersection 23 2.10.4 Reduced set of FDs and INDs 24 2.10.5 Theorem that Guarantees Non-Interaction between FD and IND 24 3 Efficient XML Constraint Implication 25 3.1 Reduction of XML constraint implication to relational constraint implication 25 3.1.1 Assumptions 27 3.1.2 XML Database Equivalence () 27 3.1.3 Schema Transformation Function β u(t) 29 3.1.4 Constraint Transformation Function β 30 3.1.5 Data Transformation Functions βu(D), Icβ(DR) . 32 3.1.6 Correctness of XML Constraint Propagation 35 3.1.7 XML to Relational Constraint Implication Reduction . 38 3.1.8 Algorithm to Test XML Constraint implication by Rela tional Constraint Implication 39 3.2 Guaranteeing No Interaction Between XML FDs and INDs . . 40 3.2.1 Retaining Proper Circular Property 40 3.2.2 Retaining Reduced Property 43 3.2.3 Inability to Retain Intersection Property 44 iv 3.2.4 Retaining Split-freeness Property 44 3.2.5 Summary of non-interaction between FDs and INDs . 45 3.3 Efficient implication of XIVIL constraints 45 3.3.1 Retaining Typed property 45 3.3.2 Implication for typed relational INDs 46 3.3.3 Implication for relational FDs 46 3.3.4 Summary of XML Constraints Implication 47 4 Constraint Propagation 48 4.1 Constraint Translation 48 4.1.1 FD Translation 49 4.1.2 IND Translation 50 4.2 Testing Constraint Propagation 52 4.2.1 Testing FD Propagation 53 4.2.2 Testing IND Propagation 56 4.3 Constraint Propagation Summary 58 5 Detecting Constraint Violations 59 5.1 Checking for FD violation 59 5.2 Checking for IND violation 61 6 Resolving Violations 64 6.1 Repairing FD Violations 65 6.1.1 Scaling Back 65 6.1.2 Source Concatenation 66 6.2 Repairing IND Violations 67 6.2.1 Scaling Back 67 6.2.2 Split Correspondence 68 6.3 Summary of Repairs 69 7 Conclusion and Future Work 70 Bibliography 72 V List of Tables 1.1 UBCBookstore 2 1.2 UBCLibra,y 3 3.1 βIβ(LUBCBookstore) 31 3.2 I1(uBcBooksiore) 33 3.3 U(DUBCBookstore) 34 vi List of Figures 1.1 XUBCBookstore to AuflCLjbrary Example 2 1.2 UBCBookstore Instance Example 5 1.3 Translated UBCLibrary Instance Example 5 1.4 Mapping Rules M for UBCBookstoreβUBCLibrary Example . 6 2.1 AUBCBookstore 15 3.1 Equivalent UBCBookstore Instance Example 28 3.2 Schema Transformation Algorithm β i (A) 29 3.3 Creating Transformation Generated Constraints Algorithm β P(A) 30 3.4 FD transformation Algorithm β u (FD) 31 3.5 IND transformation Algorithm β u (IND) 31 3.6 Constraints Transformation Algorithm β p.() 32 3.7 Data Transformation Algorithm 33 3.8 Reverse Data Transformation Algorithm 35 3.9 Algorithm Testing XML Constraint Implication 39 3.10 Example Showing Circular INDs after Transformation 41 4.1 FD Translation Algorithm β 6(Fs,M) 50 4.2 IND Translation Algorithm β ΓΆ(Is,M) 51 4.3 Testing FD Propagation Algorithm 55 4.4 Testing IND Propagation Algorithm 57 5.1 FD violation Checking Algorithm 60 5.2 IND violation Checking Algorithm 62 vii 6.1 FD Violating Example. 65 6.2 Fixing FD Violation by Concatenation Example 66 6.3 Translated UBCLibrary After Repair By Concatenation Mapping . 67 6.4 IND Violation Example 68 6.5 IND Violation Example Fixed by Split Correspondence 69 viii Glossary DR a relational database, 20 DuBcBookstore The XML Database for UBCBookstore, 4 A a DTD definition, 1 4) a function that returns child DTD elements of a DTD element, 11 fl a function that returns DTD attributes of a DTD element, 11 A set of Constraints, 2 operator that states two XML databases contain the same collection of data, 27 function to transform XML to relational, 25 function to reverse transform relational to XML, 25 a DTD element, 11 A a set of DTD attributes in a DTD, 11 acyclic , 12 att a function that returns a attribute of an element, 16 C A Set of Correspondences, 7 choice-free , 11 ix D an XML database, 16 DTD Document Type Definition, 1 DTD group , 15 duplicate-free , 13 E a set of DTD elements in a DTD, 11 FD Functional Dependency, 1 IC Integrity Constraint, 1 IND Inclusion Dependency, I intersection , 23 lab a function that returns the DTD element of an el ement, 16 M a set of HepToX mapping rules, 19 proper circular , 22 r the root DTD element of a DTD, 11 reduced , 24 root the root element of an XML database, 16 split-freeness , 23 typed relational IND , 20 typed XML IND ,21 unorderness , 13 V a finite set of elements in an XML database, 16 x XML Extensible Markup Language, 1 xi Chapter 1 Introduction Extensible Markup Language (XML) has been the dominating data representation format on the Internet. It is self-describing and semi-structured; thus giving appli cations great flexibility to implement and interpret data. These properties make it ideal for data exchange, and it has been the dominating format for data exchange. However, we believe exchanging of data on the Internet does not utilize the full potential of XML. One way to further expand the power of data exchange is to examine the semantics of data in order to create better quality mappings which lead to better data exchange. As with most relational databases, there are integrity constraints (ICs) such as functional dependencies (FD5) and inclusion dependencies (INDs) that give the data more semantics than just tuples of values. There are already studies of ICs in XML such as [6, 10]. Therefore, the next step is to understand how these semantics can benefit data exchange. Although there are many different types of ICs, FDs and INDs are the most used ones. The two assist in storage, update, querying, and transformation of data. As a result, they are very heavily integrated into everyday applications nowadays. If applications are running on data that violates these constraints, they may behave erratically and return unsound and unexpected results. We demonstrate some of the challenges we face using the example in Fig ure 1.1. In Figure 1.1, there are two Document Type Definitions (DTDs), AUBCBookstore and lXUBCLibrary. A DTD is a schema which defines the syntax of XML databases. 1 Figure 1.1: 1UBCBookstore to tXUBCLibrary Example Table 1.1: ZUBCBookstore In our example, the two DTDs are expressed graphically as DTD graphs (DTD trees in our example). Each graph contains nodes as DTD elements, and, for sim plicity, all leaf nodes (nodes without children) are DTD attributes. We have established a few correspondences, shown in arrows, from the at tributes in the source UBCBookstore to the attributes in the target UBCLibrary. These simple 1-1 correspondences We can also define ICs on a DTD. The FDs and INDs for UBCBookstore are listed in Table 1.1, and the ones for UBCLibrary are in Table 1.2. AUBCBooksrore keeps books under different subjects. BookFD#3-9 fl state that there is an isbn that implies all the other attributes within every book. In / BookFD#1: UBCBookstore.person:name β* DOB BookFD#2: UBCBookstore.subject .book.author:name β> DOB BookFD#3: UBCBookstore.subject.book:isbn β* title BookFD#4: UBCBookstore.subject .book:isbn β* year BookFD#5: UBCBookstore.subject .book:isbn β price BookFD#6: UBCBookstore.subject .book:isbn β> #pages BookFD#7: UBCBookstore.subject .book:isbn β* pub BookFD#8: UBCBookstore.subject .book:isbn β* author.name BookFD#9: UBCBookstore.subject .book:isbn β* author.DOB BookIND#1: UBCBookstore.subject .book.author[name, DOB] c UBCBookstore.person [name, DOB] 2) LibFD#1: UBCLibrary.person: name β* DOB LibFD#2: UBCLibrary.person: name β* addr LibFD#3: UBCLibrary.publication: call#.author.name βp call#.author.DOB LibFD#4: UBCLibrary.publication: call#.subject, call#.author.name, call#.year β* title LibFD#5: UBCLibrary.publicat ion: call#.subject, call#.author.name , call#.year β* #pages LibFD#6: UBCLibrary.publicat ion: call#.subject, call# .author.name, call#.year β* pub LibFD#7: UBCLibrary.publisher: name β* addr LibIND#1: UBCLibrary.publication.call#.author[name,DOB] Γ§ UBCLibrary.person[name, DOB] LibIND#2: UBCLibrary.publication [pub] c UBCLibrary.publisher[name] Table 1.2: EUBCLibrary addition, there is a main author listed under every book, and by BookIND#1 , these authors are also listed independently as persons under the root node UBCBook store. Meanwhile, the ZUBCLibra,y has a slightly flatter structure. Each Publication is listed under the root node UBCLibrary, and all its information is listed as attributes under its subtree. In a typical library, subject, author name, and year of a publica tion together determines the call number of a publication, which is a unique number that libraries use to identify publications. Similar to AUBCBooksjo,. above, there is a main author listed under each publication, and the same author is also listed as person under root. Similarly, each publisher listed as pub under publications is also listed under publisher under root. The two heterogeneous schemas have different hierarchies. For example, books in UBCBookstore are grouped by subjects, while subjects in UBCLibrary are listed under publications. There are also attributes that exist in one schema but not the other (e.g. publisher.addr in UBCLibrary). The two schemas also represent heterogeneous content. UBCBookstore keeps a record of books, but UBCLibrary records publications which subsume books. In addition, the retailer background of UBCBookstore is shown by the presence of 3 price which is absent in UBCLibrary for obvious reasons. Heterogeneity of the two schemas is not limited to the representation of the data. The two sets of constraints also show variations. In UBCBookstore, book.title and book.#pages are uniquely identified by book.isbn as specified by the FDs BookFD#3 and BookFD#6. However, UBCLibrary uses publication.subject, publication.authoiname, and publication.year to imply the values of publication.title and publication.#pages as specified in LibFD#4 and LibFD#5. Thus, problems arise if two books having the same subject, author name, and year but different #pages or titles in UBCBookstore are translated to instances of publication in UBCLibrary. The two translated publication instances will vio late the FDs LibFD#4 and LibFD#5. Suppose that UBCBookstore has stacks of unsold books, and decides to donate them to UBCLibrary. Along with the books, UBCBookstore also gives UBCLi brary an XML database DUBCBoo1ore containing information about the books it is donating. DUBCBOOkStOTe conforms to DTD UBCBookstore and its constraints. Upon receiving DUBCBookSjore, librarians at UBCLibrary want to integrate the data into its database which conforms to DTD UBCLibrary and its constraints. The librarians understand that the UBCBookstore data does not conform to its DTD, and it will be tedious work and error-prone to translate the data manually. In addition, there will probably be frequent donations from the UBCBookstore in the future. Therefore, an automated translation framework from data in UBCBookstore to UBCLibraiy will be appreciated. The librarians can create high-level correspondences from the UBCBookstore DTD to the UBCLibrary DTD. Moreover, using methods in [51 and [14], they can generate mapping rules from these correspondences. The mapping rules translate data in UBCBookstore to data that conforms to UBCLibrary DTD but not neces sary its constraints. The XML instance in Figure 1.2 conforms to the DTD βUBCBooksjore and its constraints. With the mapping rules in Figure 1.4 which are generated by algo rithms in [5], the librarians can translate data in Figure 1.2 to the data in Figure 1.3. The translated document in Figure 1.3 conforms to the DTD UBCLibrary, but it does not conform to the UBCLibrary constraints. It is obvious that pub of each 4 UBCBookstore subect person /\ name book book name DOB /\ 19490101 isbn year title pt-ice #pages author pub isbn year title price #pages author pub (701 2(01 Sorviv, 08 20 550 ,,/ \\FtmPm 002 2001 DB 101 70 320 ,i/ \\I9mPmt name DOB name DOB Smith 19400101 Sm,th 19490101 Figure 1.2: UBCBookstore Instance Example UBCLibrary publication publication person /\ call# title #pages pub call# title #pages pub name DOB ___ flrvtvD8 540 FtPmm ___._._rS...DBI1 320 Smith 19490101 subject author year subject author year 08 \ 2001 08 // \ 2001 name DOB name DOE Smith 19490101 Smith 19490101 Figure 1.3: Translated UBCLibrary Instance Example instance of publication is not listed under publishername as LibIND#2 of UB CLibraiy requires. Moreover, the two publications have the same subject, au thorname, and year, but different title and #pages. This violates the FDs LibFD#4 and LibFD#5 of UBCLibrary. Applications in the library use subject, author name, and year to identify pub lication. title, but given data in Figure 1.3 these applications will no longer behave correctly. Clearly, something needs to be done to resolve this problem. Fixing constraint violation after data is translated using existing constraint violation repair technique such as [3, 4] can be undesirable because UBCLibrary may not want to alter data. Correspondences are usually created by data administrators who know both the source and target schemas relatively well. If we can detect constraint violations at the time of correspondence establishment, and provide suggestions on how to repair the correspondences, such administrators are given more information and arsenals to create better quality correspondences. This thesis aims to detect such violations and suggests ways to repair them. 5 1. T.JBCLibraryβ*fl($UBCBookstore)[ personβ+f2($name,$DOB)[ nameβ*$name, DOBβ*$DOB]] UBCBookstoreβ$UBCBookstore[ personβ*$person[ nameβ*$name, DOBβ>$DOB]] 2. UBCLibraryβf1 ($UBCBookstore)[ publicationβ*f3($sname,$aname,$DOB,$title,$year,$#pages,$pub) [ call#.subjectβ*$sname, call#.author.nameβΓ·$aname, call#.author.DOBβ*$DOB, call#.yearβ*$year, titleβ*$title, #pagesβ.$#pages 1] UBCBookstoreβ*$UBCBookstore[ subjectβ$subject[ nameβ*$sname, bookβ*$book[ titleβ*$title, #pagesβ$#pages, author.nameβ*$aname, author.DOBβ*$DOB, pubβ*$pub]]] Figure 1.4: Mapping Rules M for UBCBookstore--UBCLibrary Example 6 Meanwhile, every algorithm has to be efficient in order for our framework to be practical in the real world. Problem Statement Given a set of constraints Ys over source DTD As, a set of constraints -T over target DTD AT, and a set of user-defined correspondences C, determine if V XML databases D5 = s over A5, after translation via C, the translated instances satisfy In addition, if violations exist, how do we fix them? Related Work and Challenges In order to implement an XML constraint translation system, an essential build ing block is solving constraint implication efficiently. However, as shown in [13] and [8], implication of relational FD and IND considered together is undecidable. Therefore, it is challenging to find an efficient XML constraint implication system. [101 shows interaction of XML constraints with DTDs, it finds keys and foreign keys consistency with DTDs undecidable. Thus taking DTD interactions out of the equation is essential to a tractable reasoning with XML constraints. We simplify DTDs in this thesis aiming to avoid such interactions. [9] propagates XML keys to FDs on a predefined relational database via a transformation. It shows XML key and foreign key propagation is undecidable; therefore, it only focuses on XML keys. It further emphasizes the difficulties in the problem of XML constraint propagation. [14] generates join satisfying mapping rules from user defined correspondences if the two schemas have identical representation of entities. It considers join condi tions on both the source and target schemas such that the mapping rules generated satisfy target join constraints. However, both keys and joins in the target can be violated if suspicious correspondences are present. They can be mistakes of the user who defines the correspondences, or they may be unavoidable due to different representations of entities in of the two schemas. The absence of correspondences creates βholesβ in the target. These missing values or NULL values further complicate the propagation problem. In addition, 7 inability to propagate does not necessary imply constraint violations, and these missing or NULL values play a big part in determining violations. Finally, once violations are detected, there are different ways to fix them. Each has its advantages and disadvantages; the combinations of these repairs give users choices to fix violations under different circumstances. Existing constraint viola tion repair techniques such as [3],[4] treat the translated data as one inconsistent database, and either delete, add modify tuples to make data adhere to the target constraints. In trying to make data conform to the target constraints, modification of translated data without knowing anything about the originating data may sac rifice the correctness. In contrast, our solution of correcting the correspondences offers a different approach to preserve the coherence of data with the intelligence of the originating data structure together with its semantics. Contributions 1. There currently does not exist an implication testing system for XML FDs and INDs considered together; therefore, we reduce XML constraint impli cation to relational constraint implication so that we can employ tools in well studied relational constraint reasoning. 2. We develop a polynomial algorithm to test XML constraint implication. 3. We develop a polynomial algorithm to test XML constraint propagation. 4. We develop a polynomial algorithm to detect XML constraint violation. 5. We suggest ways to repair mappings should a violation occur. Organization We present the rest of the document as follows. Chapter 2 lays the foundation for the later chapters by specifying definitions. Chapter 3 describes the reduc tion of XML constraint implication to relation constraint implication, plus how to make the implication problem tractable. We develop the constraint propaga tion algorithm in Chapter 4, and show that it runs in polynomial time. Chapter 5 builds constraint violation detection algorithm on top of the propagation algorithm. 8 Chapter 6 suggests ways of repairing mappings such that constraints will not be vi olated. Chapter 7 concludes our work and offers a glimpse of how this work can be extended. 9 Chapter 2 Definitions 2.1 DTD definition Majority of this definition is adopted from [2]. Definition 2.1.1 A DTD (Document Type Definition) is defined to be: A = (E,A, 4), fl, r), where: β’ E is a finite set of element types. β’ A is afinite set of (non-set) attributes. β’ 4) is a mapping from E to element type definitions: Given r E E, 4)Qr) = where S is a string type, or 4)Qr) is a regular expression a defined asfollows: Ia+ where e is the empty sequence, tβ E, ββ represents union, β,β represents concatenation, β?β represents zero or one occurrence of cc, ββ represents zero or more occurrences of a (Kleene closure), and ββ represents one or more occurrences of a. β’ 11 is a mapping from E to the powerset ofA. If @1 E flQr), we say that @1 is definedfor t 10 β’ r e E and is called the element type of the root. Without loss of generality we assume that r does not occur in 4(t) for any t e E. For example, we can present ZUBCBoojore in Figure 1.1 in the form (E ,A, , H, r), where: β’ E = {UBCBookstore, subject, name, book, author, person} β’ A = {name, isbn, title, year, price, #pages, DOB, pub} β’ = {UBCBookstore β* (subject*,person*),subject β* (book*),book (author) } β’ H = {subject β* (name),book βf (isbn, title,year, price, #pages, pub) ,author (name,DOB),person β+ (name,DOB)} β’ r = UBCBookstore 2.2 Simple DTD definition In order for the algorithms in the rest of this thesis to be efficient, we simplify our DTD definition. As shown in [101, a DTD interacts with XML constraints to produce inconsis tent specifications. That is there does not exist an XML database that satisfies an inconsistent set of DTD and XML constraints. On the contrary, a consistent set of DTD and XML constraints is one that has at least one XML database that satisfies it. To make matters worse, [121 shows that a DTD by itself may be inconsistent too. We wish to guarantee no interaction between DTDs and XML constraints. However, a full-blown study of this topic is beyond the scope of our research. Therefore, we specify choice-free and acyclic DTDs in the following section; they together guarantee DTDs are by themselves consistent as proven in [12]. In addition, [10] shows that cardinality constraints of DTDs interact with XML keys and foreign keys. Therefore, simplifying cardinality constraints should elim inate or reduce interactions between DTDs and sets of XML FDs and 1NDs. We achieve this by imposing the properties duplicate-free and unorderness on DTDs. 11 2.2.1 Choice-free [12] proves choice-free is one of the properties required for a consistent DTD A = (E,A, 4), fl, r). We redefine this property under our context. Choice-free is the elimination of choices (options) of child DTD elements. To eliminate the βchoiceβ of child nodes, we redefine the regular expression allowed for 4)Qr) in our DTD definition A = (E,A,4),fl,r), where t E E. The original regular expression a of 4)Qr) is: a::=eIr!IaIata,aIa?Ia*Ia+S (2.1) The ββ (union) operator gives 4)Qr) the βchoiceβ property. Eliminating ββ from a: a::=cItIIa,aIcx?Ia*Ia+ (2.2) makes our new DTD definition satisfy the choice-free property. 2.2.2 Acyclic [12] shows that a DTD without cycles is another requirement for a consistent DTD. A DTD A = (E,A, 4), fl, r) is cyclic if A contains a DTD Element r e E such that t appears as a descendent of itself. The DTD graph of A would contain a cycle with r being one of the nodes of the cycle. Another benefit of having acyclic DTDs is the elimination of infinite paths. With a DTD containing cycles, it is possible to have a path of elements starting from the root that loops a cycle forever. The elimination of these infinite paths is important because we will later use fully specified paths to define our constraints. Having cycles in DTDs may allow constraints with infinite depth, and infinite num ber of constraints. Both are very undesirable properties, and will make reasoning with such constraints intractable. To incorporate the acyclic property into our DTD language definition, we need to add an extra statement to the definition of the regular expression language for 4)Qr) in (2.2), where βr e E. The new definition is: 12 a::=eIt!Ia,ala?Ia*Ia+ (2.3) where tβ β ancestor(t), and rβ 2.2.3 Duplicate-free We define a new property duplicate-free, that aims to simplify cardinality con straints that can be implied by DTDs. In a DTD definition z = (E,A,4,fl,r), 4βQr) allows a child zβ e E to appear multiple times; therefore, zβ can have minimum and maximum occurrence con straints of any integers. For example, if CP(t) = βUβ, βrβ, βrβ, βrβ?, tβ?, βUβ?, then βrβ has to appear at least three times under βr, and can appear a maximum of six times. In [10], it shows these cardinality constraints can interact with XML constraints to produce inconsistent specifications; therefore, we introduce duplicate-free to simplify cardinality constraints. The idea is simple; we do not allow a child to appear multiple times in a c(β) regular expression. Together with the unorderness property in Section 2.2.4, we will define a new regular expression language for c1 in our DTD definition. The duplicate-free property only allows child DTD elements to have minimum occurrences of either 0 or 1, and maximum occurrences of either 1 or ooβ’ It greatly simplifies the complexity of constraint reasoning. 2.2.4 Unorderness Subelements in XML are ordered. For example: for an root element Root1,and c1 (Rooti) = BC, T1 =<A><B/>.<C/></A> for another root element Root2,and β1β(Root2)= CB, T2=<A ><C/><B/></A> T1 and T2 differ in their order of elements; thus T1 T2. However, current constraints view data as collections of data in which order does not matter. We aim to employ current constraint reasoning techniques; hence, for the rest of this thesis, we will ignore the order of elements. Thus, we achieve unorderness. 13 Together with the duplicate-free property, we redefine the regular expression language for (r) in our DTD definition z = (E,A,4,H,r), where z e B. The regular expression a of (βr) in (2.3) is: a::=eI.rIa,aIa?Ia*Ia+ (2.4) where rβ ancestor(r), and β Now that we can ignore order, and have to honour the duplicate-free property, we can simplify the regular expression language to a list of unique DTD elements, each is followed by a quantifier: a ::= c i [ii? 1*1+1 . . . , ;[1 I? I*+] (2.5) whereVi,je [1. ..n}, andi j, r t t, andt,r1gancestorQr). The quantifier β1β behaves exactly as if there were no quantifiers. It states that the child DTD element has both minimum and maximum occurrence of 1. In Section 3.1.2, we define XML database equivalence that further elaborates on how we ignore order. 2.3 DTD Group We divide the set of DTD elements into groups to give the XJvIL constraints a plat form to apply their logic. This is the same as βgroupβ in [5], and analogous to logical relations in [14]. We are going to define single path in the next section, then use it as a building block to define DTD group in the section that follows. 2.3.1 Single Path Let = {E,A, 1, fl, r} be a DTD defined in Definition 2.1.1. An edge is a relation from a parent DTD element e E to a child DTD element defined in 4), or it is a relation from βr to an attribute defined in fl(t). Such edges are single if it is from an element to an attribute (because we donβt allow set attributes), or if it is between 14 two elements with the cardinality constraint of β?β or β1β. On the contrary, edges between elements with the cardinality constraint of ββ or β+β are multiple edges. Edges link together to form paths, and a single path (SP) is a path that consists of single edges only. In addition, a single path always ends in a DTD leaf text node or a DTD attribute. SP is expressed as a fully specified relative XPath between an ancestor and a descendant. 2.3.2 DTD Group Definition 2.3.1 A DTD group is a group ofDTD elements and attributes that are all connected by single paths. A DTD group guarantees one-to-one relations within a branch of elements and attributes in an XML database that have their DTD elements and DTD attributes within the same group. Such a branch of elements and attributes is analogous to a relational tuple. UBCBookstore , subject person7* /\ name book name DOB isbn title year price #pages author pub /\ name DOB Figure 2.1: βUBCBookstore For the rest of this thesis, we will use G to denote a DTD Group. And we label each group with the path name of the highest node t, where cP(parentQr)) contains 15 either r* or tΓ·. We also conveniently call the DTD Group Node of G. Figure 2.1 shows the 4 DTD groups in UBCBoofore For example, subject and book do not belong to the same DTD group because the edge between subject and book is not a single path; it has one of the quantifiers that allow multiple occur rences of books in each subject. On the contrary, there is a single path between book and DOB; therefore, they are in the same DTD group. 2.4 XML Database We adopt the following definition of XML databases from [2]. An XML database D conforming to a DTD z (E,A, c, H, r) is defined to be a tree (V,lab,ele,att,root), where β’ V is a finite set of vertices (nodes). β’ lab: V β E. β’ ele: V_*StringUV*. β’ att is a partial function V x A β* String. For each v e V, the set {l e A I att(v,l) is defined} is required to be finite. β’ root e V is called the root of d. 2.4.1 Simplifying text nodes to attributes Since all the leaf nodes are either DTD elements containing text or DTD attributes, for simplicity without loss of generality, we assume all leaf nodes are DTD at tributes for the rest of this document. 2.4.2 Notations For any element x E V in D labelled by E E, lab(x) = t, and any path P origi nating from βr, x.P returns all elements that are descendents of x and are reachable from x via the path P if P ends in a DTD element, it returns string values that are values of attributes reachable from x via the path P if P ends in a DTD attribute. For example, letβs call the only subject element in Figure 1.2 x. Then, x.book.price returns β120β,β70β. 16 2.5 Missing Values and NULL Values For the rest of this thesis we treat missing values as if they are NULL values. NULL values are known to be incomparable; that is they do not equate to any values, not even to other NULL values. NULL values are going to play a big role in how we treat FDs and INDs. 2.6 XML constraint definitions 2.6.1 Functional dependency An XML functional dependency, FD has the form: where t e E, and βr is a DTD group node. In addition, SIβ1,. . . are single paths originating from t and ending in attributes. An XML database D=(V, lab, ele, att, root) satisfies denoted: DI=opDiff Vx,y e V,lab(x) = lab(y) = A (x.SF =y.SP) β*x.SP y.SP iEl Note that we cannot equate NULL values to any other values; thus NULL existence on the L.H.S. guarantees FD satisfaction. In [9], FDs are defined in participation of them turning into keys. Such keys are then used to normalize relational schemas. Since keys must be able to identify tuples, imcomplete keys are unacceptable. Therefore, [9] requires an instance to have NULL value for the R.H.S. of an FD if there exists NULL value among the values of the L.H.S. attributes. However, unlike [9], we want to reason only with the most fundamental form of ICs only, FDs and INDs. Therefore, we can concentrate on these two types of ICs only when we consider NULL values. FDs and 1NDs reason with leaf values only, but not node identities as in keys. Consequently, an instance with NULL values on the L.H.S. of an FD guarantees satisfaction. On the other hand, if the 17 valuation of the R.H.S. in an instance is NULL, we treat it as a unique value, and only require its valuation of the L.H.S. to differ from all other instances. 2.6.2 Inclusion dependency An XML inclusion dependency, aIND has the form: ri[SP1, ,. ..,SP1,,2Γ§2[SP,i,. . .,SP2,nl where βr1,t2 e E, and βr1,βr2 are DTD group nodes. In addition, SIβ11,.. .,SP1 and SP2,1. . . ,SP2, are single paths in rj and βr2 respectively. An XML database D=(V, lab, ele, att, root) satisfies aIND, denoted: D 1 aIND if Vx e V,lab(x) = ti,y e V,lab(y) = βr2,s.t. A x.SP1,, = y.SP2,1 iEl,...,n Note that missing values or NULL values are not allowed on either side of an IND because such values are incomparable; thus would violate the IND. 2.7 Schema Mapping from HepToX In order to translate data from a source DTD ts to a target DTD βIT, we need a mapping language to define how to translate. Both HepToX [5] and Clio [14] can derive schema mappings from simple correspondences drawn as arrows from attributes of source schema to those in the target. From correspondences, [141 generates inter-schema referential constraints in a form similar to SQL statements. Although, such constraints definite how data translate, they are too technical and thus are difficult to analyse. On the other hand, users in [5] create correspondences between source and target leaf nodes. These correspondences are used to generate free-log mapping rules that define how data from the source can be translated to the target. These mapping rules are in the form of tree expressions, as such they are more suitable for analysis for the rest of this document. 18 However, our mapping model differs slightly from [5] and [14] in that we do not employ the ID/IDREF function of DTD. In other words, we do not have key and foreign key constraints in our model. histead we deal with the more fundamental constraints: FD and ND. Thus there are no joins among DTD groups, and our mapping rules only translate instances along a DTD tree path (attributes along a path from the root to the group node in question). We give a brief outline here on how simple correspondences in our running example in Figure 1.1 become a set of mapping rules M in Figure 1.4 following algorithms in [5]. The target DTD UBCLibrary is broken into 3 DTD groups, the groups can be seen as logical relations in which attributes (plus leaf text nodes) together describe an entity. The 3 DTD groups are publication, person, and publisher, and they each become the target of a mapping rule except for publisher because none of its attributes are mapped. On the other hand, the source schema is broken into 3 groups: subject, book, and person. In each mapping rule, if the target group has any correspondences that originates from a particular source group, then the source group is included in the rule body of the rule along with its ancestors. Each mapped attribute in the rule body is bound to a unique variable, and its correspondence in the target is assigned that variable. This has the same function as in SQL where tables are bound to some variables in the FROM clause, then some of the variables appear in the SELECT clause to create view instances by retrieving data from the tables in which the variables are bound. In mapping rule #1, every personβs name and DOB from the source is put into name and DOB in person in the target. The rule is very straight forward. However, notice the target root is assigned a Skolem function fl($UBCBookstore), and $UBCBookstore is the binding variable of the root element in the source; thus translated instances will always be stitched together and have only one root. The function of our Skolem function is to assign non-leaf nodes an identity. Thus two non-leaf nodes with the same identity (same Skolem function with the same arguments), will be merged into one node. Mapping rule #2 takes book instances from the source, and put them into pub lication at the target. The rule also acquires the name of subject which is in the 19 parent group of publication, and puts that into attribute subject under publication in the target instance. 2.8 Relational Constraint Definitions In this section, we quickly define the relational constraints: FD and IND; [1] pro vides a more comprehensive explanation of FD and IND. We will be using rela tional FD and IND implication to solve the XML FD and IND implication problem; therefore, we will introduce them in the following sections. In addition, in order to solve our XML constraint implication problem effi ciently, we use typed relational IND which can be solved in polynomial time. We will also define typed relational IND in this section. 2.8.1 Relational Functional Dependency Let R be a relational schema, andA1,.. . be n+l attributes in R. A relational FD has the form: R.Ai...Aβ*A A relational database DR conforming to R satisfies crR if V tuples t1 , t2 in table R of A (ti .A =t2.A) β* t1 .A =t2.A iEl ,...,n 2.8.2 Relational Inclusion Dependency Let R1,R2 be two relation schemas in a database schema R, A1,... ,A be n at tributes in R1, and B1,... ,B be n attributes in R2. A relational IND 0R has the form: R.A.. CR2B.B A relational database DR conforming to R satisfies 0R if V tuple t1 in table R1 of DR there exists a tuple t2 in table R2 of DR such that: A (t1.A=t2B) iEl,...,n 20 2.8.3 Typed Relational IND A more comprehensive study of Typed Relational IND can be found in [7]. We give a quick definition and description of Typed Relational IND here. Definition 2.8.1 Let R1 , R2 be two relation schemas in a database schema R, A1,.. . ,A, be n attributes in R1, andB1,... ,B be n attributes in R2. A typed relational IND R has the same form as a typical relation IND: R1.A.. CR2B.B The βtypedβ property requires Vi E [1, . . . , n] A1 = B1. Moreover, any two DTD attributes A ,A within each relation schema Rβ e R must not be of the same attribute type. A relational database DR conforming to R satisfies aR just as it would for a typical relational 1ND. In more details, there exists a pool of unique attributes U from which each relation schema R1 , R2 gets a subset of attribute. As a result, attributes of a relation schema are unique within the schema. Moreover, if A=B1,where A1 is an attribute of R1 and B1 is an attribute of R2, then A1 and B are the same attribute in U. 2.9 Typed XML Inclusion Dependency We model each DTD group as an entity, same as each relation schema represents an entity. In order to integrate the typed property, each DTD attribute appears at most once in a DTD group. Moreover, pairs of coffesponding attributes on the L.H.S. and R.H.S. of an XML IND must be of the same DTD attribute type. We now give a formal definition on how to integrate the typed property into a DTD A. Definition 2.9.1 Let A = (E,A,4,fl,r), Vtr β¬ E, if t,tj belong to the same DTD group G, then fl(t) flfl(t) = 0. Definition 2.9.2 Let o be an XML IND in the form: [SP1,i, . . . ,SPi,] Γ§ t[SP2,1... ,SP2] 21 where t1r2 e E, ir2 are DTD group nodes, SPi,l,...,SPl, andSP2,1...,SP are single paths in r1 and T2 respectively. 0β’ is typed if A (βri.sP1, 2.10 Guaranteeing No Interaction of relational FD and IND Implication In order to find an efficient XML constraint implication system, we first look at the implication of the simpler relational constraints. As mentioned earlier, implication of FDs and INDs together is undecidable [8] [13]. The complication of the implication comes from the interaction between the two kinds of constraints. Thus if we can avoid the interaction of FDs and INDs, it will be a big step towards making the implication problem tractable. [11] proves it is possible to prevent such interaction albeit imposing some re strictions on the FDs and INDs. We quote each condition as defined in [11] in the following sections, and give each a short description. Finally, in Section 2.10.5, we present the theorem from [11] that guarantees non-interaction. 2.10.1 Proper Circular Definition 2.10.1 A set of INDs I over relational schemas R is circular if either there exists a nontrivial IND R[X] C R[YJ e I, where X,Y are arrays of attributes in R, or there exists m distinct relation schemas, R1,R2.. .,R, e R, with m> 1, such that I contains the INDs: R1[X] Γ§R2[Y],RXjC R3[Y] Rm[Xm] C Ri[Yi], where X1, Y1 are sets of attributes in R1. A set of INDs I is noncircular if it is not circular Definition 2.10.2 I is proper circular if it is either noncircular or whenever there exit m distinct relation schemas, R1,R2. . .,Rm e R, with m> 1, such that I contains the INDs: R1[XiJ Γ§R2[Yj,RXJc3[Y1,...,Rm[Xml c Ri[YiJ, then for all i E { 1,2,. . .,m} we have X = Yj; such a sequence ofINDs is called a proper cycle. 22 Intuitively, I is circular if a subset of its INDs forms a circle in a graph (l<E), where each relation R is a vertex 1/, in V and each IND R,[X1] R[Yj is an edge from V1 to V1. I satisfies proper circular if I is not circular, or if a cycle exists, at each vertex, the sequences of attributes X, Yj connecting the two edges are two identical sequences of attributes. 2.10.2 Split-freeness Two nontrivial FDs of the forms R:XB β* A and R:YA β B, where R is a relation schema, X, Y are sets of attributes of R, and A,B are two attributes of R, are said to be cyclic. A closure of a set of FDs F, denoted F+, is a set of FDs that contains every FD a where F = a. Definition 2.10.3 A set of FDs F over R satisfies the split-freeness property if VFj e I whenever there exist cyclic FDs, R:XB β* A, YA β B e F, then either R3:Y β* B e or R:(XflY)A β B e Fj. Roughly speaking, a set of FDs satisfies split-freeness if it does not imply two FDs where R.H.S. of each of these two FDs is contained in the L.H.S. of the other FD. 2.10.3 Intersection Two nontrivial FDs of the forms R:X β A and R:Y β* A, where R is a relation schema, X, Y are sets of attributes of R, and A,B are two attributes of R, are said to be incomparable if X and Y are incomparable. Two sets are incomparable if either is a subset of the other. Definition 2.10.4 A set ofFDs F over R satisfies the intersection property ifVF1 E E VA e R, whenever there exist incomparable FDs, R:Xβ*A,R: Yβ*A e P7, then R:XflYβ*A E P7. Intuitively, if a set of FDs satisfies the intersection property if it does not imply two non-trivial FDs that imply the same attribute. 23 2.10.4 Reduced set of FDs and INDs Definition 2.10.5 The projection of a set of FDs I over relation schema R onto a set of attributes Y C R1, denoted by P[YJ, is given by Fj[Y] = {R1:W β* Z I R1:W βZFandWZC Y}. Definition 2.10.6 A set ofattributes Y Γ§ R is said to be reduced with respect to R1 and a set ofFDs I over R (or simply reduced with respect to Fj ifR1 is understood from context) if i9[Y] contains only trivial FDs. A set of FDs and INDs = F U I is said to be reduced ifVR1[X] R3[Y] E L Y is reduced with respect to F1. We illustrate the reduced property with an example. Our example consists of two relation schemas author[name,DOB] and person[name,DOB], an ND a1 = author[name,DOB] C person[name,DOBI, and an FD aF person : name β* DOB. a1 guarantees that any tuples in author must have an equivalent tuple in person. aF guarantees names imply DOBs in such tuples in person. This in turn guarantees names imply DOBs in all tuples in author. 2.10.5 Theorem that Guarantees Non-Interaction between FD and IND Taken from [11], the following theorem states the conditions needed to prevent interaction: Theorem 2.10.7 If I (a set of INDs) is proper circulai F (a set of FD5) satisfies either the intersection property or the split-freeness property and = (F UI) is reduced then F and I do not interact. 24 Chapter 3 Efficient XML Constraint Implication In order to achieve efficient XML FD and IND implication, we first reduce XML FD and IND implication to relational FD and IND implication in Section 3.1. How ever, Chandra et al. [8] proved relational FD and IND implication is undecidable. Therefore, in Section 3.2, we introduce conditions on XML FDs and 1NDs that ensure the transformed relational FDs and INDs do not interact. 3.1 Reduction of XML constraint implication to relational constraint implication Relational constraint reasoning is a well researched area. Being able to reduce XML constraints to their relational counterparts can resolve a lot of uncertainties involving XML constraints. In this section, we aim to reduce the XML FD and 1ND implication problem to the relational FD and IND implication problem. We specify the problem of XML FD and 1ND implication here: Problem Statement Let iX be a DTD, be a set of FDs and INDs on z, and aA be an FD or ND on . Determine if E = a. We first construct a schema transformation function !-tschema that takes i and 25 outputs a set of relation schemas R, denoted: R = !.Lschema(1) In Section 3.1.3, we will define and explain this schema transformation function. Next we construct a constraint transformation function Lco,(raj,g takes and produces 0R that conforms to R, denoted: I-Lconstraint (o) We will define and explain this constraint transformation function in more detail in Section 3.1.4. When given a set of XML constraints on z, we define: I.1constraints(Y.) {ILconstraim(02,) O E I-tcons:raints repeatedly calls I-tconsrraint on each constraint in , and outputs the set of relational constraints. We also develop a data transformation functionI1data that takes a DTD z and an XML database D that conforms to t, and output a relational database DR that conforms to R =I1schema (s), denoted: DR = !1data(1,DDe1ta) iLdata will be used in our proof that XML constraint implication can be reduced to relational constraint implication. We will construct Udata in Section 3.1.5. For simplicity, we overload the .t operator so that J.L is ILschema when the ar gument is a DTD, I-1constraint if the argument is an XML FD or IND,I1constrainis if the argument is a set of XML constraints, andI1data if the argument is a DTD and XML database. We then prove that XML constraint implication can be reduced to relational constraint implication in Section 3.1.7. After proving the reduction is correct, we then present an algorithm to solve XML constraint implication with relational con straint implication in Section 3.1.8. 26 3.1.1 Assumptions 1. DTDs are simple DTDs as defined in Section 2.2: choice-free, duplicate- free, and acyclic. 2. For simplicity, we assume for all DTDs z = (E,A, 1, fl, r) the DTD group including root r has no attributes. 3. All the XML ENDs are typed. 4. Since all XML 1NDs are typed, Definition 2.9.1 guarantees all attributes within a DTD group are unique. Therefore, for simplicity, we can assume without loss of generality that each DTD group has only one element - the group node. 5. All the leaf nodes are attributes. 3.1.2 XML Database Equivalence () To simplify the XML model for constraint reasoning, we ignore order of ele ments. Thus we need a database equivalence operator () that determines two XML databases D1 = (Vi,labi,elei,atti,rooti) on DTD z =(E1,Ai,1fli,ri) and 132 = (V2,lab ele2,att2,root2) on DTD A = (Ei,A1,41fll , ri) to be equiv alent if their content is the same but their order of elements may be different. Here we define XML database equivalence (): Two XML elements x E V1 ,y e V2; x y if: 1. node-name(x) = node-name(y) 2. Vl e IT1(labi(x)), l efl2(laby)),s.t. atti(x,l) =att2(y,l) 3. V1 e112(laby)), 1 Efl(labi(x)),s.t. att2(y,l) atti(x,l) 4. Vxβ E child(x), yβ e child(y) s.t. xβ yβ 5. Vyβ E child(y), SIxβ e child(x) s.t. yβ 27 UBCBookstore peroon name book book name DUB /\ \lh 19490101 isbn year title price #pages author pub isbn year title price #paes author pub 902 200)1 06101 70 320 ,,/ \mPm (I 2001 SurviveflB 20 55i1 ,/ \moPr name DOB name DUB Smith 19490101 Smith 19490)01 Figure 3.1: Equivalent UBCBookstore Instance Example D1 D2, if rooti root2. For example, the XML database in Figure 1.2 and the one in Figure 3.1 are equivalent. They differ by the order of the two books. The two databases would both satisfy or both not satisfy any set of FDs and 1NDs. XML FDs and INDs do not put constraints on the order of data; therefore, we can derive the following Lemma: Lemma 3.1.1 For any Iwo XML databases D1 and D2, if D1 D2, then V XML FDsorINDsa,D1 iffD2fr-:ta. Proof The operator is symmetric; therefore, we only need to prove one direction of βif and only ifβ. We will prove by contradiction. For the proof on FDs, let the constraint a be an FD. Assume D1 a but D2 = a. D1 must contain two elements x,y with the same L.H.S. attributes of a, but they have different R.H.S. attribute of a. Since D2 D1, D2 must contain two elements xβ,yβ s.t. xβ x and yβ y. As a result, .xβ contains the same set of attributes as x, and yβ contains the same set of attributes as y. xβ and yβ would cause D2 to violate a. This violates the hypothesis; thus Lemma 3.1.1 holds for FDs. For the proof on INDs, let the constraint a be an IND. Assume D a but D2 j= a. D1 must contain an element x whose attributes fit the L.H.S. of a, but D1 must not contain an element y that satisfies R.H.S. of a with respect to x. Since D1 D2, 132 must contain xβ s.t. xβ x. Since D2 = a, D2 must contain yβ s.t. yβ satisfies R.H.S. of a with respect to xβ. Again because D1 D2, D1 must contain an element that is equivalent to yβ, such an element would satisfy R.H.S. of a with respect to x. This violates the hypothesis; thus Lemma 3.1.1 holds for INDs too. 28 I3.1.3 Schema Transformation Function β j.L(A) The XML schema transformation function ji(A) in Figure 3.2 is designed to trans late a DTD A into a set of relation schemas R that retains the DTD group structure of A. We Input: DTDA= (E,A,,fl,r) Output: A set of Relation Schemas R 1. R={} 2. For each group G1 e E, G, r 3. create a new relation schema R 4. create a new field for each attribute in G, and label it with the SF of the attribute 5. create a new field βidβ 6. create a new field βpidβ 7. R=RU{R1} Figure 3.2: Schema Transformation Algorithm β 11(A) In Line 3, we create a new relation R1 for each DTD group G1 because we see each DTD group as a schema describing an entity. All the attributes within a DTD group are the properties of the entity that this group is describing; therefore, we create a new field for each attribute within G1. We then create the fields βidβ and βpidβ to retain the parent-child relation among DTD groups. Such relations will be expressed as 1NDs as transformation generated constraints. Running this algorithm against our example DTD AUBCBOOkSjQTe in Figure 1.1, we will generate three relation schemas: subject[id,pid,name] book[id,pid,isbn,title,year,price,#pages,author.name,author.DOB,pub] person[id,pid,name,DOB] They correspond to the three DTD groups in AUBCBooksjore. Every attribute in a DTD group becomes a field in the corresponding relation; in addition, βidβ and βpidβ are present in every relation to retain the parent-child relation of DTD groups. 29 Transformation Generated Constraints β During schema transformation, parent-child relations can be retained by generating an JND for each child DTD group whose βpidβ is subsumed by its parentβs βidβ. In addition, βidβ is the node identifying attribute; thus it implies all the other attributes within a DTD group. Input: DTDA= (E,A,,fl,r) Output: A set of 1NDs P() 1. βIβ(A)={} 2. For each DTD Group G e E 3. ifparent(G1)&4ΓΈ 4. βP(A) = W(z) U {R[pid] c parent(R)[id]} 5. For each DTD Attribute A e 11(G) 6. βP(z)=βP(z)U{R:idβ*A} Figure 3.3: Creating Transformation Generated Constraints Algorithm β We can apply βP to our example DTD UBCBooksjore in Figure 1.1, and will result in the transformation generated constraints in Table 3.1. The field βidβ in each relation implies every other attribute within the relation. In addition, book is a child group of subject, and we have the IND book[pid] c subject [idl that captures the parent-child relation. 3.1.4 Constraint Transformation Function β Constraint transformation function t (ox) takes an XML constraint of either FD FA, or IND I, and translates it to relational constraint FR (Figure 3.4) or βR (Figure 3.5) respectively: For example. the FD BookFD#1 on DTD group person in Table 1.1 is: UBCBookstore.person: name β* DOB 30 person: id β* name person: id β* DOB person : id β pid subject : id β name subject : id βp pid book: id β* isbn book: id β title book: id β* year book: id β price book: id βk #pages book: id β* author.name book: id β> author.DOB book: id β* pub book: id β* pid book[pid} C subject [id] Table 3.1: βJ.β(IXUBCBoojore) Given an XML FD F: Gk : SPk,1, . . , SP,,fl] β SPk,, Eu(F)=Rk:SPk,1,...,SPk,flβ*SPk,U Figure 3.4: FD transformation Algorithm β i (FD) u (BookFD#1) is on the transformed relation person: person : name β DOB For another example, the IND BookIND#1 on DTD groups book and person in Table 1.1 is: UBCBookstore.subject.book.author[name, DOB] Given an XML IND I: Gk[SPk,1,. . , SPk,fl] c . . , = Rk[SPk,1, , SPk,fl] Γ§ Ri[SP1, , SPi,n] Figure 3.5: 1ND transformation Algorithm β t (IND) 31 Input: DTD A = (E,A,βF,fl,r), and a set of XML constraints on A Output: A set of FDs + INDs u (E,) 1. iβ(){} 2. For each a E ZA 3. (ZL(ZA)UI1(o) Figure 3.6: Constraints Transformation Algorithmβsi (Z) c UBCBookstore.person[name, DOB] t (BookIND#1) is on the transformed relations book and person: book[author.name, author.DOB} person [name, DOB] The constraint transformation function u defined in this section takes one XML constraint a at a time. We are going to overload the function i so that it takes a set of XML constraints E as argument. The function i(E) calls i(a) for each a e Z, and returns the results as a set. In Figure 3.6, we also define j.i(Z,) that transforms a set of constraints on A. It transforms each ΓΈA e Z by iL(a) in Figure 3.4 and Figure 3.5, then returns the set of resulting constraints. If we apply this constraints transformation function to FDs and INDs in UBC Bookstore Table 1.1, it will result in the set of constraints in Table 3.2. 3.1.5 Data Transformation Functions β u(DA), rβ(DR) As shown in Figure 3.7, the data transformation function i takes an XML database D, that conforms to DTD A and produces a relational database DR that conforms to the transformed relational schema L(A). In Lines 2,3, the algorithm creates a new tuple for every group element in D. It creates a unique ID for the tuple and put it in the field βidβ in Line 4; and put the group elementβs parent ID into βpidβ in Line 6. This retains the parent-child relation among the group elements. In Lines 8-11, it puts all attributes in the group in their corresponding fields in the tuple, and puts NULL in the fields for the attributes that do not exist. 32 person : name β* DOB book: author.name β* author.DOB book: isbn β* title book: isbn β* year book: isbn β* price book: isbn β* #pages book: isbn β* pub book: isbn β* author.name book: isbn β author.DOB book[author.name, author.DOB] C person[name,DOBI Table 3.2: β (ZUBCBookstore) Input: DTD A = (E, A, 1, fl, r), XML database DA=(V,lab,ele,att,root) Output: Relational database DR 1. create tables R1,. . . ,R,, for DR according to ji(A), where n = j ji(A) 2. for each group node e E V, where lab(e) = G1,and G, r 3. create a new tuple tin table R, 4. create a unique ID UID(e) and put it in t[id] 5. if e has a parent node parent(e) and parent(e) r 6. t[pid] = UID(parent(e)) 7. else t[pid] NULL 8. for each attribute A E fl(lab(e)) 9. if e.A does not exist 10. then t[A] = NULL 11. elset[A]=e.A Figure 3.7: Data Transformation Algorithm For simplicity, if A is understood, we shorten the notation from u (A, D) to t(D) for the rest of this document. We demonstrate the data transformation on the XML database DUBCBOOkSIOTe (Figure 1.2). After translation, we get the three tables in Table 3.3. In Figure 1.2, there are two book instances with isbn β001β and β002β. u (DUBCBookstore) put these two books into two different tuples in the table book. Notice each tuple has a unique 33 ID in its βidβ field. The two books in Figure 1.2 are under a subject node, which has a name attribute βDBβ. The two corresponding book tuples have the unique ID of the subject tuple in their pid fields; therefore, it keeps record of the parent-child relation among groups. H name IDBid pidIDII NULL (a) Translated Table βsubjectβ id Ipid β’sbn year title Iprice #pages uthor.namc uLhor.DOB pub D2 jIDi 001 2001 urviveDB 120 550 Smith 19490101 IFreePress D3 jIDi 002 12001 PB 101 70 320 Smith 19490101 IFreePress (b) Translated Table βbookβ id pid name DOB ID4 NULL Smith 19490101 (c) Translated Table βpersonβ Table 3.3: ji (DUBCBootore) On the other hand, in Figure 3.8, data transformation function 4u takes a relational database DR that conforms to relational schema ji (A), and produces an XML database D, that conforms to A. For simplicity, if A is understood, we shorten the notation from u β β(A, DR) to U (DR) for the rest of this document. The reverse data transformation algorithm jrβ in Figure 3.8 does the reverse of J.L in Figure 3.7. Given output from u, it produces XML databases that are equivalent () to the input of j.t. Studying the two algorithms, we can easily derive the following propositions: Proposition 3.1.2 V XML databases on DTD D, 34 Input: a DTD A, a Relational Instance DR Output: XML database DA that conforms to A 1. create an XML node root 2. for each table R e jt(A) 3. for each tuple t, E R1 4. create an XML element 5. for each attribute A E t 6. ift.ALNULL 7. e1[dj,([1d].A t.A 8. for each ed,Id created above 9. ifpid=NULL 10. then make ejd,pjd a child of root 11. else make eid,pid a child of some Figure 3.8: Reverse Data Transformation Algorithm Proposition 3.1.3 V relational databases DR on a set of relation schenws R DR 3.1.6 Correctness of XML Constraint Propagation With the constraint transformation functions defined in Section 3.1.4, we first show the preservation of FDs and 1NDs of the constraint transformation functions both from XML constraints. Then using this result, we prove the sound and complete ness of using relational constraint implication to imply XML constraints. Preservation of FDs We first claim that any XML FD is preserved by the constraint transformation. We show our claim must be correct by contradiction. If an FD F were violated after propagation, there must exist two tuples in the transformed database .t (DA) with the same set of L.H.S. attributes of the transformed FD (F), but different R.H.S. 35 attributes. Since every tuple in U(D) is traced from a group in the XML Database D, on zS that conforms to F, there. must exist two group instances in D, that have the same L.H.S. attributes of F, but different R.H.S. attributes. This would violate the assumption that D, conforms to F; thus, our claim holds. Lemma 3.1.4 VXlvlLdatabasesDA = {V,lab,ele,att,root} that conforms toDTD z, VXMLFDE fDA 1= 1 then ,i(z,D)I= ji(F). Proof F has the form: tk:SPk,1,.. . β* SPk,. If D z F, then V x,yeV, where lab(x) = lab(y) = βukβ and A X.SPk,1 = y.SPk,,, then X.SPk, = Y.SPk,u. iEl,...,n Meanwhile, u(F)=Rk[SPk,1,. . . ,SPfl]β*SPu. If our claim were wrong, there would exist t1,t2 e Rk(l.t(DA)) such that A tl.SPk,j = t2.SPk,, andt1.SPk, # iEI n t2 .SPic,u. Since 11 (D) is transformed from DA, the tuple ti, and t2 must have their respective corresponding group nodes x,y such that A t1 .SPk,j = X.SPk,i and iEI n,u t2.SPk,i = y.SPk,I. This shows x,yeV, where lab(x)=lab(y)=;, and A X.SPk,j = iEl n Y.SPk,i, but x.SPk,u Y.SPk,u. Thus it contradicts D = F, and our claim holds. β’ Next we show that the reverse transformation preserves FDs. We prove by contradiction. If an FD F were violated after reverse propagation, there must exist two group nodes x, y in the reverse transformed database U (DR) with the same set of L.H.S. attributes of the F, but different R.H.S. attributes. Since every group in rβ (DR) is traced from a tuple in DR on that conforms to ji(F), there must exist two tuples in DR that have the same L.H.S. attributes of s(F), but different R.H.S. attributes. This would violate the assumption that DR conforms to ji(F); thus our claim holds. Lemma 3.1.5 V relational databases DR conforms to the set of relaion schemas ji(A), where zisaDTD, VXMLFDsE fDR j=t(F), then1(DR) j=F. Proof Let F be r,,:SPk,l,. .,SPt β* SPk,. Then ji(F) is Rk[SPk,j,. . .,SPk,fl] β> SPk,U. If DR 1= .i(F), then Vt1 ,t2 E Rk(DR), if A ti .SPk, = t2.SPk,i, then t1 .SPk, = iEl ,...,n t2.SPk,. 36 Let Wβ(DR) = (V,lab,ele,att,root). However, if our claim were wrong, there would exist x,y e V, where lab(x) = lab(y) = tk, and A X.SPk,j = Y.SPk,i, but X.SPk, Y.SPk,u. Since all group nodes in rβ (DR) are transformed from DR, the group nodes x,y must have their respective tuples t1 , t2 such that A ti .SPk,1 iEl ,...,n,u X.SPk,, and t2.SPk, = Y.SPk,i. This shows t1,t2 Γ© R,(D), and A tI.SPk,Γ = EEl ,..,n t2.SPk,1,butt1 .SPk. # t2.SPk,u. Thus it contradicts DR 1= it(F), and our claim holds. I Preservation of INDs We then claim that every XIVIL IND is preserved by the constraint transformation. Data transformation transforms every group in an XML database DA to a tuple in the transformed relational database u (D). In addition, every tuple in 1 (DA) comes from a group in D. Therefore, if D satisfies an XML IND 1, then every group that appears on the L.H.S. of I must have a corresponding group on the R.H.S. of I. All of these groups and only these groups are transformed to L.H.S. and R.H.S. tables of ji(i) respectively. Therefore, i (D) satisfies .t (I). Lemma 3.1.6 V XML databases D = {V, lab, ele, att, root} that conforms to DTD z, V XML INDs I, fD 1=1, then u(D) j= 4u(I). Proof I has the form: rk[SPk,1,. . . ,SP,fl] Γ§ t[SP1,.. . ,SP1j. If D1 = I, then Vx e V, where lab(x) = βCk, y e V, of which lab(y) t, and A x.SPk,1 = y.SP1, jEl ,...,n Meanwhile, i(I) = Rk[SPk,1,.. . ,SP,,,j R1[SPj, ,. . . ,SF]. And t(D) ensures every group node z E V of group type t is transformed into a corresponding tuple t in R (ji (D)) with all the attributes of z put into the respective fields in t, and every t in R(i(D)) is traced from some group node z e V of group type t. Therefore, every t1 e Rk(1(DA)) is traced from an x e V, where lab(x) = Tk. Together with the fact that every y e V, where lab(y) = t, is transformed intoR1(ji(D)), i1(D) 1= t(I) is satisfied. I Next we show that the reverse transformation preserves INDs. Reverse data transformation transforms every tuple in a relational database DR to a group in the 37 reverse transformed XML database ,r1 (DR). In addition, every group in uβ(DR) comes from a tuple in DR. Therefore, if DR satisfies a transformed IND u(I), then every tuple that appears on the L.H.S. of t (I) must have a corresponding tuple on the R.H.S. of u (I) to. All of these tuples and only these tuples are reverse transformed to L.H.S. and R.H.S. groups of I respectively. Therefore, J4_1 (DR) satisfies I. Lemma 3.1.7 V relational databases DR that conforms to relational schema u (s), VXMLINDs1 ifDR 1= i(I), then 1rβ(DR) =1. Proof Letlbeβrk[SPk,l,. ..,SPkn] Γ§ Then i (I) =Rk[SPk,I,...,SPk,fl] Γ§R1[SP, ,. . . ,SP1]. If DR t(I), then Vt1 e Rk(DR), t2 E R1(DR), such that A tl.SPk,j = iEI ,...,n t2.SF,1 Let ir (DR)=(V,lab,ele,att,root). Every group node z in jiβ1 (DR) has a corre sponding tuple tin DR with all values of attributes of z coming from their respective fields in t. Thus every x e V, where lab(x) = Tk has an corresponding t1 e Rk(DR) with same attribute/field values. Together with the fact that DR (I), and every t2 eR1(Dj) has a corresponding y e V, where lab(y) = , ji (Dj) =1 is satisfied. I 3.1.7 XML to Relational Constraint Implication Reduction We can use relational constraint implication to test implication of XML constraints: Theorem 3.1.8 Given a set ofXML constraints Z and an XML constraint a, = a zffit() j=ji(a) To prove Theorem 3.1.8, we break it into two lemmas. Lemma 3.1.9 is re sponsible for the βifβ half of Theorem 3.1.8, and we prove Lemma 3.1.9 below. Lemma 3.1.10 is responsible for the βonly ifβ half of Theorem 3.1.8, and we prove Lemma 3.1.10 at the end. Lemma 3.1.9 Given a set ofXML constraints and an XML constraint a, if 1= a, then ji() = (a) 38 Algorithm TestXMLImplication(,a) Input: a set of XML FDs and INDs Β£, an FD or IND a Output: True if = a, false otherwise 1. return ji() 1= Is(a) Figure 3.9: Algorithm Testing XML Constraint Implication Proof For every relational database DR, Lemma 3.1.5 and Lemma 3.1.7 prove r(DR) i=if DR t)β’ If = a, then Ic1 (DR) = a. ByLemma3.1.4 andLemma3.1.6, I1(1r(DR)) 1= u(a). By Proposition 3.1.3, DR (r1DR). By Lemma 3.1.1, DR 1= i.t(a). Therefore, iβ() = ji(a) I Lemma 3.1.10 Given a set of XML constraints E and an X?vIL constraint a, if u() 1= i(a), then a Proof For every XML database D, Lemma 3.1.4 and Lemma 3.1.6 prove u (D) 1= u() if D = Z. If t() 1= s(a), then ji(D) 1= (a). By Lemma 3.1.5 and Lemma 3.1.7, irβCu(D)) = a. By Proposition 3.1.2, D ii1(,1(D). By Lemma 3.1.1, DA = a. Therefore, : a 3.1.8 Algorithm to Test XML Constraint implication by Relational Constraint Implication From the sections above, we have proved that testing XML constraint implication can be handled by testing relational constraint implication. We build an algorithm here showing the steps required to employ testing relational constraint implication to solve the XML implication problem. The algorithm is very simple, it transforms a set of XML constraints and a single XML constraint to relational constraints, then test if the set of transformed relational constraints implies the one transformed relational constraint. 39 3.2 Guaranteeing No Interaction Between XML FDs and INDs Now that we can use relational implication to test XML constraint implication, we can concentrate on complexity. Detecting XML constraint violation, and translat ing XML constraints via mappings all require an efficient XML constraint impli cation algorithm. XML constraint implication is undecidable because relational FD + IND impli cation is undecidable [13],[8]. As we proved in Section 3.1.6 we can use relational FD + IND implication to solve XML FD + IND implication, we inherit the unde cidability property. The undecidability complexity stems from interaction of FDs and INDs. How ever, according to Theorem 2.10.7 non-interaction of relational FDs and 1NDs is possible. Building on it, this section aims to achieve non-interaction of XML FDs and INDs. In order for XML FDs and INDs to not interact, after transformation of XML constraints to relational constraints, the relational constraints must satisfy condi tions in Theorem 2.10.7. For the rest of this section, we will introduce the necessary restrictions on our XML constraints such that after the FDs and INDs are transformed to their rela tional counterparts, they do not interact. 3.2.1 Retaining Proper Circular Property Proper Circular Property, defined in Definition 2.10.2, is one of the conditions of Theorem 2.10.7 necessary to avoid FD and IND interaction. Being able to retain proper circular property during transformation is necessary for efficient constraint implication. In order to retain the proper circular property of XML INDs, we deal with the slightly simpler retaining of acyclic XML INDs first. Extending on that result, we then explain how to retain the proper circular property. 40 root G1[Xi] c G4[Y] G4[pid] C G[id] G33] C G[Y] G2[pid] c G1[id] (b) Transformed circular Relational INDs (a) DTD with Circular INDs after Transformation Figure 3.10: Example Showing Circular INDs after Transformation Retaining Acyclic Property Intuitively, each IND directionally connects two DTD groups. Two INDs can chain together to form a longer path if the DTD group on the R.H.S. of one IND is the DTD group on the L.H.S. of the other IND. For example, if the first 1ND connects DTD group G1 to G2, and second IND connects G2 to G3, then there is a path from G1 to G3. A path becomes a cycle when a DTD group appears in a path more than once. For example, if we add a third ND that connects DTD group G3 to G1 to the two INDs above, we will form a cycle. Transforming a set of XML INDs with cycles to relational INDs will definitely violate the acyclic property (Definition 2.10.1). Consequently, FDs and INDs will interact, and we cannot efficiently test constraint implication. In addition, transfor mation created INDs involving βidβ and βpidβ may complete cycles that do not exist before; therefore, simply avoiding cycles in the set of XML INDs is not enough. In Figure 3.lOa, ND G,[X] Γ§ G1[Y] is demonstrated by an arrow from G, to G. The INDs in Figure 3.1 Oa is transformed to Figure 3.1 Ob with each R1 corresponds to G. As shown in Figure 3. lOb, the transformed INDs complete a cycle after the addition of transformation generated INDs. Our digraph has leaf DTD groups, DTD groups with no children, as vertices demonstrated by encircling ellipses. Arrows between vertices if there are INDs among ancestors of the 2 leaf nodes. Cycles in this digraph illustrate possible cy 41 des in the relational INDs, and the cycles are complete if within each, connected vertex, the group pointed to by the arrow head (incoming IND target) is a descen dant of the group that the outgoing arrow (outgoing 1ND source) points away. This is because after transformation, there will be INDs from descendants to ancestors, thereby completing circles by linking arrows within each vertex. Below, we give a set of formal conditions for a set of INDs to retain the acyclic property after transformation. Given a DTD A = (E,A, c1, fl, r), and a set of INDs I. I U {Gi[pid]cG2[ d] I Gi,G e E and G2 parent(Gi)} has cycles if: 1. 3aels.t.c:G[X]Γ§GY}, where G1 ancestor_or_self(G2),or 2. We construct a digraph where leaf DTD groups, DTD groups with no children, make up the vertices. The directed edges are: { (Vi,V) I G1[X] C G2[Y] E I, whereG1,G2V e B, V1,V2 are leaves, and G1 e ancestors_or..self(Vi), G2 e ancestors_or_self(V2) } If a cycle C exists in this digraph, we call I circular if for each V1 in C there is an IND coming in o: G_i[X1_ l Γ§ G[Y], and there is an IND going out o: G[X1] G1[Y1Γ·i], and G e descendants_orself(G1) Retaining Proper Circular Property Given the above requirements for retaining the acyclic property, we want to extend the property to proper circular INDs. Recall, in Definition 2.10.2, that for any two consecutive INDs I: R1_[X1_]c R1[XβJ, and : R1[X C R1[X1] in a proper circular IND cycle, X = X1. Sim ilarly, in a proper circular cycle of XML INDs, any two consecutive ENDs I: [x1_] C t4Xfl, and 11+1 : t1X] C x!+1, x, = x1. 42 The transformation generated INDs are on newly created attributes βidβ and βpidβ that do not exist in the original sets of attributes. There is no chance the transformation generated INDs can form proper circular cycles with the original set of INDs. In other words, the transformation generated INDs may complete cycles, but they never complete proper circular cycles. As a result, if there are no proper circular cycles in the original set of XML INDs, it is not possible to have proper circular cycles after transformation. Therefore, retaining proper circular property if the original set of of INDs I has no proper circular cyles, is the same as retaining acyclic properly in the last Section. However, if there are proper circular cycles in the original set of INDs, we want to make sure the transformation generated INDs do not interfere with the existing cycles. To avoid such interference, any IND in the proper circular cycle must not be: G1[X] C G[XJ where G1 e descendant(G). 3.2.2 Retaining Reduced Property Analogous to the projection definition in Definition 2.10.5 for relational FDs, we define projection for XML FDs below. The projection of a set of FDs F over G1 onto a set of attributes Y C G1, denoted byF[Y],isgivenbyF[Y]={G:Wβ>ZR:W-->ZEFandWZcY}. Again, analogous to the reduced definition in Definition 2.10.6 for relational constraints, we define the property reduced for XML constraints below. Definition 3.2.1 A set ofattributes Y Γ§ G is said to be reduced with respect to G and a set ofFD5 F over G1 (or simply reduced with respect to F if G1 is understood from context) if F [YJ contains only trivial FDs. A set of FDs and INDs Β£ = F U I is said to be reduced ifVG,[X] G[Y] E I, Y is reduced with respect to F. This definition is parallel to its relational counterpart. The transformation gen erated INDs involving attributes βidβ and βpidβ are all unary INDs. Thus they can not violate the reduced property. Moreover, the transformation generated FDs only have βidβs on the L.H.S. of the FDs. Since none of the original INDs have at tributes involving βidβs and the transformation generated INDs are all unary, the transformation generated FDs do not cause violation to a set of FDs + INDs that 43 are reduced. Therefore, the above XML reduced property definition is sufficient without the need to consider transformation generated constraints. 3.2.3 Inability to Retain Intersection Property The transformation generated FDs are in the form: R1: βidβ β* A,VA e R,. Thus it will violate the intersection property (Definition 2.10.4) if there are any non-trivial FDs in the original set of FDs. The violation arises because V FD F = R,: X β* A in the original set of FDs, A βidβ and A βpidβ. On the other hand, a transformation generated FD exists R,: βidβ β> A, and βidβ flX = 0. Since we assumed F is not a trivial FD, it is not possible for 0 β* A. Therefore, it is impossible to retain intersection property if the original set of FDs is not empty. 3.2.4 Retaining Split-freeness Property In the last section, we have shown the inability to retain intersection property. Luckily, it is not a necessary requirement. Non-interaction can still be achieved if split-freeness property (Definition 2.10.3) can be retained. Two nontrivial XML FDs of the forms G:XB β* A and G:YA βΓ· B are said to be cyclic. Definition 3.2.2 A set of FDs F over DTD i satisfies the split-freeness property ifVFj e E whenever there exist cyclic FDs, G:XB β* A, YA β+ B E Fj, then either G:Y β* B E Fj or G:(XflY)A β* B E The above definition is parallel to its relational counterpart. The transformation generated FDs always have βidβ on the L.H.S., while βidβ is on the R.H.S. of FDs in neither the original set of FDs nor the transformation generated FDs. Thus they never form cyclic FDs with themselves or FDs from the original set of FDs, and split-freeness retention is ensured. 44 3.2.5 Summary of non-interaction between FDs and INDs We have demonstrated the inability to retain the intersection property in Section 3.2.3, but defined properties of proper-circular in Section 3.2.1, split-freeness in Defini tion 3.2.2, and reduced in Definition 3.2.1 in XML context. A set of XML con straints satisfying these properties will have their transformed relational constraints satisfying proper-circular (Definition 2.10.2), split-freeness (Definition 2.10.3), and reduced (Definition 2.10.6) properties too. Therefore, the transformed constraints can enjoy non-interaction between FDs and INDs. This is a big step towards achieving efficient XIVIL constraint implication. 3.3 Efficient implication of XML constraints In Section 3.1, we have established that the problem of testing the implication of XML constraints can be reduced to the problem of testing the implication of relational constraints. We have also defined the necessary conditions in order for the transformed FDs and INDs not to interact. As a result, we can test implication of IND and FD independently. In Section 3.3.1, we will show that typed XML INDs become typed relational INDs after the transformation in Section 3.1.4. After that, we describe a polyno mial implication algorithm for typed relational INDs in Section 3.3.2. At the end, we briefly outline a linear implication solution for relational FDs in Section 3.3.3. 3.3.1 Retaining Typed property [1] shows relational FD implication has a linear solution, but IND implication is PSPACE-complete; therefore, we need an efficient solution. Fortunately, typed relational IND implication is polynomial. Using the typed XML IND defined in Definition 2.9.2, we show the typed property is retained after transformation. It is easy to see that the original set of INDs do not violate the typed property after transformation. However, the transformation introduces extra INDs involving βidβ and βpidβ. We define a new type for βidβ from each transformed relational table. Each βpidβ can only be involved in one IND, and its R.H.S. is an βidβ in another table. Thus this βpidβ belongs to the same type as the type of the target βidβ, and these 45 transformation generated INDs do not cause typed violations among themselves. Since these new INDs and the original set of INDs deal with two mutually exclusive sets of attributes; in addition, FDs and 1NDs donβt interact, there is no interaction of the two sets of INDs. Therefore, the typed property is retained after transformation. 3.3.2 Implication for typed relational INDs [7] shows the implication of typed relational INDs can be solved in polynomial time by solving a digraph reachability problem. We will describe the solution below. Given a set of relational schemas R = {R1 [Ui],... ,R[U]}, and a set of typed INDs over R. We want to check if E implies an typed IND a = Rk [X] Γ§ R1 [x]. Construct a digraph G = (VE), where V = {R1,.. .,R} and (Rp,Rq) E E if there is R[W} Γ§ Rq[Wj in such that XW. = a if there is a path from Rk to R1. Checking for existence of such a path is a digraph reachability problem, and it can be solved by breadth first search (BFS). Let the number of INDs be m, and the total number of attributes (union of attributes from all relation schemas) be n, the time complexity for BFS is 0(m), and the space complexity is 0(m). However, testing for attribute subsets during digraph building takes 0(n2) time, but 0(n) space (time complexity can be lowered if we accept a trade-off for higher space complexity). Therefore, the overall time complexity is O(m+n2), and the overall space complexity is 0(m+n). 3.3.3 Implication for relational FDs Relational FD implication is linear time [1]. [1] also describes the algorithm to test FD implication. The algorithm tests for implication by calculating the FD closure (attributes that the input set of attributes implies) on the L.H.S. attributes of the FD being tested. 46 3.3.4 Summary of XML Constraints Implication In this chapter, we have reduced XML constraint implication to a relational con straint implication problem. Trying to find an efficient implication algorithm, we have adopted proper-circular (Definition 2.10.2), split-freeness (Definition 2.10.3), and reduced (Definition 2.10.6) properties from [11]. We modified them to apply to XML FDs and INDs. As a result, we have shown how to prevent interaction be tween XML FDs and INDs. In addition, to make the problem of IND implication tractable, we have introduced the typed property in XML. Finally, we have em ployed existing efficient implication algorithms on relational FDs and typed 1NDs to efficiently test implication of XML constraints. 47 Chapter 4 Constraint Propagation When data is translated in the form of one schema to another, it carries some of the semantics with it. In this section, we introduce algorithms to translate constraints from a source DTD A5 to a target DTD AT via a set of HepToX mapping rules M (defined in Section 2.7). As we will soon find out the constraint translation algorithms are decidable, but take exponential time because our algorithms compute closures. The expo nential complexity makes this solution inefficient, especially when a large set of constraints are considered. Computing a cover of FDs on the decomposed relations in relational func tional dependency preservation also takes exponential time. Similar to the alter nate method in dependency preservation, we define algorithms to check constraint propagation. That is given a set of constraint on As, and a constraint aT on AT, whether V XML database D5 on A5, if I3 Zs implies M(Ds) 1= 4.1 Constraint Translation Data can be translated via mappings; data semantics can also be translated. In this section, we construct algorithms to translate XML constraints via HepToX mappings. Given a set of constraints on As, 6 (s, M) denotes the translation of via mapping rules M to a set of target constraints. 6 M) produces a set of 48 constraints on the target DTD AT such that: VXML database D5 on As, if I3 Es, then M(Ds) 1 ΓΆ(Es,M) If E5 satisfies the conditions set in Section 3.2, there will be no interaction of FDs and INDs in constraint implication. Since we are translating source con straints, the non-interaction is very important to make the translation efficient be cause dependency preservation requires implication (possibly computation of clo sures). Therefore, we assume E satisfies conditions in Section 3.2. Our problem shares some similarities to the (functional) dependency preserva tion problem in the relational context. In fact, for the problem of FD translation, it employs the same algorithms as dependency preservation. 4.1.1 FD Translation Each mapping rule M1 e M has a head and body. The rule head is a tree expression of a path from the target root to a group node. On the other hand, the rule body is a tree expression from the source root to multiple source group nodes. Each target attribute is only allowed to have one correspondence. With this re striction, there does not exist the problem of union of values from different sources. Therefore, FDs translated from one source to a target group will not be interfered by data from other sources. With the above established, we introduce the FD translation algorithm in Figure 4.1. As in dependency preservation, this is an algorithm that requires computation of FD closures; thus it requires exponential time. The algorithm is simple and helps to express the intuition behind FD translation. Given a set of source FDs F5, the algorithm first computes the closure F of F5. It then examines the leaf target group TO1 of each mapping rule M, to see if any of the FDs in F have all their attributes mapped to TG1. If an FD satisfies these conditions, then it is translated to its form in TG1 and added to the output set of target FDs FT. The actual FD translation at line 7 is very simple. Group instances of SG will have their data in X U Y transferred unchanged and as a whole to attributes group instances U U V of TG1. Therefore our translation is sound. 49 Algorithm 8(Fs,M) Input: a set of FDs F5 over source DTD A, a set of mapping rules M from z\5 to IXT Output: a set of FDs FT over target DTD T 1. computeF.={SG:Xβ*YFs =SG:Xβ*Y} 2. for each M e M 3. M1 has a rule head which contains a target path TP1, TP1 is a path of groups root,.. . ,TG, 4. for each group SG1 e rule body of M1 5. for each FD=SG:Xβ* Y E F 6. if M maps X U Y to attributes U U V in TG1, wherelXl=zIUIandIYI=IVI 7. thenFT=FTU{TGI:Uβ*V} Figure 4.1: FD Translation Algorithm β ΓΆ(Fs,M) This FD translation algorithm runs in exponential time; thus it is not very effi cient. 4.1.2 IND Translation IND Translation Algorithm as shown in Figure 4.2 translate a set of IND5 Is on the source schema Ly, to a set of INDs βT on the target schema Line 2,3 finds out all the DTD groups on both source and target schemas, and label each of them by indexes. Line 6-10 finds all the correspondences between DTD groups and store them in a 2-dimensional array W. We then compute I closure of js (all the INDs that can implied by is). If any IND Iβ e 4 has attributes on its L.H.S. fully mapped to a DTD target group TGi, and attributes in its R.H.S. fully mapped to another DTD target group TGqI. Then we add this newly translated INDtOIT. Even though typed IND implication is polynomial, (Is, M) computes closure of I; thus it runs in exponential time and is not an efficient algorithm. 50 Algorithm 6(Is,M) Input: a set of INDs 1s over source DTD Deltas, a set of mapping rules M from Deltas to DeltaT Output: a set of INDs IT over target DTD DeltaT 1. 2. 3. 4. 5. 6. 7. 1T{ } examine M, find out all the source groups SGs and target groups TGs index SGs and TGs n=sizeOf(SGs) m=sizeOf(TGs) create a 2-dimensional array W of size n x m foreach M E M 8. M1 has a rule head which contains a target path TP, TF is a path of groups root,. . . ,TG1 9. foreach group SG3 e rule body of M1 10. W[j] [i] = {(As,p,AT,q) I As,p E R(SGj),AT,q E R(TG), there is a correspondence between As, and Ar,q in M} 11. compute I from Is 12. foreach I E I, where i=SGp[X]cSGq[Y1 13. forr=ltom 14. if (A,Aβ)EW[pJ[r], VAEX pβ=r if (B,Bβ)eW[q][r], YBeY qβ =r Xβ={} foreach AEX Xβ=Xβu{Aβ}, where (A,Aβ)EW[p][pβ] Yβ={} foreach BEY Yβ=Yβ U{Bβ }, where (B,Bβ )EW[qJ [qβ] βT 1TU{TG1[X] TGqi[Yβ]} 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. Figure 4.2: IND Translation Algorithm β ΓΆ(Is,M) 51 4.2 Testing Constraint Propagation Given the undesirable exponential time complexity of constraint translation in Sec tion 4.1, we want to find an efficient solution to check constraint propagation. In this section, we develop efficient algorithms to test constraint propagation. We give our problem statement below: Given a set of constraints Β£s on source DTD Ay, a set of mapping rules M, and a constraint GT on target DTD ar is propagated from via M if V XML database D that conforms to S and M(Ds) 1= o and every attribute in T has a correspondence in S. We donate this Z j=M 0T. Unlike traditional dependency preservation in which all the target attributes are mapped, our users may not create correspondences for all the attributes. Al though a constraint may be satisfied at the target because of missing values (from target attributes without correspondences), we feel a constraint cannot be consid ered βpropagatedβ if some of its attributes are not mapped. Therefore, we require all attributes in aT to have a correspondence in L. Even though we have already established polynomial implication of XML con straints, testing constraint propagation is still not a trivial problem. We have to be very careful to construct our environment to achieve decidability and tractability. Firstly, users are allowed to create high-level correspondences between schemas that greatly reduce complications. The forbidding of multiple source correspon dences to a single target node takes away the union operator. It guarantees data will come from only one part of the source. Secondly, one may suspect that the set of translated constraints will not sat isfy conditions set in Theorem 2.10.7; thus may lead to undecidability in testing propagation. In target constraint implication ZT j= aT, all XML database DT that conforms to and T has to satisfy aT. However, in constraint propagation, we are only considering a subset of databases, namely VDs that conforms to and, M(Ds) 1= IT, and we need to check if M(Ds) 1= aT. If we can test the constraint in question aT in its corresponding form in the source, then we can guarantee trans lated data coming from that part of the source will always satisfy aT. Thirdly, there is the issue of NULL and missing values. NULL values can occur if data coming from the source already contains NULL values, or an attribute in 52 the target has no correspondence (not mapped) in the source. In the prior case, if the NULL values in a source XML database D violates source constraint sβ then D will be violating the assumption of constraint propa gation. Therefore, there is no propagation. On the other hand, if VDs = Ys and D does not contain NULL values, M(D) fr=M UT, then VDs F= Ys and D contains NULL values, M(D) I=M UT. This is because when a D with NULL values satisfies Es, we can substitute unique values in place of these NULL values to produce 4, and it will still satisfy Es. If M(4) 1= Uiβ, then M(Ds) 1= oi- because these unique values generated behave exactly as NULL values in this case as they do not equate to any other values. Therefore, the existence of NULL values do not prevent UT from propagating from Es. In the latter case, the target attribute has no correspondences in the source; thus it is impossible to find a corresponding source constraint. It follows if UT being tested involves such attributes, it is not propagated. As with the previous algorithms, we are able to guarantee no interaction of FDs and INDs in Section 3.2; therefore, we can separate the testing of propagation into two algorithms; one will test the propagation of FDs while the other is on INDs. 4.2.1 Testing FD Propagation The intuition behind the FD Propagation Testing algorithm (Figure 4.3) is to trace data from the source that will make up the data involving the FD U {TGk : X β* A} in question. For such data to conform to UT after translation, it must conform to some constraints in the source. Hence, we must first find out what these constraints are, then test if the FDs E5 given in the source can imply them. Due to the one-to-one nature of our mappings, every mapped target attribute must be traced to one and only one attribute in the source. Thus checking if these traced attributes satisfy a corresponding source FD is sufficient to test propagation of UT. To find this FD in the source, the first attribute we should look at is A on the R.H.S. of UT. In line 8,9, we use the binding variable $A to trace the mapped attribute Aβ in the source. In doing so, we also learn the source group SGk in 53 which Aβ belongs. Since the definition of FDs confines an FD to attributes within a single group and our mapping rules do not consider joins among source groups, the corresponding FD we need to check must be in SGk where Aβ belongs. On a side note, we only need to check attributes in SGk because of no interac tion of FDs and INDs. Therefore, we donβt need to worry about attributes in other groups having an effect on the FDs in SGk. This also explains why the algorithm only accepts FDs for sβ’ We can easily separate INDs and FDs from each other in a set of mixed constraints, and feed this algorithm with just the FDs. Next in line 11-14, we look for all the attributes in X that are mapped from SGk, and put all their corresponding source attributes into Xβ. Notice that Xβ I X Iβ X is a superset of mapped attributes from Xβ. Due to the augmentation rule of FD implication, if a subset of X implies A, then X must imply A. At the end, we check if I implies SGk : Xβ = Aβ, returns true if it does, and false otherwise. The algorithm checks FD implication at the end, while the rest of the algorithm runs in time 0(m), where m is the number of correspondences. Thus the algorithms runs in linear time because testing implication of FDs runs in 0(n), where n is the number of attributes. For example, we can check if LibFD#1: UBCLibrary.person : name β+ DOB in Table 1.2 is propagated from UBCBooksrore via the set of Mapping Rules M in Figure 1.4. At Line 8 of our algorithm, we find that the R.H.S. attribute DOB of LiBFD#1 appears in the mapping rule marked as β1β in Figure 1.4, and it is assigned the variable $DOB in target DTD group person. At Line 9, we find $DOB is bound to the attribute DOB of the source DTD group person in the body of mapping rule β1β. At Line 12, name, which appears on the L.H.S. of LiBFD#1, is assigned to $name. At Line 13, $name is bound to the attribute name of the same source DTD group person as DOB. At Line 15, we test if UBCBookstore UBCBookStore.person : name βΓ· DOB. This is of course true because the FD is the same as BookFD#1 in ZuBcBookstore; thus our algorithm returns true. 54 Algorithm TestingFDPropagation Input: a set of FDs Es over source DTD A5 an FD 0T = TGk : X ββ A over target DTD AT, a set of mapping rules M that maps A5 to AT Output: true if ESl=MoT 1. M=null 2. forMeM 3. head(M1)is a target path of groups TP1=root,.. . ,TG1 4. ifTG,=TGk 5. then Mk=Mj 6. if Mk=null 7. then return false 8. in head(Mk) the leaf group assignment has the form: TGkβfk(...)[ A β* $A, we denote it as headTGk(Mk) 9. in body(Mk), there exists a source group (ignoring its descendant groups): SGk βf $SGk [ Aββ$A, we denote it as bodysGk (Mk) 10. Xβ={} 11. foreachB E X 12. there is B β* $B in headi-Gk(Mk) 13. if there exists Bβ β* $B in bodysuk(Mk) 14. then Xβ=Xβu{Bβ} 15. returnE =Xββ*Aβ Figure 4.3: Testing FD Propagation Algorithm 55 4.2.2 Testing IND Propagation Testing IND propagation is similar to testing FD propagation in the previous sec tion in that we trace attributes through mappings to identify the one corresponding IND in the source and check if a set of source INDs s implies it. Similarly, because of no interaction of FDs and INDs (Section 3.2), we donβt have to consider FDs in our algorithm. In addition, our mapping rules use Cartesian product of source groups to product instances of target groups; thus there is no equality of attributes among source groups. Therefore, for each left and right hand sides of the traced IND, we only need to consider attributes within only one source group for each side. Our algorithm traces source attributes of both L.H.S. and R.H.S. of the IND in question UT = TG[X1,...,X] Γ§ TGq[Yi,...,Yn]. It makes sure the traced set of attributes Xβ of the L.H.S. belongs to only one source group, and the traced set of attributes of the R.H.S. Yβ also belongs to only one source group. Notice that both Xβ and Yβ are lists because the L.H.S. and R.H.S. sets of attributes in an IND are sequences, and their order matters. Moreover, unlike in FD propagation, IND implication does not have augmen tation (augmenting attributes to attributes already in the constraint produces satis fying constraints). Hence, I 1=1 X 1=l Yβ Y , and every attribute in IND must have a corresponding attribute in the source. Finally, if we can find Xβ and Yβ which both contain attributes within their respective source groups SG and SGq, then S =M T ifs = SG [Xβ] SGq [Yβl. The algorithm checks IND implication at the end, while the rest of the algo rithm runs in time 0(m), where m is the number of correspondences. Thus the algorithms runs in 0(n2) time, where n is the number of attributes, because testing implication of typed INDs runs in 0(n2). For example, we can check if LibIND#1: UBCLibrary.publication.call#.author[name,DOB] Γ§ UBCLibrary.person[name, DOB] in Table 1.2 is propagated from EUBCBookstore via the set of Mapping Rules M in Figure 1.4. At Line 13 of our algorithm, we find that the L.H.S. attributes name and DOB of LiB1ND#1 appear in the mapping rule marked as β2β in Figure 1.4, and 56 Algorithm TestinglNDPropagation Input: a set of INDs s over source DTD As, an IND aT = TG[X1,.. . TGq[Yi,. . . , Y,] over target DTD AT, a set of mapping rules M that maps A5 to AT Output: true if s j=M aT 1. MpMqfluIl 2. forMEM 3. head(M1)is a target path of groups TP=root,.. . ,TG, 4. ifTG=TG 5. then M=M1 6. ifTGiTGq 7. then MqMi 8. if M=nulI or Mq=nuIl 9. then return false 10. SGpSGqI)Ul1 11. Xβ=Yβ=() 12. for each X, E (Xj,. . . , X,) 13. if ,1ββ$Xe head(M) 14. then return false 15. there exists X β* $X1 e a source group SG e body(M) 16. if SG ynul1 and SG SG 17. then return false 18. SG=SG 19. Xβ.append(Xβ) 20.foreachYE(Y,..., ) 21. if / β $Y e head(Mq) 22. then return false 23. there exists Yβ β* $1β, e a source group SG E body(Mq) 24. if SGq #null and SGq # SG 25. then return false 26. SGq = SG 27. Yβ.append(Y/) 28. return Ys 1= SG[Xβj Γ§ SGq[Yβl Figure 4.4: Testing ND Propagation Algorithm 57 they are assigned the variables $name and $DOB in target DTD group publication respectively. At Line 15, we find $name and $DOB are bound to the attributes au tho, name and authorDOB of the source DTD group book in the body of mapping rule β1β. At Line 21, we find that the R.H.S. attributes name and DOB of LiBIND#1 appear in the mapping rule marked as β1β in Figure 1.4, and they are assigned the variables $name and $DOB in target DTD group person respectively. At Line 23, we find $name and $DOB are bound to the attributes name and DOB of the source DTD group person in the body of mapping rule β1β. At Line 28, we test if: UBCBookstore = UBCBookStore.subject .book[author.name, author.DOB] c UBCBookStore.person [name, DOB) This is of course true because the IND is the same as BookTND#1 in UBCBookStore thus our algorithm returns true. 4.3 Constraint Propagation Summary To test constraint propagation, we first attempted to translate all source constraints to a cover of target constraints, then one could use implication algorithms in Chap ter 3 to check if the target constraint 0T is implied by this cover. However, testing constraint involves the computation of closures; thus making the algorithm to run in exponential time. Therefore, we developed constraint propagation testing algo rithms that avoids computing a cover of target constraints by translating oT back to its form in source schema, and testing if s implies it. The propagation testing algorithms run in polynomial time. 58 Chapter 5 Detecting Constraint Violations When a constraint propagates, it guarantees that translated data does not violate the constraint at the target. However, when one does not propagate, it does not necessary imply violation. For example, in Figure 1.1 it is obvious that LibFD#3 in the target DTD UB CLibrary is propagated; thus it will be satisfied by data translated from UBCBook store using the specified mapping. However, LibFD#7 is not propagated. According to the mapping rules, no data will be translated under the group publisher. Since translated instances never contain publisher, LibFD#7 cannot be violated. This example demonstrates that propagated constraint implies satisfaction, but constraint satisfaction does not necessary imply propagation. As in checking constraint propagation, we can check for the violation of a target FD by only a set of source FDs and mapping rules independent of any INDs, and vice versa for checking IND violation. In the following two sections, we are going to introduce algorithms to check for FD and IND violations respectively. 5.1 Checking for FD violation The algorithm Figure 5.1 presented below takes a set of FDs s over source DTD b5, an FD 0T = TGk : X β* A over target DTD and a set of mapping rules M. It 59 Algorithm Checking for FD violation Input: a set of FDs s over source DTD As, an FD ar = TGk : X β* A over target DTD AT, a set of mapping rules M that maps A.5 to Ar Output: true if 3 an XML instance I3 over As that conforms to s, s.t. M(Ds) 1. if TestingFDPropagation(Es, ar , M) 2. then return false 3. M=null 4. forM,EM 5. head(M1)is a target path of groups TP1=root,. . . ,TG1 6. ifTGj=TGk 7. then Mk=Mj 8. if Mk=null 9. then return false 10. foreachX, eX 11. if ,X1 β $X1 e head(Mk) 12. then return false 13. return true Figure 5.1: FD violation Checking Algorithm determines if any data instances D5 over A that conforms to s would violate ar after the translation of D5 through M. Propagation implies no violation; therefore, line 1,2 of the algorithm in Figure 5.1 check for propagation, and return false (no violation) if o- is propagated from s through M. For M(Ds) to violate ar, we need all attributes on the LH.S. of 0βjβ to be equal on two instances x,y over group TGk, but x.Ay.A. As our stance on NULL and missing values is that they are not comparable; thus x.X and y.Y cannot be equal if any of them contain NULL values or are missing some values. Hence, in order to create violations, all attributes in X must be mapped. On the other hand, if all attributes in X are mapped, whether A is mapped is 60 irrelevant. If A is mapped, together with the fact that T is not propagated, we can always find D5 such that there are translated instances x,y in M(Ds) and x.X=y.X but x.Ay.A. If A is not mapped, then we just need to find D5 such that there are translated instances x,y in M(Ds) and x.X=y.X. The absence of mapping of A guarantees x.Ay.A. This algorithm first checks for propagation which runs in 0(n) time, where n is the number of attributes. The rest of it runs in time 0(m), where m is the number of correspondences; thus, we can check for FD violations in 0(n) time. For example, we can check if LibFD#7: UBCLibrary.publisher: name β* addr will cause violations after translation. It is obvious that LibFD#7 is not propa gated; therefore, testFDPropagation at Line 1 is false. The algorithm go through all the mapping rules M (Figure 1.4) at Lines 4-6, but will not find a mapping rule mapping to the publisher DTD target group node because there are not corre spondences to publisher. Therefore, at Line 9, our algorithm returns false; thus it correctly answers LibFD#7 will not cause violations. 5.2 Checking for IND violation The algorithm in Figure 5.2 takes a set of INDs Β£s over source DTD i, an 1ND YT = TG[X1,...,X] TGq[Yi,. ..,Y,] overtargetDTDAT, and asetof mapping rules M. It determines if any XML database D5 over S that conforms to will violate ar after the translation of D5 through M. Similar to Figure 5.1, we check for propagation first. If aT propagates, then there cannot exist violations. In order for M(D) to violate ar when ar is not propagated, it must contain an instance x of group TG, andy of group TGq, such that i e [1,.. . , n], x.X1 y.Y. In this case the existence of mappings for any X1 or Y is irrelevant. This is because, if mappings are missing for any of X, or Yj, the missing values guaran tee non-equality; thus they lead to violation. On the other hand if all K, and l are mapped, we can easily find an XML instance I3 over with Ds s.t. M(Ds) a. Failure to find such D implies s 1=M aT which contradicts our as sumption that aT is not propagated. 61 Algorithm Checking for IND violation Input: a set of INDs s over source DTD A5, an IND 0T = TG[Xi,.. . ,X,] c TGq[Yi,.. .,Y,J over target DTD AT, a set of mapping rules M that maps A5 to AT Output: true if an XML instance D over A5 that conforms to Es, s.t. M(Ds) 1. LHGroupFound=false 2. forM,EM 3. head(M1)is a target path of groups TF=root,. . . ,TG1 4. ifTGETF 5. then LHGroupFound=true 6. if LllGroupFound=false 7. then return false 8. if TestinglNDPropagation(Es, T, M) 9. then return false 10. else return true Figure 5.2: IND violation Checking Algorithm Therefore, the checking for IND violation is surprisingly simple; if any at tributes in TG or its descendants are mapped, then it is just the opposite outcome of testing IND propagation. Otherwise, since there can be no translated instances of TG; thus aT is satisfied because there are no instances on the L.H.S. of aT to violate it. Checking existence of mappings in TG or its descendants takes 0(m), where m is the number of correspondences. Hence, the algorithm runs in 0(n2) time, where n is the number of attributes, because the IND propagation testing algorithm that we employ runs in 0(n2). For example, we can check if LibIND#2: UBCLibrary.publication[pub] UBCLibrary.publisher[namej will cause violations after translation. At Line 3, the algorithm finds the L.H.S. tar get group publicatoin is in mapping rule marked as β2β in Figure 1.4. Since there will be instances on the L.H.S. of the IND, the algorithm employs the TestinglND 62 Propagation algorithm in Section 5.1 to test if LibTND#2 propagates. The R.H.S. of LibIND#2 is not mapped; therefore, LibIND#2 does not propagate, and our algorithm returns true (meaning LibIND#2 will cause violations). 63 Chapter 6 Resolving Violations With the ability to detect constraint violations, we are setup to fix violations in this chapter. However, violations may be caused by many factors such as human error and heterogeneous schemas. There is no way to determine the reasons behind the violations automatically; only the person who creates the correspondences can decide how best to rectify the violations. The purpose of this chapter is to give options to the person trying to repair constraint violations. Each solution has its advantages and disadvantages, and we will discuss them in the following sections. Which method is best suitable for a correction differs depending on different situations. Therefore, we will leave the final decision making to administrators creating the correspondences. Unlike the rest of this thesis, the repairs presented here are heuristic solutions. There is no guarantee that one repair wonβt cause other violations. They merely provide options for administrators to produce better quality mappings. We categorize the repairs into repairs that fix FD violations and the ones that fix IND violations. 64 6.1 Repairing FD Violations 6.1.1 Scaling Back Studying the algorithms in Detecting Constraint Violations Chapter 5, one can eas ily come to the conclusion that removing some of the correspondences can fix violations. This is indeed a method that is consider the safety net because it works in all situations. However, this has to be considered a pessimistic option as the resulting mappings will translate less information after deleting some correspon dences. In our example in Figure 1.1, we have shown that the FDs LibFD#4 and LibFD#5 are violated. Here, in order to demonstrate better, we simplify our example by re taining only attributes and correspondences that are necessary for our demonstra tion, and it is shown in Figure 6.1. isbn An easy way to fix the violations is to delete one of the correspondences to subject, authoi name, or year. By doing so, there will be missing values of the L.H.S. of LibFD#4 and LibFD#5. Hence, L.H.S. of LibFD#4 and LibFD#5 will never equate, and the two FDs cannot be violated. Formally, if FD a = TGk SP1, . . . , SP, β* SP is violated, then delete the cor respondence for TGk.S19 for any i β¬ [1,... ,n] In our example, if we delete the correspondence between years of the two UBCBookstore subject name book \ Figure 6.1: FD Violating Example 65 schema, the translated data will be missing the attribute year. The resulting data will satisfy LibFD#4 and LibFD#5 because of the missing values of year. 6.1.2 Source Concatenation Scaling back deletes correspondences; thus some information is lost. It can be undesirable. As seen in the last section if we delete the correspondence between years of the two schemas in Figure 6.1 the resulting data will satisfy LibFD#4 and LibFD#5. However, the resulting translated database will be missing year; users are going to have a hard time searching for publications. In this section, we introduce a method that let us keep as much information as possible while satisfying target FDs. Here, we introduce a new type of correspon dence, the concatenating correspondence. A concatenating correspondence connects multiple source attributes within one DTD group to a single target attribute. During translation, each source group in stance will concatenate the strings of the source attributes of the correspondence and put the resulting string in the target attribute. UBCLibrary publication Figure 6.2: Fixing FD Violation by Concatenation Example Figure 6.2 shows a concatenating correspondence connecting isbn and year at the source to year at the target. Translating the instance from Figure 6.2 will produce Figure 6.3. It is clear that the new translated instances no longer violate UBCBookstore subject name book\c 66 UBCLibrary publication publication call# title #pages call# title #pages ,7β 540 .7 fl.......DB 1 320 subject author year subject author year DB ,// \ 2001001 DB ,// \ 2001002 name DOB name DOB Smith 19490101 Smith 19490101 Figure 6.3: Translated UBCLibrary After Repair By Concatenation Mapping LibFD#4 and LibFD#5. This is achieved by concatenating attributes that implies the corresponding R.H.S. of target FDs to attributes of L.H.S. of those FDs. In our example, isbn implies title and #pages which are correspondences of R.H.S. of LibFD#5 and LibFD#5. Thus we concatenate isbn to a correspondence that is in the L.H.S. of LibFD#4 and LibFD#5. Administrators establishing the corre spondences need to pick attributes in the source that implies Aβ, where Aβ is the source correspondence of A in the violated target FD uT β TGK : X β* A. Which correspondence to concatenate to in X is totally up to the administrator establish ing the correspondences. Here we choose the correspondence between year, and concatenate values in isbn to every year translated to UBCLibrary. 6.2 Repairing IND Violations 6.2.1 Scaling Back When an IND is violated, there are instance on the L.H.S. of the violated IND aT that do not have corresponding instances on the R.H.S.. One way to avoid IND violation is to make sure no instances on the L.H.S. exist. Again, we can achieve this by deleting correspondences such that no instances appear on the L.H.S. of aT. Since we treat missing values in a group instance as NULL values, simply deleting the correspondences that makes up the L.H.S. attributes is not enough. We need to delete all the correspondences related to the 67 name target group so that no group instances will ever appear. Figure 6.4 shows just the part of our running example that violates LibJND#2. Following the correspondences, instances of publications are translated, each pub lication has a pub. LibIND#2 requires each pub in publication to appear in pub lisher name, but this is not satisfied by our correspondences. Using the correspondence deleting method, we have to delete every correspon dence in the publication. There will not be any more publication instances; thus LibIND#2 will no longer be violated. As with the Scaling Back solution of FD violation repairs, deleting correspon dences loses information. It can be considered worse in this case because all cor respondences of the L.H.S. target group have to be deleted. 6.2.2 Split Correspondence Instead of making sure the L.H.S. of an IND aT = TGk[SPk,I,. . . ,SP,j c TG1[sP,i,. .. ,SP1j has no instances, we can make sure the R.H.S. contains all the instances of the L.H.S.. A split correspondence is actually multiple correspondences all originating from the same source attribute. After translation, all target attributes will have the same set of source attributes. This is a constructive method to correct IND violations. For every mapped attribute TGk.SPk, on the L.H.S. of aT, we find its corresponding source attribute Figure 6.4: IND Violation Example 68 Figure 6.5: IND Violation Example Fixed by Split Correspondence SGk.SPk,,, and create another correspondence from SGk.SPk,1 to the corresponding attribute TG1.SF in the R.H.S. of the IND. For every unmapped attribute TGk.SPk, on the L.H.S. of U, we need to find any attributes in SGk to map to TGk.SPk, and TG1.SP,. In Figure 6.4, our example violates Lib1ND#2. Using the split correspondence method, we create another correspondence to publisher name as shown in Figure 6.5. It is obvious that every translated pub appearing under publication will appear in publisher name. Therefore, LibIND#2 is satisfied. 6.3 Summary of Repairs In this chapter, we have suggested 3 heuristic ways to repair violations. Roll back of correspondences can repair all violations, but it can be undesirable due to loss of infonuation. Concatenation and split correspondences are more constructive, but they may modify translated data. Therefore, administrators have to decide which type of repair to apply, and if the sacrifice is worth the gain in constraint satisfaction. 69 Chapter 7 Conclusion and Future Work Conclusion We have proposed ways of repairing correspondences such that translated instances would satisfy target constraints. In order to repair, we first have to know what we are repairing. Thus we have developed algorithms to detect violations, check propagation of constraints, and check implication of constraints. To make sure our problem is tractable, we have investigated simplified DTD and restricted conditions for constraints such that the above algorithms all run in polynomial time. These efficient results are all on the two most fundamental type of constraints: functional dependencies and inclusion dependencies. Our mappings are generated from high-level correspondences that can be eas ily created by anyone who knows the two schemas well. We suggest repairs on these high-level correspondences. In addition, repairing these high-level corre spondences is easier for everyone to understand and to choose the best method of repairing. With consistent mappings translating data that is consistent with the target con straints, application at the target are guaranteed to run soundly. Future Work This thesis opens up several new problems. Data exchange can be more powerful if data from one source can be relayed to 70 other targets through chains of mappings. Instead of creating mappings between every pair of schemas we want data exchange, we can make use of existing map pings to form a network. Ideally, data exchange is possible if two schemas are connected via a path of mappings. Our constraint propagation algorithm will be more powerful if we can test for constraints propagation along a chain of mappings. To reason on the most basics of constraints: functional dependencies and inclu sion dependencies, we have yet to study some of the other popular constraints such as keys and foreign keys. Considering keys and foreign keys involves more studies of non-leaf nodes, and treating groups as entities. This is quite different from our study of only leaf nodes. We only need to deal with sets of values. However, we believe our studies is a very useful building block towards keys and foreign keys violation detections and repairs. Similar to the idea of relative keys in [6], there can be relative FDs and INDs. For a relative FD, the L.H.S. of it can involve ancestor attributes of the DTD group that the FD is based on. Relative IND is similar in that sequences of attributes on either sides of it can span over several DTD groups that are descendants and ancestors of each other. However, with attributes spanning over multiple DTD groups, our constraint implication that based on reduction to relational constraint implication will no longer work. Therefore, we need to find another implication system that is effi cient. 71 Bibliography [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. ISBN 0-201-53771-0. β pages 20, 45, 46 [2] M. Arenas and L. Libkin. A normal form for xml documents. In PODS β02: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles ofdatabase systems, pages 115β125, New York, NY, USA, 2002. ACM Press. ISBN 1-58 1 13-507-6. doi :{http:f/doi.acm.org/1 0.11 45/303976.303983}. β pages 10, 16 [3] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS β99: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 68β79, New York, NY USA, 1999. ACM Press. ISBN 1-58 1 13-062-7. doi:{http://doLacm.org/10.1 145/303976.303983}. β pages 5, 8 [4] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD β05: Proceedings of the 2005 ACM SIGMOD international conference on Management ofdata, pages 143β154, New York, NY USA, 2005. ACM. ISBN 1-59593-060-4. doi:http://doi.acm.org/1 0.1145/1066157.1066175. pages 5, 8 [5] A. Bonifati, E. Q. Chang, A. V. S. Lakshmanan, T. Ho, and R. Pottinger. Heptox: marrying xml and heterogeneity in your p2p databases. In VLDB β05: Proceedings of the 31st international conference on Very large data bases, pages 1267β1270. VLDB Endowment, 2005. ISBN 1-59593-154-6. pages 4, 14, 18, 19 [6] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Keys for xml. Computer Networks, 39(5):473β487, 2002. β pages 1, 71 [7] M. A. Casanova and V. M. P. Vidal. Towards a sound view integration methodology. In PODS β83: Proceedings of the 2nd ACM 72 SIGACT-SIGMOD symposium on Principles of database systems, pages 36β47, New York, NY, USA, 1983. ACM. ISBN 0-89791-097-4. doi:http:lldoi.acm.org/10.1145/588058.588065. β pages 21,46 [8] A. K. Chandra and M. Y. Vardi. The implication problem for functional and inclusion dependencies is undecidable. SIAM J. Comput., 1 4(3):67 1β677, 1985. pages 7, 22, 25, 40 [9] 5. Davidson, U. 0. Pennsylvania, W. Fan, C. Hara, J. Qin, and T. U. Propagating xml constraints to relations. In In ICDE, pages 543β554, 2003. pages 7, 17 [101 W. Fan and L. Libkin. On xml integrity constraints in the presence of dtds. J. AcM, 49(3):368β406, 2002. ISSN 0004-5411. doi:http:lldoi.acm.org/1 0.1145/567112.567117. β pages 1, 7, 11, 13 [11] M. Levene and G. Loizou. How to prevent interaction of functional and inclusion dependencies.. Information Processing Letters, 71(3β4):1 15β125, 1999. URL citeseer.ist.psu.edu/Ievene95how.html. β* pages 22, 24, 47 [12] S. Lu, Y. Sun, M. Atay, and F. Fotouhi. On the consistency of xml dtds. Data Knowl. Eng., 52(2):231β247, 2005. ISSN 0169-023X. dol :http://dx.doi.org/1 0.101 6/j.datak.2004.05.007. β pages 11, 12 [13] J. C. Mitchell. The implication problem for functional and inclusion dependencies. Inf Control, 56(3):154β173, 1983. ISSN 0019-9958. doi:http://dx.doi.org/10.1016/S0019-9958(83)80002-3. pages 7, 22,40 [14] L. Popa, Y. Velegrakis, M. A. HernΓ‘ndez, R. I. Miller, and R. Fagin. Translating web data. In VLDB β02: Proceedings of the 28th international conference on Very Large Data Bases, pages 598β609. VLDB Endowment, 2002. β pages 4, 7, 14, 18, 19 73
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- On detecting and repairing inconsistent schema mappings
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
On detecting and repairing inconsistent schema mappings Ho, Terence Cheung-Fai 2008
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | On detecting and repairing inconsistent schema mappings |
Creator |
Ho, Terence Cheung-Fai |
Publisher | University of British Columbia |
Date Issued | 2008 |
Description | Huge amount of data flows around the Internet every second, but for the data to be useful at its destination, it must be presented in a way such that the target has little problem interpreting it. Current data exchange technologies may rearrange the structure of data to suit expectations at the target. However, there may be semantics behind data (e.g. knowing the title of a book can determine its #pages) that may be violated after data translation. These semantics are expressed as integrity constraints (IC) in a database. Currently, there is no guarantee that the exchanged data conforms to the targetβs ICs. As a result, existing applications (e.g. user queries) that assume such semantics will no longer function correctly. Current constraint repair techniques deal with data after it has been translated; thus take no consideration of the integrity constraints at the source. Moreover, such constraint repair methods usually involve addition/deletion/modification of data, which may yield incomplete or false data. We consider the constraints of both source and target schemas; together with the mapping, we can efficiently detect which constraint is violated and suggest ways to correct the mappings. |
Extent | 1428416 bytes |
Subject |
Semantics Schema Mappings XML DTD |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-02-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0051351 |
URI | http://hdl.handle.net/2429/4126 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2008-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2008_fall_ho_terence_cheung-fai.pdf [ 1.36MB ]
- Metadata
- JSON: 24-1.0051351.json
- JSON-LD: 24-1.0051351-ld.json
- RDF/XML (Pretty): 24-1.0051351-rdf.xml
- RDF/JSON: 24-1.0051351-rdf.json
- Turtle: 24-1.0051351-turtle.txt
- N-Triples: 24-1.0051351-rdf-ntriples.txt
- Original Record: 24-1.0051351-source.json
- Full Text
- 24-1.0051351-fulltext.txt
- Citation
- 24-1.0051351.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0051351/manifest