SeMap: A Generic Schema Matching System

by

Ting Wang

B.Sc., Zhejiang University, 2004

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia

August, 2006

© Ting Wang 2006

Abstract

The rapidly growing number of autonomous data sources on the web makes the need for effective tools for creating semantic mappings increasingly crucial. Moreover, the goal of allowing applications to have more expressive semantics requires a change in focus. While most previous work focuses on creating mappings in specific data models for data transformation, it fails to capture a richer set of possible relationships between schema elements. For example, current schema matching approaches might discover that 'TA' in one schema equals 'grad TA' in another, even though the relationship can be modeled more accurately by saying that 'grad TA' is a specialization of 'TA'. This increased semantics of the mapping in turn allows for applications involving richer semantics.

In this thesis we concentrate on the following problem: given initial match (correspondence) information produced by current schema matching techniques, how can we construct a complex, semantically richer mapping that can be used across data models? Specifically, we aim at detecting the relationship types 'Has-a', 'Is-a', 'Associates' and 'Equivalent'. Technically, we achieve this goal in three main steps: (1) exploiting various types of semantic evidence for possible matches; (2) finding a globally optimal match assignment; (3) identifying the relationship embedded in the selected matches. We implemented our semantic matching approach within a prototype system, SeMap, and tested its accuracy and effectiveness.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Organization
2 Related Work
2.1 Relationship Classification
2.1.1 Equivalence Relationships
2.1.2 Set-Theoretic Relationships
2.1.3 Generic Relationships
2.2 Schema Matching Techniques
2.2.1 Rule-Based Solutions
2.2.2 Learning-Based Solutions
2.3 Ontology Alignment Techniques
2.4 Sample Prototypes
2.4.1 Rondo
2.4.2 Cupid
2.4.3 COMA
2.4.4 iMAP
3 Problem Formulation
3.1 Representation
3.2 Problem Statement
3.3 Semantic Resources
3.3.1 Internal Resources
3.3.2 External Resources
3.4 Approach Overview
3.4.1 Schema Matcher
3.4.2 Match Selector
3.4.3 Mapping Assembler
4 SeMap System
4.1 Schema Matcher
4.1.1 Base Matcher
4.1.2 Similarity Score and Lineage Information
4.1.3 Element-Level Matcher
4.1.4 Structure-Level Matcher
4.1.5 Architecture of Schema Matcher
4.2 Match Selector
4.2.1 Representation
4.2.2 Bidirectional search
4.2.3 Modeling user interaction
4.3 Mapping Assembler
4.3.1 Combining Map_s and Map_t
4.3.2 Identifying relationships
4.3.3 Assembling mapping
5 Experimental Analysis
5.1 Experimental Setting
5.1.1 Data Set
5.1.2 Expert Mapping
5.1.3 Evaluation Metrics
5.1.4 Experimental Methodology
5.2 Experimental Result
5.2.1 Matching Accuracy
5.2.2 Component Contribution
5.2.3 Incorporating User Feedback
5.2.4 Discussion
6 Conclusion & Future Work
Bibliography

List of Tables

5.1 Characteristics of the input schemas
5.2 Characteristics of the expert mappings

List of Figures

1.1 An example of input schemas and output mapping
2.1 A classification of current schema matching techniques. Courtesy of [22]
3.1 Representation of a model. The left plot shows a graphical representation of a model, comprised of nodes (elements) and edges (relationships). The right table shows the tuple representation of edges
3.2 Illustration of the four relationship types handled by SeMap
3.3 An example of a complex mapping handled by SeMap
3.4 Illustration of the matching process
3.5 The basic system architecture of SeMap. It takes two models and external resources as input, and produces a generic semantic mapping. It consists of three main parts: the schema matcher, the match selector and the mapping assembler
4.1 Architecture of the schema matcher. It consists of three layers: the base matchers, the combining layer and the structure matcher
4.2 Partial match assignments from the perspectives of the source and target schemas respectively
4.3 Mapping assembling for matches of different types. Each 1-1 equivalence match corresponds to one mapping element, while each element of a complex match is associated with one mapping element
5.1 Matching accuracy of SeMap. The three plots show the recall, precision and F-measure of the matching results for the three relationship types Equivalent, Has-a, Is-a and total correct matches respectively
5.2 Error analysis of the resulting mappings
5.3 The precision of SeMap after pruning incorrect matches. The bars from left to right show the matching results for the three relationship types Equivalent, Has-a, Is-a respectively
5.4 Relative contribution of different types of semantic evidence to the matching results of SeMap. The two plots (from top to bottom) show the F-measure of identified matches (correspondences) and identified relationships respectively
5.5 F-measure of correct correspondences versus the amount of user interaction (percentage of expert matches provided over the total number of matches). The curves for four datasets (Real Estate 1/2, Course Info 1/2) are shown

Acknowledgements

I would like to express my gratitude to all those who have offered me help in completing this thesis. Especially, I owe the greatest thanks to my supervisor, Rachel Pottinger, who provided me with excellent guidance and support throughout this thesis project. I want to thank Dr. Tsiknis for giving me insightful comments on this work and for being my second reader. I would also like to thank all the members of the database management lab, especially Jian Xu, for their constructive suggestions. Without their help, this work would not have been possible. Finally, I thank all my friends at the University of British Columbia. It has been a wonderful experience to grow up with them.

Chapter 1

Introduction

1.1 Motivation

Spurred by the growth of data sources on the web, information systems are witnessing a paradigm shift from monolithic databases to heterogeneous, interacting data sources. The fundamental problem in sharing data from multiple sources is dealing with the semantic heterogeneity inherent in their autonomous nature, and the key is identifying the semantic correspondences between them.
The operation of finding such correspondences is called Match; it takes two schemas as input and produces a semantic mapping specifying the relationships between elements of the two schemas. Such semantic mappings play a crucial role in numerous data sharing applications, including web data integration, schema evolution and migration, component-based development, etc.

Currently, the creation of semantic mappings, especially complex ones, is still mostly done manually, possibly supported by a graphical user interface. Manually creating semantic mappings is a tedious, error-prone process, and the labor grows linearly with the number of matches to be performed. Hence the rapidly increasing number of web data sources necessitates automatic support for schema matching.

The problem of semi-automatically creating mappings has attracted intensive research in both the database and AI communities [2, 4, 10, 15, 28]. The procedure is comprised of two phases: schema matching and mapping construction. In schema matching, equivalence correspondences between elements of both schemas are identified. The equivalence correspondences can be one-to-one (1-1) matches, e.g., 'class' corresponds to 'course', or complex matches containing more than one element in each schema, e.g., 'TA' maps to some combination of 'grad TA' and 'ugrad TA'. Note that the focus of schema matching is to find such potential correspondences, rather than giving a final mapping to the users. Finding this mapping is done in mapping construction, where the identified correspondences are built on by adding more specific semantic information to generate a semantically rich mapping.

Figure 1.1: An example of input schemas (schema S, schema T) and output mapping (Map S_T).

As a typical example of mapping construction, Clio [32] includes a set of user-interaction techniques to create SQL-style mappings based on the output of an initial schema match. Such semantic mappings are necessary to transform data. Clio, however, like most other previous work on mapping construction, is restricted to relational and XML-style schemas; it does not capture the general richness of the possible relationships between elements in a data-model-independent fashion. Thus, although many common relationship types exist across SQL and XML (e.g., specialization), this work cannot be used to create the XML-style mapping. Data sources on the web, however, come in various data models, e.g., XML, HTML, RDF, ontologies, text, etc. Hence exploring how to create richer, general relationships between schema elements, rather than concentrating on the specific data model under consideration, allows us to understand the general space of possibilities. It also allows better reuse of ideas, since one does not have to create a separate algorithm for each ensuing data model. After a mapping with such general relationships is constructed, the transformations into a specific data model can be made more concretely. For example, the mapping can easily be transformed into specific forms, e.g., SQL views or XSLT transformations, thus eliminating the need to maintain specific mappings separately. Also, a generic mapping can create a uniform interface between domain knowledge (ontologies) and web interfaces (database schemas), which is helpful for semantic web applications.
Furthermore, it can be fed into a model management system [17], which aims to solve meta-data problems in a data-model-neutral fashion, or used for knowledge inference when applied to the ontology domain.

An example of a generic semantic mapping is shown in Figure 1.1, where the two schemas S and T represent the concepts of 'class' and 'course' respectively. A generic mapping S_T is constructed, specifying a rich collection of semantic relationships between the elements of S and T, e.g., 'college' of T 'Has-a' 'dept' of S, while 'instructor' of S 'Is-a' 'faculty' of T. The relationship types adopted in this thesis follow the relationship classification of [21]. Compared with the equivalence relationships (1-1 or complex) considered in the previous literature, this relationship classification is semantically richer and more expressive. Equipped with such generic mappings, one can envision a number of applications. For example, one problem facing current semantic web applications is the lack of domain-specific knowledge (e.g., ontologies). If domain knowledge in different representations can be mutually converted, the collection of knowledge will be significantly enriched.

1.2 Contribution

In this thesis we explore constructing such generic semantic mappings, based on initial match information that shows correspondences between the elements of both schemas. This initial match information can be produced by current schema matching techniques.

Mapping construction takes as input a set of initial matches produced by a set of schema matching algorithms, and generates a semantically richer mapping, such as the one in Figure 1.1, which describes complex relationships between elements of both schemas. Specifically, mapping construction is responsible for searching for a globally optimal match assignment from the pool of possible assignments, solving the conflicts among the selected matches, and identifying the complex relationships between the schema elements, e.g., the 'Has-a' relationships in Figure 1.1. However, constructing a generic semantic mapping is fundamentally difficult for several reasons:

• Finding correspondences with generic semantic relationships is substantially harder than finding simple equivalence, since the space of possibilities under consideration is much larger and more semantic evidence is needed;

• The pool of initial matches is possibly quite large. The search space is already large enough when considering n:1 equivalence matches to make most matching algorithms consider only 1:1 matches; when relationships other than simple equivalence are considered, it is infeasible to try all possible combinations to find the optimal assignment;

• Various semantic constraints can be imposed, rendering match selection a complicated constrained optimization problem;

• Identifying the relationships implicit in matches is a hard problem, and one that is made more difficult by attempting to make our output data model independent.

As in schema matching, mapping construction inherently cannot be fully automatic. The importance of user feedback is recognized in schema matching research [4, 31]; however, no systematic modeling of user interaction for mapping construction is available to date. One of the goals of our work is to limit interaction to critical points to help focus user attention and minimize user effort.
Aiming at overcoming the problems listed above, in this thesis we describe a prototype system, SeMap, to create a generic, semantic mapping. We choose a graph-based representation that is similar to that used in model management [17], which is expressive enough to accommodate both schemas of many types and other meta-data, such as ontologies. Specifically, we make the following contributions:

• An architecture for semi-automatically constructing generic semantic mappings based on initial correspondence information;

• A novel probabilistic framework that incorporates match uncertainty and semantic constraints in a uniform way, and reduces match selection to a mathematical optimization problem;

• Effective modeling of user interaction to help focus user attention and minimize user effort, by detecting critical points where feedback is maximally useful;

• An effective solution for extracting the implicit relationship of an initial match based on various types of semantic evidence;

• A prototype system embodying the innovations above, and a set of experiments to illustrate the correctness and effectiveness of our approach.

1.3 Organization

This thesis is a specification of our schema matching system SeMap. The goal is to present the technical details of implementing the system. Specifically, we intend to make clear the following three aspects:

1. The formulation of the problem, including the exact representation of the input/output of the system, the resources we use and the assumptions we have made;

2. The specification of the system, including the system architecture, the exact input/output and interior structure of each component, and their interaction;

3. The experimental analysis, including the datasets we use, the metrics we use to evaluate our approach, the experimental results and their explanation.

The remainder of the thesis is organized as follows: Chapter 2 presents a survey of related work; Chapter 3 formally defines the problem of mapping construction and gives an overview of the architecture of our system; Chapter 4 describes our mapping construction approach in more detail; Chapter 5 presents the experimental analysis of our approach; and Chapter 6 concludes this thesis and presents future work.

Chapter 2

Related Work

Semi-automatically creating semantic mappings has attracted intensive research in both the database (schema matching) and AI (ontology alignment) communities. The key differences and similarities between schema matching and ontology alignment include:

• Differences. Ontologies are logical systems which obey some formal semantics, i.e., they can be interpreted as a set of logical axioms; database schemas, however, often provide no explicit semantics for their data.

• Similarities. Schemas and ontologies are quite similar in the sense that (1) they both provide a vocabulary of terms that describes a domain of interest and (2) they both constrain the meaning of terms used in the vocabulary [30].

Due to their differences, schema matching is usually performed with techniques that guess the semantics implicit in the schemas, while ontology alignment is designed to exploit the knowledge explicitly encoded in the ontologies. Their similarities, however, make the solutions to these two problems mutually beneficial. In the following, we discuss the problems of schema matching and ontology alignment as a whole.
In this chapter, we present a survey of related work in three parts: first, in Section 2.1, we classify the current schema matching/ontology alignment techniques based on the relationships they can handle; we then discuss some typical techniques used in these approaches, specifically schema matching in Section 2.2 and ontology alignment in Section 2.3; finally, we present several example prototype matching systems in Section 2.4.

2.1 Relationship Classification

The relationship types created by matching techniques can be roughly divided into three categories: equivalence relationships, set-theoretic relationships and generic relationships. Specifically, two schema elements having the equivalence relationship means they are semantically equivalent; the techniques to identify equivalence relationships are described in Section 2.1.1. The set-theoretic relationship classification regards each schema element as a set, and specifies their relationship as one of equivalence, subsumption, intersection, disjointness and incompatibility; it is discussed in Section 2.1.2. The generic relationships refer to those non-equivalence relationships, such as the Has-a and Is-a relationships discussed in this thesis. Two typical classifications of generic relationships can be found in ontology modelling [18] and meta-data management [21]. The techniques developed so far to handle generic relationships are presented in Section 2.1.3.

2.1.1 Equivalence Relationships

With the main goal of data transformation in specific data models, most schema matching/ontology alignment algorithms to date aim at discovering equivalence relationships [2, 3, 4, 10, 13, 14, 16, 31]. The found equivalence correspondence can be a 1-to-1 match (e.g., 'course' = 'class'), or a complex match (e.g., 'name' = concat('first-name' + 'last-name')). Creating multi-arity (1-to-n or even n-to-m) matches is significantly harder than creating 1-to-1 matches for several reasons: (1) while the number of candidate matches is bounded for 1-to-1 matches (the product of the sizes of the two schemas), the number of match candidates to be considered in the complex case is much larger; (2) it is inherently difficult to generate a match to start with in the case of multi-arity matches; that is, in the case of an n-to-m match, it is difficult to determine n and m in order to generate a set of candidate matches. Hence, to date, most of the work on schema matching has focused on discovering 1-to-1 equivalence correspondences between schema elements [3, 4, 10, 13, 14, 16, 31]. R. Dhamankar et al. [2] proposed iMAP, a prototype for identifying 1-to-n correspondence matches, which reformulates schema matching as a search in an often very large match space. To search effectively, it employs a set of searchers, each discovering specific types of complex matches.

However, while attempting to discover semantically equivalent correspondences, it is possible that the matches identified by these techniques are not exactly equivalence relationships; they may instead be the semantically richer relationships we are endeavoring to find, such as the relationship between 'TA' and {'grad TA', 'ugrad TA'} shown in Figure 1.1.

2.1.2 Set-Theoretic Relationships

The equivalence relationship can be considered a special case of the set-theoretic relationships, which can specify the relative containment relationship between two sets.
In [26], an effective solution is proposed to identify inter-set relationships by bidirectionally comparing the containment of data instances and meta-data of different schema elements. The problem with this approach is that the data instances associated with the two schemas should be in the same universe, otherwise the comparison of containment relationships is not meaningful. However, in many applications, especially web data integration, the data sources do not overlap.

2.1.3 Generic Relationships

There has been very little work on finding generic relationships between schema elements. The solution proposed by D. Embley et al. [5] relies heavily on a domain-specific ontology to find the relationships of Merge/Split (e.g., 'Address' consists of 'Street', 'City' and 'State'), Superset/Subset (e.g., 'Phone' contains both 'Phone_day' and 'Phone_evening'), and Set-Name as Value (e.g., the attribute 'Water-front' in one schema appears as a value of the attribute 'House-description' in the other schema). The basic idea is to first map the schema elements to a comprehensive domain-specific ontology; the relationships between schema elements can then be determined by those of their counterparts in the ontology. This approach requires (1) a comprehensive ontology that covers all possible concepts that may appear in schemas in that domain, and (2) a domain-specific thesaurus that can map schema elements to their alternative representations in the ontology. Such an ontology and thesaurus are usually hard to obtain in real scenarios. Our work has fairly simple requirements for the needed semantic information, which is available in most schemas, and does not assume any comprehensive ontology. Nevertheless, the existence of such an ontology can improve the quality of the matching results of our system.

F. Giunchiglia et al. [7] proposed the concept of semantic matching, a purely schema-based approach. The basic idea is to first populate each element name with its meanings in some domain-specific dictionaries, and then compute the specialization relationship of schema elements based on the containment relationship of their meanings. Their approach, however, works only for identifying the Is-a relationship and for tree-structured schemas.

2.2 Schema Matching Techniques

The research on schema matching/ontology alignment provides a wealth of techniques to semi-automatically find semantic matches. The techniques can be classified by the information they exploit [22], as shown in Figure 2.1: the matches can be found by exploiting one type of semantic evidence (schema-level, data instance-level, etc.), or by combining multiple types of evidence (i.e., hybrid matchers, which integrate multiple matching criteria, and composite matchers, which combine the results of independently executed matchers [22]). Matching techniques can also be classified by their methodologies into rule-based and learning-based solutions, which are discussed in Sections 2.2.1 and 2.2.2 respectively.

Figure 2.1: A classification of current schema matching techniques: individual matchers (schema-level vs. instance-level; element-level vs. structure-level, e.g., name similarity, type similarity, graph matching, value patterns, data distribution, frequent terms) and combining matchers (hybrid matchers and manual or automatic combination of individual matchers). Courtesy of [22].
2.2.1 Rule-Based Solutions

Rule-based matching techniques [7, 14, 16] constitute a rich collection of schema matching solutions, which have been used in both early and current matching applications. Rule-based techniques discover similar schema elements by exploiting schema-level information using hand-crafted rules. A broad variety of rules have been devised to exploit all possible information, including element name (label), data types, structures, number of subelements, and integrity constraints. For example, F. Giunchiglia et al. [7] proposed to exploit the semantic meanings of element names to discover similar elements; Cupid [14] employs rules that categorize elements based on names, data types and domains; Similarity Flooding measures pairwise similarity by propagating similarity from some fixed points according to the schema structures.

The rule-based techniques have some desirable features: (1) they are usually computationally inexpensive and require no training process as in learning-based approaches; (2) they usually require only schema-level information, which is available in most matching scenarios; (3) if some domain knowledge is available, one can specify domain-specific rules, which can work very well in certain types of applications. For example, users can write regular expressions that encode times or phone numbers, or quickly compile a collection of zip codes to help recognize these types of entities. The learning methods can hardly deal with these scenarios: they either cannot learn such complex rules, or they require a large amount of training data with the correct representation for the desired result, which is usually hard to obtain.

However, the rule-based techniques have several drawbacks. (1) They cannot effectively exploit data-instance-level information, even though data instances provide valuable information, e.g., precise data format, data distribution, statistical values, etc. It is possible in some cases that the schema-level information is opaque or very difficult to interpret, e.g., element names like A or B1 are too abstract to be interpreted. In contrast, learning methods such as Naive Bayes can easily construct probabilistic rules that find similarity in such scenarios, based on the distribution of data instances [11]. (2) Moreover, rule-based techniques cannot exploit previous matching results to improve the current matching process. Hence, in a matching application for a specific domain, the rule-based techniques are usually insufficient.

2.2.2 Learning-Based Solutions

Motivated by the drawbacks of rule-based matching methods, a collection of learning-based solutions has been proposed. These methods have considered a variety of learning techniques, and exploited both schema-level and data instance-level information. For example, Doan et al. proposed the LSD system, which employs the Naive Bayes learning method over data instances and also exploits the structure information of the XML data format; the iMAP system [2] pays attention to the description of elements, in addition to other schema information.

In developing learning techniques for schema matching, it has been realized that considering only schema-level or data instance-level evidence in the schemas being matched is often insufficient for more accurate matching. Hence, several types of external resources have been considered to improve the matching quality.
For example, assuming a domain-specific ontology is available, one technique is to first map the schemas/ontologies into that ontology, and then construct the matches based on the relationships inherent in the ontology [5]. For example, it is hard to identify the relationship between 'direct' and 'free toll' by using regular approaches such as string comparison. However, by mapping them to a domain-specific ontology, one can find that they are both specializations of the concept 'phone', so that it can be concluded that 'direct' is highly similar to 'free toll'.

Some recent work advocates exploiting past matching results to improve current ones [3, 4], with the basic idea of learning from past matches to predict unseen matching scenarios. An alternative solution considers learning from a corpus of schemas and matches [14]. Such a corpus provides alternative representations of concepts in the domain, i.e., it functions in the same way as an ontology, and thus can be leveraged to discover similarity between schema elements. However, it is not always practical to have such external resources, particularly since these resources must be domain-specific to be effective.

2.3 Ontology Alignment Techniques

Ontology alignment deals with finding corresponding concepts in different ontologies. In this section, we present some typical work on ontology alignment; for a comprehensive survey we refer the reader to [10].

OntoMorph [1] focused on the problem of translating symbolically represented knowledge between different knowledge representations. It used a description-logic-based approach, and offers syntactic rewriting to support the translation between two different knowledge languages, and semantic rewriting to support inference-based transformation. OntoMorph requires users to provide transformation rules, and thus can be regarded as a type of rule-based technique.

Prompt [20] proposed an ontology alignment mechanism that finds corresponding concepts by refining an initial mapping (pairs of anchors) given by users or by some simple linguistic matching approaches. Specifically, it analyzes the paths in sub-graphs limited by the anchors and determines which concepts frequently appear in similar positions on similar paths. The philosophy followed by Prompt is similar to that of Similarity Flooding [16].

FCA-Merge [28] is an example of an alignment technique that depends on external resources. The resources used in FCA-Merge are domain-specific documents which cover the concepts in the ontologies. Through natural language analysis techniques, it generates a formal context for each document, which tells which documents contain which concepts. Based on these formal contexts, the Is-a relationships between concepts are inferred. However, since the formal context is built upon the generalization/specialization hierarchy of the concepts, this approach cannot be extended to other relationships, such as Has-a. Moreover, the requirement of domain-specific documents is not always feasible.

MAFRA [15] proposed a framework for sharing distributed ontologies via mapping. A multi-strategy process is employed to calculate the similarities between ontology entities, including lexical similarity and property similarity (attributes or relations). Both top-down and bottom-up similarity propagation are employed. This can be considered a counterpart of the hybrid matching techniques in schema matching.
To the best of our knowledge, though ontologies themselves can have complex relationships, e.g., Has-a or Is-a, the focus of most previous work on ontology alignment is finding semantically equivalent concepts, or one specific type of relationship (e.g., Is-a in FCA-Merge), in different ontologies [10], rather than discovering corresponding concepts with more types of generic relationships; the rich relationships in ontologies are used only as one type of semantic evidence.

2.4 Sample Prototypes

In this section, we consider some recent prototypes of schema matching systems.

2.4.1 Rondo

Rondo [17] is a complete prototype of a generic model-management system, in which high-level operators are used to manipulate models and mappings between models. As one of its main operators, match is implemented using the Similarity Flooding (SF) algorithm [16]. SF uses a hybrid matching approach based on the idea of similarity propagation. It starts from a string-based comparison (e.g., common prefixes and suffixes) of the schema element names to get an initial mapping, which is further refined using a fix-point computation. The matching process is well formulated as a mathematical optimization problem in SF.

2.4.2 Cupid

Cupid [14] implements a hybrid matching algorithm that analyzes syntactic information of elements (e.g., string prefixes and suffixes) and structure information of schemas (e.g., tree matching weighted by leaves). Moreover, it exploits external resources, i.e., a pre-compiled thesaurus.

2.4.3 COMA

COMA [3] is a composite schema matching system. It provides a matcher library composed of different matching algorithms. Its framework allows the combination of partial results. The matcher library can be extended by adding new matching algorithms. Specifically, it contains 6 elementary matchers, 5 hybrid matchers and one reuse-oriented matcher. Compared with Cupid, the reuse-oriented matcher is a novel algorithm which tries to leverage previously obtained results for new schemas.

2.4.4 iMAP

iMAP [2] is a matching system that considers 1-to-n equivalence matches. The authors regard the problem of matching as a search in a usually infinite match space. The overall goal is achieved in three steps: (1) a set of basic matchers, called searchers, is employed to detect similar elements according to different criteria (e.g., linguistic similarity, numerical equivalence, etc.); specifically, for each element in the target schema, a set of similar elements is found in the source schema by the searchers, including 1-to-1 and n-to-1 matches; (2) the match candidates generated in the first step are evaluated by a similarity evaluator module, and the result is a similarity matrix which indicates the similarity between the target element and the different match candidates; (3) a match selector module selects the best match candidate as the final result. iMAP also provides an explanation module which can provide explanations for the generated matches, e.g., the reason the match is selected, the implicit equivalence relationship, etc.

To the best of our knowledge, most previous work on schema matching focuses on one-to-one equivalence relationships in finding semantic correspondences between two schemas. Little work has been done on identifying multiple types of complex relationships.
In the following chapters, we present SeMap, a prototype schema matching system designed to find generic semantic correspondences.

Chapter 3

Problem Formulation

As discussed in the related work (Chapter 2), most work on schema matching so far focuses on finding one-to-one equivalence relationships between schema elements. The overall goal of our schema matching system, SeMap, is to identify a generic semantic mapping between two schemas. Here "generic semantic mapping" means (1) the matches may be non-one-to-one, e.g., one element is mapped to multiple elements of the other schema, a.k.a. 1-to-n matches; (2) the relationship types may be non-equivalence, e.g., Has-a, Is-a, etc., as classified in the Vanilla meta-meta model [21].

An example of a generic semantic mapping is shown in Figure 1.1, where two schemas represent the concept of 'class'/'course' in different ways. The mapping contains complex correspondences, such as 'TA' of schema S being mapped to 'undergrad TA' and 'grad TA' of schema T. Instead of the equivalence relation considered in most schema matching approaches, the relationship types involved are also complex, e.g., the 'department' of schema S is considered a member of the 'college' of schema T.

Figure 3.1: Representation of a model. The left plot shows a graphical representation of a model, comprised of nodes (elements) and edges (relationships). The right table shows the tuple representation of the edges, e.g., <'course', Associates, 'faculty'>, <'course', Has-a, 'ugrad TA'>, <'course', Has-a, 'grad TA'>.

3.1 Representation

In this thesis we consider how to form a generic semantic mapping. Because we are attempting to solve this problem in a data-model-neutral fashion that could be applied equally well to relational or XML schemas or an ontology, we adopt the terminology from Model Management [17], and say that we take as input two models. (In what follows, we use schema and model interchangeably.)

A model is a complex design artifact, such as a relational schema, XML schema, XML DTD, or an ontology. Technically, a model can be represented as a directed labelled graph (V, E). Specifically, V is the set of nodes, each denoting an element of the schema, e.g., attributes in a relational database table, type definitions in XML schemas, clauses of an SQL statement, etc. E is the set of binary, directed, typed edges over V. Formally, each edge is a tuple <s, p, o>, where s is the source node, p is the type of the edge, and o is the target node; p denotes the relationship between s and o. (The notation <s, p, o> follows the notation used in ontologies.) An example of model representation is depicted in Figure 3.1, which illustrates the concept of 'course'.

As indicated in [19, 24], in addition to the Equivalent relationship, the concepts of generalization/specialization and part-of/whole have long been recognized as ubiquitous and essential mechanisms in object-oriented modeling techniques, which have a large scope of applications, such as CAD, manufacturing, software development and computer graphics. In this thesis, we follow the relationship classification in the Vanilla meta-meta model [21], which embeds the concepts of generalization/specialization and part-of/whole.
Specifically, in the Vanilla meta-meta model there are five relationship types, namely Associates, Is-a, Has-a, Contains, and Type-of. In this thesis, we concentrate on the first four: Associates, Is-a, Has-a and Contains, where Is-a represents the concept of generalization/specialization, Contains and Has-a represent the concept of part/whole, and Associates represents all other weak semantic relationships.

Strictly speaking, though both Has-a and Contains embed the concept of part-of/whole, they differ in semantics. As indicated in [19], part-of relationships can be categorized in two dimensions, that is, (1) the degree of sharing of parts among whole objects and (2) the degree of dependence between some part objects and some whole object(s). Contains and Has-a differ in the second dimension, in that part objects are highly dependent on whole object(s) in Contains, while this dependence is not so strong in Has-a. This difference brings the rule that in a Contains relationship, the containee is a part of its container element and cannot exist on its own (delete propagation). Moreover, Contains is a transitive relationship and must be acyclic, while Has-a is weaker than Contains in that it does not propagate deletion and can be cyclic. Since we focus on the high-level part-of/whole relationship, we treat Has-a and Contains as the same in our framework.

In addition, we also consider the equivalence relationship, which is the main focus of previous schema matching approaches. So, in total, our framework considers four relationship types: Equivalent, Has-a, Is-a, and Associates. Their formal definitions are specified as follows, and their graphical representation is shown in Figure 3.2:

• Equivalent: E(x, y) means that x is semantically equivalent to y. This is a symmetric relationship type, i.e., E(x, y) implies E(y, x);

• Has-a: H(x, y) indicates that x has a sub-component/member y. This is an asymmetric relationship, i.e., H(x, y) does not imply H(y, x);

• Is-a: I(x, y) means that x is a specialization of y. This is an asymmetric relationship;

• Associates: A(x, y) indicates that x is associated with y. It is the weakest relationship that can be expressed; it has no constraints or special semantics. This is a symmetric relationship type.

Figure 3.2: Illustration of the four relationship types handled by SeMap: Associates, Equivalent, Has-a, Is-a.

This representation is complex enough to capture many of the semantic relationships that appear in models, and yet simple enough for a reasonable initial foray into the problem.

A mapping Map_S_T is a formal description of the semantic relationships between two schemas, S and T. A mapping is itself a model consisting of a set of mapping elements E and a set of relationships R on E. The elements of the two schemas are related through the mapping elements. Each mapping element e ∈ E is like any other element in schemas S and T. In addition to being the origin or destination of any kind of relationship found in a model, i.e., R, each e ∈ E can be the origin of one or more mapping relationships M(e, s), where s ∈ S ∪ T, which specify that the origin element e corresponds to the destination element s. The semantics of a mapping relationship is that for all s1, s2 ∈ S ∪ T such that M(e, s1) and M(e, s2), s1 corresponds to s2.
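As an illustration of the graph representation and the four relationship types above, the following minimal Python sketch encodes a model as a set of nodes and typed edge tuples <s, p, o>. The class and field names are ours, chosen for exposition; they are not taken from SeMap's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set

class RelType(Enum):
    """The four relationship types considered in this thesis."""
    EQUIVALENT = "Equivalent"   # symmetric
    HAS_A      = "Has-a"        # asymmetric (part/whole)
    IS_A       = "Is-a"         # asymmetric (specialization)
    ASSOCIATES = "Associates"   # symmetric, weakest

@dataclass(frozen=True)
class Edge:
    """A directed, typed edge <s, p, o> between two model elements."""
    s: str        # source element
    p: RelType    # relationship type
    o: str        # target element

@dataclass
class Model:
    """A model (schema, DTD, ontology, ...) as a directed labelled graph."""
    nodes: Set[str] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)

    def add_edge(self, s: str, p: RelType, o: str) -> None:
        self.nodes.update((s, o))
        self.edges.add(Edge(s, p, o))

# The 'course' model of Figure 3.1, expressed as edge tuples.
course = Model()
course.add_edge("course", RelType.ASSOCIATES, "faculty")
course.add_edge("course", RelType.HAS_A, "ugrad TA")
course.add_edge("course", RelType.HAS_A, "grad TA")
```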
Given this rich mapping structure, generic semantic relationships, not just simple correspondences between the elements of S and T, can be expressed in the following way: two semantically equivalent elements are represented by one mapping element, while the relationship between two mapping elements indicates the relationship between their corresponding schema elements. For example, in Figure 3.3, the mapping element m1 corresponds to the elements 'class' and 'course', which represent the same concept; the relationship between m4 and m5 indicates that 'instructor' 'Is-a' 'faculty'.

Figure 3.3: An example of a complex mapping handled by SeMap.

3.2 Problem Statement

Given the definition of model and mapping, we are now ready to formally define the goal of SeMap: given two models, S and T, find the generic semantic relationships required to create the mapping S_T between S and T. There may be some optional inputs to the matching process, specifically (1) an initial mapping S_T' which provides an initial set of correspondences and needs to be refined by the process, and (2) external semantic resources r used by the matching process, e.g., domain-specific thesauri, ontologies, etc. The matching process is illustrated in Figure 3.4.

Figure 3.4: Illustration of the matching process. Schema S, schema T, an optional initial mapping S_T' and an external resource r are input to the matching operation, which produces the mapping S_T.

3.3 Semantic Resources

The semantic resources used by matching techniques can be categorized as internal resources, which are contained in the input schemas or their associated data instances, and external resources, which are semantic information not present in the schemas or data instances.

3.3.1 Internal Resources

The semantic resources of the input schemas include both element-level information, which refers to the information stored at each schema element (e.g., element name, data type, structure, etc.), and structure-level information, which refers to the information contained in the relationships between schema elements (e.g., relationship type, constraints, etc.). In Sections 3.3.1.1 and 3.3.1.2, we introduce the element-level and structure-level resources considered in our SeMap system respectively.

3.3.1.1 Element-Level Information

We consider the following element-level information:

• Element name (label). Each element name is of String type. The name (label) provides a first layer of semantic evidence about the possible meaning of the schema element.

• Element type. If an element contains data, it is usually associated with a type indicating the storage format of the data. Note that in many representations, the data type of a model element is considered a separate element, which is linked to the element itself by a Type-of relationship. In our system, we consider the data type an attribute of the model element, e.g., String is an attribute of the element 'professor' rather than a separate element. The element type can provide hints in the sense that similar schema elements usually have the same or compatible data types.
• Element description. This is a short description of the semantic meaning of the element, which usually contains more information than the element name alone. For web interfaces where only schema-level information is available, the element description is especially valuable in determining the exact semantics of the elements. For example, it is hard to tell the semantics of an element only by its name 'people' on a flight ticket booking website. However, with the help of its description, 'total passengers', one can conclude that 'people' stands for the overall number of tickets bought.

• Data instances. As discussed in Chapter 2, data instances can provide valuable information that cannot be found in schemas, e.g., precise data format, data distribution, statistical values, etc. Specifically, the data type of an element may not reflect exactly how its data is stored, which can only be found in the data instances. For example, the element 'phone' may be of an Integer type; however, looking at its data instances, one may notice that its exact format is 'xxx-xxx-xxxx', which is not reflected in its data type. Meanwhile, the distribution of data instances is also useful in identifying similar schema elements, especially when the element names are obscure, e.g., A1 and B2 [11].

3.3.1.2 Structure-Level Information

In addition to the element-level information discussed above, we also consider structure-level information. In our system SeMap, we mainly consider two types of structure-level evidence:

• Relationship type. Each edge between two schema elements is of a certain relationship type, which can be leveraged in the matching process. The basic intuition is that if two elements are semantically similar, the elements having the same relationship with them are also highly likely to be semantically related.

• Constraints. Each edge can have constraints, including (1) cardinality in a relational database table, e.g., 1-n, 1-1, etc., and (2) key properties of elements, e.g., unique, primary, etc.

3.3.2 External Resources

Previous work on matching techniques has shown that internal semantic evidence is usually insufficient for achieving high-quality matching results; additional external resources should be leveraged to improve the matching quality. In SeMap, we consider two types of external resources:

• Thesaurus. A thesaurus is a dictionary which provides the different representations of the same concept. The element names can first be populated with their synonyms, so that one has a better chance of finding similar elements. Specifically, SeMap uses WordNet as the thesaurus. WordNet is a comprehensive English lexical reference system, which organizes more than 60000 nouns, 11000 verbs, 6000 adjectives and 3000 adverbs into synonym sets (synsets). It is considered one of the most powerful tools for computational linguistics, and has been used in several matching applications [7].

• Ontology. Ontologies, especially domain-specific ontologies, are powerful tools for discovering similar elements, and even for identifying their implicit relationships. However, they are not always obtainable. The collection of ontologies we employ in SeMap is provided by OntoBuilder [6].

3.4 Approach Overview

In this section, we present an overview of our generic matching system SeMap. As an implementation of the match operator, SeMap takes as input two schemas (models) S and T, and produces their generic semantic mapping S_T. In addition, SeMap takes external semantic resources r as an additional input. In order to identify the generic semantic relationships between schema elements, SeMap not only has to identify the correspondences of complex relationships, but also to extract the implicit relationship types. Figure 3.5 shows the basic architecture of this mapping construction system.
SeMap implements this goal mainly in three phases. In the first phase, schema matching, candidate matches (correspondences) of generic semantic relationships are identified. Note that most previous work focuses on finding correspondences of Equivalent relationships, while in our work we also have to consider correspondences of the other relationship types (Has-a, Is-a and Associates).

3.4.1 Schema Matcher

Functionality: The schema matcher takes the two schemas as input and produces a set of candidate matches, each of the form e => e', where e and e' are single elements in either schema.

Tactics: Schema matching can be considered a search problem in a huge space, i.e., the number of candidate matches is quite large. There is an unbounded number of functions for combining attributes in a schema, and each one of them could be a candidate match. To search the space effectively, we employ a set of base matchers. Each base matcher considers a meaningful subset of the space, corresponding to a specific type of semantic information, e.g., element name, type, etc.

The generic semantic match can be either of the form 1-to-1 or of the form 1-to-n (one source element corresponds to multiple target elements). However, since the focus of our work is to identify the relationship between schema elements, rather than finding the exact transformation rules, we consider the case of a 1-to-n match as n 1-to-1 matches, and identify the relationship between the source element and each target element, which significantly simplifies the search complexity.

There are a number of different kinds of matchers that we can consider. In the current implementation of SeMap, the base matchers include the label matcher, the sense matcher, the type matcher, and the structure matcher. SeMap's architecture is flexible enough that it would be easy to extend it to include other base matchers. The partial results produced by the base matchers are combined to generate the similarity score and lineage information for each candidate match. Currently, the similarity score is calculated as a weighted sum of the similarity estimations of all base matchers, and the lineage information is the union of all the semantic evidence found by the base matchers.
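As a minimal sketch of this combination step: the score below is a weighted sum of the base matchers' scores (normalized so that it stays in [0, 1], which is our assumption) and the lineage is the union of their evidence. The matcher names, weights and evidence strings are purely illustrative, not SeMap's actual values.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

@dataclass
class BaseResult:
    """Output of one base matcher for a single candidate match: a score in
    [0, 1] plus the pieces of semantic evidence (lineage) it relied on."""
    score: float
    evidence: Set[str]

def combine(results: Dict[str, BaseResult],
            weights: Dict[str, float]) -> Tuple[float, Set[str]]:
    """Weighted sum of the base matchers' scores (normalized here, an
    assumption on our part) and union of their evidence sets."""
    total = sum(weights[name] for name in results) or 1.0
    score = sum(weights[name] * r.score for name, r in results.items()) / total
    lineage = set().union(*(r.evidence for r in results.values()))
    return score, lineage

# Hypothetical base matcher outputs for the candidate match 'TA' -> 'grad TA'.
score, lineage = combine(
    {"label": BaseResult(0.7, {"token overlap", "'TA' is a substring of 'grad TA'"}),
     "type":  BaseResult(1.0, {"equivalent data types (String)"}),
     "sense": BaseResult(0.6, {"related WordNet senses"})},
    weights={"label": 0.4, "type": 0.2, "sense": 0.4},
)
print(round(score, 2), lineage)
```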
3.4.2 Match Selector

The match selector is responsible for assigning each schema element to a match such that the assignment as a whole has the minimum uncertainty, and it outputs the two best match assignments, Map_s and Map_t, for the elements of S and T respectively. For example, as shown in Figure 3.3, from the perspective of the source schema, Map_s may be this set of matches: 'class' : 'course', 'professor' : 'faculty', 'dept' : 'college', 'instructor' : 'faculty', and 'TA' : 'grad TA', while Map_t may consist of this set of matches: 'course' : 'class', 'college' : 'dept', 'ugrad TA' : 'TA', 'grad TA' : 'TA' and 'faculty' : 'professor'. These two mappings are then merged to form the final mapping shown in Figure 3.3. A more detailed discussion is given in Section 4.2.2.

Functionality: The match selector takes as input the list of candidate matches associated with their similarity scores, and searches for the best global assignment from the set of candidate matches. Technically, from the pool of candidate matches, it selects a proper candidate match for each source/target element. Here 'proper' means that the candidate match has high probability (similarity score) and that it satisfies user-defined domain constraints to the maximum extent.

Notation: A constraint refers to the regularities imposed on the generated mapping. Roughly, in our system we consider two kinds of constraints. (1) Schema-independent constraints are those imposed by the meta-meta data model languages, which express general rules that each model needs to obey; for example, the Is-a relationship must be acyclic. (2) Schema-dependent constraints refer to those regularities imposed on the specific schemas and data of the sources in the domain.

Tactics: The functionality of the match selector is similar in spirit to the constraint handler described in iMAP [4], which applies a set of domain constraints to select a subset of candidate matches. However, we propose a novel statistical model that incorporates both match uncertainty and domain constraints in the same framework, and expresses match selection as a constrained optimization problem for which effective solutions are available. We believe that if user interaction is involved, the accuracy of prediction can be greatly improved. In SeMap, we apply the technique of active learning to identify the critical points in selecting the matches where user interaction is maximally useful, so that the user effort can be significantly reduced.
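The selection itself is treated in Section 4.2; as a deliberately simplified stand-in, the sketch below greedily keeps, for each source element, the highest-scoring candidate that does not violate a supplied constraint predicate. It only illustrates the interplay of similarity scores and domain constraints; it is not SeMap's probabilistic, bidirectional algorithm, and all names and data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A candidate match: (source element, target element, similarity score).
Match = Tuple[str, str, float]

def select_matches(candidates: List[Match],
                   violates: Callable[[Match, Dict[str, Match]], bool]) -> Dict[str, Match]:
    """Greedily keep, for each source element, the highest-scoring candidate
    that does not violate the supplied domain-constraint predicate."""
    chosen: Dict[str, Match] = {}
    for match in sorted(candidates, key=lambda c: c[2], reverse=True):
        source = match[0]
        if source not in chosen and not violates(match, chosen):
            chosen[source] = match
    return chosen

# Toy constraint: at most one source element may be assigned to 'college'.
def no_double_college(match: Match, chosen: Dict[str, Match]) -> bool:
    return match[1] == "college" and any(m[1] == "college" for m in chosen.values())

candidates = [("dept", "college", 0.8), ("class", "course", 0.9), ("class", "college", 0.3)]
print(select_matches(candidates, no_double_college))
```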
3.4.3 Mapping Assembler

The mapping assembler combines Map_s and Map_t, identifies the relationships embedded in the selected matches, and assembles them into a generic semantic mapping that includes richer semantic relationships. For example, by consulting the associated lineage information, specifically that the label of the element 'TA' is a substring of that of 'grad TA', the two elements are detected to be in an Is-a relationship. A detailed discussion of this part is given in Section 4.3.

Functionality: The best match assignments Map_s and Map_t are combined to form a final mapping. Then, for each match in the final mapping, the mapping assembler extracts the related semantic evidence from the lineage information to identify the implicit relationship type.

Tactics: The combination of the two match assignments Map_s and Map_t is achieved by ranking each match according to its contribution to the final mapping, that is, (1) the likelihood of this match and (2) the violation of domain constraints caused by including this match in the final mapping. The highest-ranked subset of matches is selected to form the final mapping. To identify the implicit relationship of each selected match, all types of semantic evidence generated by the schema matcher are considered. For each type of semantic evidence, a set of heuristic rules is specified. The results are then combined to vote for the final decision on the relationship type.

In this chapter, we formalized the problem of creating a generic semantic mapping, and gave an overview of the schema matching system SeMap. We proposed techniques that (1) identify the correspondences (candidate matches) of generic semantic relationships, (2) select the matches from the pool to form the final mapping, and (3) determine the relationship types implicit in the selected matches. In the next chapter, we present the SeMap system in detail.

Chapter 4

SeMap System

In this chapter, we present the technical details of our schema matching system SeMap. As shown in Chapter 3, the overall architecture of SeMap consists of three main parts, the schema matcher, the match selector, and the mapping assembler, which are responsible for finding correspondences (candidate matches), selecting a subset of candidate matches to form a mapping, and identifying the implicit relationships, respectively. The three parts are discussed in detail in Sections 4.1, 4.2 and 4.3 respectively. To illustrate how SeMap produces generic semantic matches, we show the matching process for the example in Figure 3.3 while introducing the components of our system.

4.1 Schema Matcher

One problem facing schema matching/ontology alignment is the lack of sufficient semantic evidence to discover matches, which is even more severe if one intends to identify generic relationships between schema elements. A key conclusion from previous research is that an effective schema matching tool requires a combination of base techniques, e.g., linguistic matching, structure matching, detecting overlap of data instances, etc. [22]. Hence we follow a composite approach in detecting potential correspondences.

4.1.1 Base Matcher

The components of a schema matcher are a set of pre-existing matchers that exploit any available information and incorporate base techniques in a uniform framework, such as COMA [3] and iMAP [2], as discussed in Chapter 2. In these frameworks, a set of base matching approaches (a.k.a. matchers) is organized as a matcher library. They discover initial correspondences based on different types of semantic evidence, e.g., schema level, data-instance level, corpus, and ontology.

Equipped with this set of base matchers, the problem of discovering initial matches can be modeled as a search in the possible space using various matchers, each of which exploits a meaningful subset of the space [2]. Based on the level of granularity on which matching is performed, base matchers can be classified as either element-level matchers or structure-level matchers. The former compute mapping elements between individual nodes, and the latter compute mapping elements between subgraphs. Named after the semantic resource they exploit, the base matchers we implement in SeMap include the sense matcher, label matcher, type matcher, ontology matcher, data instance matcher and structure matcher. Among them, the sense matcher, the label matcher, the type matcher, the ontology matcher and the data instance matcher operate on the element level, while the structure matcher operates on the structure level.

4.1.2 Similarity Score and Lineage Information

In addition to the set of initial matches (correspondences), we also expect the schema matcher to provide the following information: (1) a similarity score — each candidate match m is associated with a similarity score Sim(m) ∈ [0, 1], indicating the belief about its uncertainty, with 1 meaning perfectly certain; (2) lineage information — for each initial match, one records the flow of information in and out of the system, such as assumptions, domain knowledge, etc. From this lineage information, one can trace how the match was generated. Lineage information will be valuable in discovering the generic relationships.

The task of similarity evaluation has been discussed in previous work [2, 3, 4], which exploits various types of similarity information and employs learning, statistical or heuristic techniques. The module for recording lineage information is similar in spirit to the explanation module in [2], which provides explanations for user questions posed on the generated matches, such as explaining an existing match, an absent match, or a match ranking. However, our work focuses on distinguishing different semantic relationships. The implementation of similarity evaluation and lineage information recording in SeMap will be specified along with each base matcher.

4.1.3 Element-Level Matcher

Element-level matching techniques analyze the information at individual elements.
At the element level we exploit the techniques discussed in the literature [7, 15, 22]. However, these techniques cannot be applied directly; instead, they must be extended to support the similarity score and the lineage information.

4.1.3.1 Label Matcher

The label-based matcher finds semantically related elements by evaluating the syntactic similarity of their labels (names). Typically it will find 'Match' and 'match' to be similar names, but not 'match' and 'alignment'.

Before applying strict string comparison, we employ a number of standard natural language preprocessing procedures that can greatly improve the results of the comparison:

• Case normalization. It converts each alphabetic character in a string to its lower-case counterpart;
• Soundex elimination. Soundex is an encoding of names based on their pronunciation instead of their spelling, e.g., '4U' and 'for you';
• Digit suppression. It suppresses digits and leaves only characters; however, it needs to be used with care, since there are cases where digits are semantically meaningful, e.g., soundex codes or chemical names;
• Canonicalization. This is the procedure of converting names to their standard form by stemming and other techniques. It is important for dealing with symbols that have special prefixes/suffixes [22], e.g., 'cname' → 'customer name', and 'empno' → 'employee number';
• Tokenization. The label may be comprised of a set of tokens, e.g., 'french-course' is segmented into 'french' and 'course';
• Stopword elimination. It eliminates frequently-used words that can be found in a stopword list (e.g., 'to', 'a', ...).

The preprocessing steps take element labels as input and produce for each label a set of tokens whose conjunction carries its meaning. For example, the label 'grad-TA' is converted to the tokens 'graduate' and 'TA'.

The similarity of two strings can be defined in various ways, e.g., Hamming distance, substring similarity, N-gram distance, edit distance, etc. [10]. In our implementation, we adopt the N-gram distance, which has proved effective in information retrieval research and in systems such as Rondo [17]. The N-gram distance works as follows: let ngram(s, n) be the set of substrings of string s of length n; the N-gram distance of strings s_1 and s_2 is defined as

\[ \delta(s_1, s_2) = 1 - \frac{|\, ngram(s_1, n) \cap ngram(s_2, n) \,|}{n \cdot \min(|s_1|, |s_2|)} \]

The normalization guarantees that the N-gram distance lies within the range [0, 1]. Let T_1 and T_2 be the sets of tokens of the two elements respectively; the similarity score is defined as

\[ Sim(T_1, T_2) = 1 - \min_{t_1 \in T_1,\, t_2 \in T_2} \delta(t_1, t_2), \]

i.e., the similarity of two elements is determined by the minimum distance between any pair of their tokens.

The lineage information recorded by the label matcher includes (1) whether the token sets of the two elements are equivalent; (2) whether any tokens of the two labels overlap; (3) whether any tokens of the two elements are in a prefix/suffix or substring/string relationship. As an example, in Figure 3.3, the tokens of 'TA', 'grad TA' and 'ugrad TA' overlap, which is recorded as the lineage information of the matches 'TA'/'grad TA' and 'TA'/'ugrad TA'. From such semantic evidence, one can infer the possible relationships existing between the two elements, as discussed in Section 4.3.

4.1.3.2 Type Matcher

The type of a schema element carries information about its data type (e.g., string, integer, float), value domain (e.g., an integer in the range [1, 12]), and key characteristics (e.g., unique, primary, foreign).
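Before continuing with the type matcher, the label matcher of Section 4.1.3.1 can be illustrated with a minimal sketch. The function names, the choice of n = 2, and the exact-token shortcut for very short tokens are illustrative assumptions rather than the SeMap implementation; only the distance and scoring formulas follow the definitions above.

```python
def ngrams(s, n=2):
    """Set of substrings of s of length n, computed on the lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(t1, t2, n=2):
    """N-gram distance in [0, 1]; identical tokens are treated as distance 0."""
    if t1.lower() == t2.lower():
        return 0.0
    g1, g2 = ngrams(t1, n), ngrams(t2, n)
    if not g1 or not g2:                      # tokens shorter than n
        return 1.0
    return 1.0 - len(g1 & g2) / (n * min(len(t1), len(t2)))

def label_similarity(tokens1, tokens2, n=2):
    """Sim(T1, T2) = 1 - min over all token pairs of the N-gram distance."""
    return 1.0 - min(ngram_distance(t1, t2, n) for t1 in tokens1 for t2 in tokens2)

# Tokens as produced by the preprocessing step of Section 4.1.3.1.
print(label_similarity(['TA'], ['grad', 'TA']))       # 1.0: the token 'TA' appears in both labels
print(label_similarity(['match'], ['alignment']))     # 0.0: no bigram is shared
```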
The type matcher deter-mines the similarity of schema elements based on such semantic information. In our implementation, we consider six commonly used basic data types, namely String, Integer, Float, Date, Nodata, Enum. The first four data types are self-explanatory. The Nodata type means that the element has no asso-ciated data instance, which usually appears at non-leaf node in the schema, e.g., X M L tree. Enum is short for Enumerate, e.g., an element may have the instance from a set of color {'red', 'yellow', 'blue'}. Note that all these basic data types can be combined sequentially to from a complex data type, i.e., Composition data type. For example, the address format 'Street + P.O.Box + City' can be considered as a composition of String, Integer and String. In implementing the type matcher, we define the possible relationship between two data types as one of equivalent, compatible and incompatible. We determine the relationship between two types using the following heuristic rules: 42 Chapter 4. SeMap System • The same basic data types are equivalent; • A l l the basic data types are mutually incompatible, except that the Integer type is compatible with that of Float; • Two composite data types are compatible if one appears as a subse-quence of the other, e.g., {String, Integer, String} is compatible with {Float, String, Integer, String}, while {Integer, String, String} is not; • Two value domains are compatible if one is a subset of the other, e.g., positive integer number is compatible with positive float number. In SeMap, the similarity score produced by the type matcher is denned on three metrics, data type, value domain (if there is any), and key character-istics (if there is any). For equivalent data types, it produces 1, compatible 0.5, and incompatible 0. The lineage information recorded by the type matcher includes (1) if the data types of the two elements are equivalent, compatible or incompatible; (2) if the data types are composite, and the number of basic data types involved; (3) if the value domains of the two elements overlap; (3) if their key properties (if any) are the same. For example in Figure 3.3, the elements 'TA ' , 'grad TA' , 'ugrad TA' , 'professor', 'instructor' and 'faculty' all have the same type String, which means they are possibly related. While 'class id' and 'course no' not only have the same data type, but also the same key property since they are both unique in the two schemas. 43 Chapter 4. SeMap System 4 .1 .3 .3 S e n s e M a t c h e r It is common that the same concept has different semantically equivalent rep-resentations in different schemas. Such implicit similarity can hardly be de-tected by syntactic analysis techniques. For example, 'car' and 'automobile' are synonyms, and refer to the same concept; while 'book' and 'article' have the same hypernym 'publication', and they are considered similar. Such semantic-based matching is especially valuable for schemas with relatively flat structure, and other scarce affiliated syntactic information (e.g., data instances). Moreover, since we are attempting to deal with generic rela-tionships, this semantic matching technique provides important hints for detecting the embedded relationships. Exploiting synonyms and hypernyms however requires the use of thesauri or dictionaries. In the implementation of SeMap, we employ WordNet as the dictionary, which is an English lexical database. 
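Before turning to how WordNet is used, the heuristic rules of the type matcher (Section 4.1.3.2) can be sketched as follows. This is a minimal sketch: representing composite types as lists and the helper names are assumptions made for illustration, not SeMap's data structures; the rules and the 1 / 0.5 / 0 scores follow the description above.

```python
BASIC = {"String", "Integer", "Float", "Date", "Nodata", "Enum"}

def is_subsequence(shorter, longer):
    """True if `shorter` appears in `longer` in order, e.g. [String, Integer, String]
    inside [Float, String, Integer, String]."""
    it = iter(longer)
    return all(any(x == y for y in it) for x in shorter)

def type_relationship(t1, t2):
    """Return 'equivalent', 'compatible' or 'incompatible' following the heuristic rules."""
    t1 = [t1] if isinstance(t1, str) else list(t1)
    t2 = [t2] if isinstance(t2, str) else list(t2)
    if t1 == t2:
        return "equivalent"
    if len(t1) == 1 and len(t2) == 1:
        # basic types are mutually incompatible except Integer with Float
        return "compatible" if {t1[0], t2[0]} == {"Integer", "Float"} else "incompatible"
    # composite types: compatible if one is a subsequence of the other
    if is_subsequence(t1, t2) or is_subsequence(t2, t1):
        return "compatible"
    return "incompatible"

def type_score(t1, t2):
    """Similarity score produced by the type matcher: 1, 0.5 or 0."""
    return {"equivalent": 1.0, "compatible": 0.5, "incompatible": 0.0}[type_relationship(t1, t2)]

print(type_score("Integer", "Float"))                                     # 0.5
print(type_score(["String", "Integer", "String"],
                 ["Float", "String", "Integer", "String"]))               # 0.5
print(type_score(["Integer", "String", "String"],
                 ["Float", "String", "Integer", "String"]))               # 0.0
```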
In WordNet, the senses (atomic meanings of a word or expression) of words are organized into sets of synonyms (synsets), and synsets are in turn organized into hierarchical based on their semantic relationships. The basic idea of the sense matcher is to first populate the tokens of each element with their senses in WordNet. The senses of an element is thus the union of the senses of all its tokens. B y comparing their senses, one can evaluate the similarity of the corresponding two elements. The possible relationship between two synsets X and Y considered in our system include: • Holonym. Y is a holonym of X if X is a part of V ; 44 Chapter 4. SeMap System • Hypernym. V is a hypernym of X if X is a (kind of) Y; • Hyponym. X is a hyponym of Y if X is a (kind of) Y; • Meronym. X is a meronym of Y if X is a part of Y. We measure the similarity score of two elements in terms of the rela-tionship of their senses, as denned as follows: (1) 1 if there is at least one sense of the first label which is the same or a synonym of the second, i.e., they share the same synset; (2) 0.5 if there exists at least one sense of one label that has a sense of the other as a hypernym, a holonym, a hyponym or a meronym; (3) 0 if two labels share no sense in common. The lineage information records the number of relationships detected for each kind of holonym, hypernym, hyponym and meronym. For example, in Figure 3.3, 'course' and 'class' are synonyms; one sense of 'department' is a meronym of that of 'college'; while both 'professor' and 'instructor' have senses which are hyponyms of that of 'faculty'. Note that within the context of the schema, the token.may have only a small subset of its possible senses. The pruning of those fake senses depends on other matchers. This pruning will be explored in our future work. 4.1.3.4 Ontology Matcher Domain-specific ontology provides a powerful tool to identify related schema elements. The basic idea is to first map the schema elements to their counter-parts in the domain-specific ontology, (which may requires enhancing schema elements by populating with their semantically equivalent representations), 45 Chapter 4. SeMap System and then identify the relationships of schema elements by that between their counterparts in the ontology. 4 .1 .3 .5 D a t a - I n s t a n c e M a t c h e r Instance-level data provides important insight into the semantic meaning of the schema elements. This is especially valuable when the schema-level information is limited. Even when substantial schema information is avail-able, the use of data instance-level matching can also be used to uncover incorrect interpretations of schema information. Specifically, some useful matching techniques that can be applied to data instances include: (1) For text data type, information retrieval techniques, such as linguistic characterization evaluates the similarity of two schema elements by comparing the relative frequencies of words and combinations of words in their data instances; (2) For numerical data type, statistical characterization such as numerical value ranges and averages, can provide insight into the similarity of the corresponding schema elements. The main benefit of evaluating instances is a precise characterization of the actual contents of schema elements, which can moreover, help enhance those schema-level matchers. For example, the value ranges found by the data-instance matcher can improve the effectiveness of the type matcher. 
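As a concrete illustration of the sense matcher (Section 4.1.3.3), the sketch below uses the NLTK interface to WordNet. The thesis does not specify which WordNet binding SeMap uses, so the library choice and function names are assumptions; only the scoring rule (1 for a shared synset, 0.5 for a hypernym/hyponym/holonym/meronym link, 0 otherwise) comes from the text.

```python
# Requires the WordNet corpus: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def senses(tokens):
    """Senses of an element = union of the synsets of all its tokens."""
    return {s for t in tokens for s in wn.synsets(t)}

def related(s):
    """Hypernyms, hyponyms, holonyms and meronyms of a synset."""
    rel = set(s.hypernyms()) | set(s.hyponyms())
    rel |= set(s.part_holonyms()) | set(s.member_holonyms()) | set(s.substance_holonyms())
    rel |= set(s.part_meronyms()) | set(s.member_meronyms()) | set(s.substance_meronyms())
    return rel

def sense_score(tokens1, tokens2):
    s1, s2 = senses(tokens1), senses(tokens2)
    if s1 & s2:                              # share a synset: synonyms or identical words
        return 1.0
    if any(related(a) & s2 for a in s1):     # hypernym/hyponym/holonym/meronym link
        return 0.5
    return 0.0

print(sense_score(['course'], ['class']))        # 1.0: 'course' and 'class' share a synset
print(sense_score(['book'], ['publication']))    # 0.5: 'publication' is a hypernym of one sense of 'book'
```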
4.1.4 Structure-Level Matcher

While the element-level matchers provide background or context information for each pair of elements, one still needs to consider the matching problem within the picture of the whole schema, which is achieved by a structure-level matcher.

In structure-level matching techniques, each schema is viewed as a labeled graph representing schema elements and their inter-relationships, and the similarity comparison between a pair of nodes is based on their positions in their graphs and their relationships with neighboring nodes. The intuition behind a structure-level matcher is that if two schema elements from two schemas are highly related, their neighbors that have the same relationship to them might also be related to some extent. Matching graphs is a combinatorial problem that is computationally expensive; it can usually only be solved by approximate approaches.

The structure-level matcher employed in SeMap is Similarity Flooding [15]. It solves the optimization problem by a fixed-point algorithm. Starting from an initial mapping generated by other matching techniques, similarity flooding propagates the similarity of mapped elements to adjacent elements that have a similar relationship to the mapped elements. The algorithm terminates after a fixed point has been reached. The output of similarity flooding is a similarity score for each pair of schema elements from the two schemas. For example, in Figure 3.3, if the elements 'class' and 'course' are detected as highly similar, then the elements 'TA' (which has a Has-a relationship to 'class') and 'ugrad TA' (which has a Has-a relationship to 'course') are also likely to be similar. Note that the structure-level matcher is based only on the initial mapping information and the structure of the graph, and does not consider any other semantic information. Its output does not contain any lineage information.

Figure 4.1: Architecture of the schema matcher. It consists of three layers: base matchers, the combining layer and the structure matcher.

4.1.5 Architecture of Schema Matcher

Instead of applying the matchers discussed above individually, we exploit their interaction in order to achieve the best matching quality. The basic architecture of the schema matcher is illustrated in Figure 4.1. It mainly consists of three layers: the element-level matchers, the combining layer and the structure-level matcher.

The element-level matchers include the label matcher, the sense matcher, the type matcher, etc., which find the similarity between schema elements by exploiting their element-level information. Note that some element-level matchers need a preprocessing step to facilitate matching. For example, the preprocessing step converts the labels of elements into meaningful tokens, which can then be processed by the label matcher; the tokens of each element must first be populated with their senses in WordNet in order to be used by the sense matcher. The output of each element-level matcher is (1) a similarity score and (2) lineage information, as discussed in the previous section.

In the combining phase, the similarity scores produced by the element-level matchers are integrated to form a unified score. The combining scheme can take various forms, as long as it guarantees that the final result is well normalized, i.e., within the range [0, 1].
In the implementation of SeMap, we follow a simple linear combining scheme: each element-level matcher em is associated with a weight w_em, and the similarity score Sim_em produced by that matcher is scaled by that weight when computing the unified score. Formally, the unified similarity score is

\[ Sim = \sum_{em} w_{em} \, Sim_{em}, \qquad \text{where } \sum_{em} w_{em} = 1 \]

guarantees that the result is well normalized. This set of weight parameters must be tuned carefully in order to achieve the optimal result. The tuning of the parameters is problem-specific, i.e., specific to different input schemas, and is not the focus of this thesis; an automatic learning approach is described in [27].

As an example, in Figure 3.3, the elements 'instructor' and 'faculty' are detected as equivalent by the type matcher, similar by the sense matcher, and dissimilar by the label matcher. If all the matchers have the same weight, and equivalent, similar and dissimilar matches have similarity scores 1, 0.5 and 0 respectively, then 'instructor' and 'faculty' have the unified score (1 + 0.5 + 0)/3 = 0.5.

The unified similarity scores generated in the second phase are fed into the structure-level matcher as the initial mapping. The structure matching is then performed to produce the final similarity score for each pair of elements.

In conclusion, for each potential match the schema matcher produces (1) a similarity score, indicating the uncertainty about this match, and (2) lineage information, recording how this match was detected.

4.2 Match Selector

As shown in the SeMap architecture (Figure 3.5), the match selector is the second main part of the architecture; it is responsible for selecting a subset from the pool of initial matches produced by the schema matcher. It also contains a user interaction module which exploits user feedback to improve the mapping quality. In this section, we introduce a novel probabilistic framework that allows us to express match uncertainty and domain constraints in a uniform way, so that match selection can be transformed into an optimization problem, as shown in Section 4.2.2. Within this framework, we reduce the need for user interaction and focus user attention by identifying critical points where user feedback is maximally helpful, as discussed in Section 4.2.3.

Given the pool of initial matches and the associated similarity scores, the match selector searches for a globally optimal match assignment that satisfies a set of domain constraints. For example, in the case of Figure 1.1, the user may specify that each class has only one instructor; mapping 'instructor' to two elements would then violate this constraint. Most prior work studied this problem in the case of 1-1 correspondences: it either first applies the constraints to narrow down the pool of possible mappings and then transforms match selection into a stable marriage problem in a bipartite graph [16], or assumes that the initial matches are mutually independent and seeks a trade-off between uncertainty and constraint satisfaction [4].

As indicated above, the similarity scores represent our belief about the initial matches, and the semantic constraints represent our preference for certain matches, i.e., they are both probabilistic in nature. Hence it is natural to adopt a probabilistic model to express match uncertainty and constraints in a uniform way. In this thesis, we present such a framework, which in turn allows us to model match selection as an optimization problem.
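Before developing the selection model, the linear combining scheme of Section 4.1.5 can be made concrete with a short sketch. The matcher names and the uniform weights are illustrative placeholders, not SeMap's tuned values.

```python
# Weighted linear combination of element-level matcher scores (Section 4.1.5).
# The weights must sum to 1 so that the unified score stays in [0, 1].
WEIGHTS = {"label": 1 / 3, "type": 1 / 3, "sense": 1 / 3}

def combine(scores, weights=WEIGHTS):
    """Unified score Sim = sum over matchers em of w_em * Sim_em."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * scores.get(m, 0.0) for m, w in weights.items())

# 'instructor' vs. 'faculty': equivalent by type (1), similar by sense (0.5),
# dissimilar by label (0)  ->  unified score (1 + 0.5 + 0) / 3 = 0.5.
print(combine({"type": 1.0, "sense": 0.5, "label": 0.0}))
```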
In Section 4.2.1 we introduce the representation used by this probabilistic framework, and in Section 4.2.2 we discuss the match selection problem and its solution within this framework.

4.2.1 Representation

In this section we show how to incorporate match uncertainty and semantic constraints in a probabilistic framework. This approach is inspired by the work of [12]. As a research proposal, [12] indicates the need for a probabilistic model to express this uncertainty. Our work can be considered an implementation of this idea; moreover, we extend the framework to support domain constraints. Hence it is a contribution to both schema matching and mapping construction.

Formally, each schema element e is associated with a set of initial matches M_e, and can be assigned to a match m ∈ M_e. The probability of assigning e to match m* ∈ M_e is defined as

\[ P(e = m^*) = \frac{Sim(m^*)}{\sum_{m \in M_e} Sim(m)} \]

where Sim(m) is the similarity score of match m provided by the schema matcher. Intuitively, this represents the preference for matches with lower uncertainty. It is easy to verify that this model is well normalized:

\[ \sum_{m \in M_e} P(e = m) = 1 \]

Each mapping consists of a set of matches, some of which may violate the domain constraints specified by the users. To achieve the best quality mapping, we take the domain constraints into consideration when selecting matches. Specifically, we associate each constraint c with a penalty function Π_c(M), which checks whether a match assignment M violates c and returns the degree of violation. Describing Π_c(M) is beyond the scope of this thesis, but it must increase with the severity of the violation. Each constraint c is also assigned a weight α_c, indicating its strictness. The weights can be hard-coded or learned from known mappings [14]. Given a set of constraints C, the probability of a set of elements E taking the match assignment M (where e_i ∈ E is assigned to m_i ∈ M) is defined as

\[ P(E = M \mid C) = \frac{1}{Z} \Big( \prod_{e_i \in E} P(e_i = m_i) \Big) \exp\Big( \sum_{c \in C} -\alpha_c \, \Pi_c(M) \Big) \]

where Z is a normalization constant that guarantees P(E = M | C) ∈ [0, 1]; formally,

\[ Z = \sum_{M'} \prod_{e_i \in E} P(e_i = m'_i) \exp\Big( \sum_{c \in C} -\alpha_c \, \Pi_c(M') \Big), \]

with the sum ranging over all possible match assignments M'. In this equation, the total likelihood of the match assignment for E is captured by the product of the P(e_i = m_i), while the exponential part represents the penalty for violating the constraints. The more constraints are violated, the lower the probability.

4.2.2 Bidirectional search

We intend to express match selection as a well-defined mathematical problem, so that existing formal approaches can be used. Within the probabilistic framework introduced in Section 4.2.1, one can model match selection as an optimization problem, and a rich set of tools can then be applied. Since in this framework the joint probability of the selected matches measures the uncertainty of the mapping, for a set of schema elements E, match selection amounts to finding the match assignment M that maximizes the probability P(E = M | C) under the constraints C.

It is infeasible to consider all possible combinations from the domain of E to find the optimal match assignment: if each e ∈ E is associated with k initial matches, one potentially has to consider k^{|E|} combinations. It is highly likely, however, that E comprises a series of disjoint subsets E_1, E_2, ... which are mutually independent, i.e., in a graphical model, the graph of E consists of a set of separated sub-graphs.
Hence we can optimize these independent parts separately, that is,

\[ \max_{M} P(E = M \mid C) = \max_{M_1, M_2, \ldots} \prod_i P(E_i = M_i \mid C) \]

However, the size of each E_i can still be quite large, leading us to consider more efficient solutions, including several effective graphical-model optimization algorithms proposed in the machine learning community [9], e.g., graph cuts, and several heuristic search methods, e.g., the A* algorithm. In the implementation of SeMap, we apply the A* algorithm [8] to this optimization problem. A* is a graph search algorithm that finds a path from a given initial node to a given goal node (or to one passing a given goal test). It employs a heuristic estimate that ranks each node by an estimate of the best route that goes through that node, and it visits the nodes in order of this heuristic estimate. In our implementation, each schema element is considered a node, and the candidate matches are the possible paths to the schema elements on the other side. The value of each path is heuristically estimated, and a globally optimal assignment is selected.

The discussion above focuses on match selection for an arbitrary set of elements E. In the context of complex matches, the perspectives from the source and target schemas can be significantly different. For example, in Figure 3.3, starting from the side of schema T, one may discover that 'grad TA' is a specialization of 'TA', but miss the fact that 'TA' consists of both 'grad TA' and 'ugrad TA'. To simplify the problem, we distinguish the perspectives of the source and target schemas, i.e., we treat the match selection for source and target elements separately. Thus, instead of searching for the best matches for only the source elements E_s or only the target elements E_t, as in previous work such as iMAP [2], we run the optimization algorithm for both the source elements E_s and the target elements E_t, i.e., a bidirectional search. The result of the bidirectional search is two sets of matches, Map_s and Map_t. An example of Map_s and Map_t is shown in Figure 4.2, where from the side of S the element 'TA' is assigned the correct match but 'instructor' is not, while from the perspective of T the case is just the opposite. Map_s and Map_t will then be merged to form a final complete mapping; the merge operation is discussed in Section 4.3.

Figure 4.2: Partial match assignments from the perspectives of the source and target schemas respectively.

4.2.3 Modeling user interaction

Capturing user feedback is crucial for improving mapping quality. Though user feedback is used for error correction, mapping refinement, etc. [32], modeling user interaction is a comparatively under-studied problem. In the following we show that the probabilistic framework discussed above can be extended to model user interaction.

The key to modeling user interaction is to identify those critical points where feedback is maximally helpful, so that user attention can be focused on important problems and the user's workload is minimized. We present an active learning solution to simulating the effect of the user's selection of candidate matches for schema elements. This is an extension of the approach discussed in [31], where active learning is applied to learning the optimal parameters for matching web interfaces. In our approach, the model elements are ranked based on their potential information value, and user feedback is requested on the most informative ones.
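To make the selection model of Sections 4.2.1 and 4.2.2 concrete before formalizing informativeness, the following sketch scores complete match assignments with the constrained probability model and picks the best one by exhaustive enumeration over a small element set. It is a minimal sketch under stated assumptions: brute-force enumeration stands in for the A*-based optimization used in SeMap, the one-to-one constraint is a hypothetical example, the weight alpha is fixed to 1, and the similarity scores are invented.

```python
import itertools, math

def assignment_probability(elements, candidates, assignment, constraints, alpha=1.0):
    """Unnormalized P(E = M | C): product of per-element match probabilities times
    exp(-alpha * penalty) summed over the violated constraints (Section 4.2.1)."""
    p = 1.0
    for e in elements:
        sims = candidates[e]                      # {match: Sim(m)} for element e
        p *= sims[assignment[e]] / sum(sims.values())
    penalty = sum(c(assignment) for c in constraints)
    return p * math.exp(-alpha * penalty)

def one_to_one_penalty(assignment):
    """Hypothetical constraint: penalize target elements that are used more than once."""
    targets = list(assignment.values())
    return sum(targets.count(t) - 1 for t in set(targets))

def select_matches(candidates, constraints):
    """Brute-force stand-in for the constrained optimization of Section 4.2.2."""
    elements = list(candidates)
    best, best_p = None, -1.0
    for combo in itertools.product(*(candidates[e] for e in elements)):
        assignment = dict(zip(elements, combo))
        p = assignment_probability(elements, candidates, assignment, constraints)
        if p > best_p:
            best, best_p = assignment, p
    return best

# Toy candidate pool with similarity scores from the schema matcher (illustrative values).
candidates = {"TA": {"grad TA": 0.8, "ugrad TA": 0.7},
              "instructor": {"professor": 0.6, "grad TA": 0.5}}
print(select_matches(candidates, [one_to_one_penalty]))
# -> {'TA': 'grad TA', 'instructor': 'professor'}
```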
A natural measure of information value is entropy, which represents the uncertainty about a signal or random event. Formally, for a variable x with distribution P(x), its entropy H(x) is defined as

\[ H(x) = -\sum_{x} P(x) \log P(x) \]

Intuitively, entropy is a measure of randomness or uncertainty: the higher the uncertainty, the larger the entropy value. In the context of match selection, the entropy corresponding to each schema element measures the uncertainty about its match assignment; however, it does not take into account that element's influence on the match assignments of the other schema elements. This deficiency of entropy leads to another metric, mutual information (MI):

\[ I(x, y) = H(x) - H(x \mid y) = H(y) - H(y \mid x) \]

Intuitively, the mutual information I(x, y) of two random variables x and y expresses the reduction in the uncertainty about x by virtue of being informed of the value of y (or vice versa). Hence one seeks the element having the maximum mutual information with the other elements, i.e., the one whose match assignment can maximally help identify the best matches for the others. Formally, one seeks the most informative element e, the one that maximizes \(\sum_{e' \in E} I(e, e')\); the joint probability P(e, e' | C) required for this computation can be calculated from the model of Section 4.2.1.

Once the most informative element has been selected, this element should be disambiguated by the user, which is the key tenet of active learning: the system should prompt the user to disambiguate the cases where the user's input will provide the most information to the system. Once the user's selection for the most informative element is obtained, the setting of the match selection problem is updated accordingly. Specifically, assuming that the match assignment of an element e is fixed, e.g., e = m*, then P(e = m*) = 1 and P(e = m) = 0 for m ≠ m*. Based on the new belief about e, the joint probability of the variables is updated accordingly. Note that only the joint probabilities of the subset of E involving e need to be updated.

In this section, we presented a novel probabilistic framework which incorporates both match uncertainty and domain constraints in a uniform way. Match selection is solved as a constrained optimization problem, and user interaction is significantly reduced by identifying the most critical points.

4.3 Mapping Assembler

The bidirectional search (Section 4.2.2) produces two sets of matches, Map_s and Map_t, representing the correspondences from the perspectives of the source and target schemas respectively. The next component of the mapping construction process is the mapping assembler, the final phase of the SeMap system, as shown in the architecture diagram (Figure 3.5). In the mapping assembler, Map_s and Map_t are combined to form a final generic semantic mapping. Specifically, in this process we aim to solve the following problems: select an optimal set of matches from both mappings (Section 4.3.1); identify the relationships implicit in the selected matches (Section 4.3.2); and assemble these matches together to form a final, generic semantic mapping (Section 4.3.3).

4.3.1 Combining Map_s and Map_t

By using our novel bidirectional search to create Map_s and Map_t, we have considerably narrowed the number of matches that must be considered in order to build our generic semantic mapping.
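Before describing how Map_s and Map_t are merged, the entropy and mutual-information criterion of Section 4.2.3 can be illustrated with a short sketch. The joint distributions over pairs of element assignments are assumed to be precomputed from the model of Section 4.2.1; the function names and the data layout are illustrative assumptions, not SeMap's implementation.

```python
import math
from collections import defaultdict

def entropy(dist):
    """H(x) = -sum_x P(x) log P(x) for a {value: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(x, y) computed as H(x) + H(y) - H(x, y) from a joint {(x, y): probability}
    distribution; this is equivalent to H(x) - H(x|y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return entropy(px) + entropy(py) - entropy(joint)

def most_informative_element(elements, joint_dists):
    """Element whose assignment has the largest total mutual information with the
    assignments of the other elements; this is the one the user is asked to disambiguate."""
    def total_mi(e):
        return sum(mutual_information(joint_dists.get((e, e2)) or joint_dists[(e2, e)])
                   for e2 in elements if e2 != e)
    return max(elements, key=total_mi)
```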
This narrowing results in a much smaller search space to be examined when determining the semantic relationships (such as the ones in Figure 3.3). To merge Map_s and Map_t into a final mapping, we present a heuristic approach, which our preliminary tests have shown to work effectively in practice. Let M denote the set of matches in Map_s and Map_t. For each match m ∈ M, one calculates the reward of including it in the final mapping:

\[ R(m) = Sim(m) \cdot \exp\Big( \sum_{c \in C} \sum_{m' \in M} -\alpha_c \, \Pi_c(m, m') \Big) \]

Intuitively, this reward function takes into account the similarity score of the match and the constraints it violates together with the other matches in M. One then filters out those matches whose reward R(m) is lower than a certain threshold ε.

4.3.2 Identifying relationships

One key step of the SeMap system is identifying the relationships implicit in the selected matches. Since we intend to create generic semantic relationships, there is little previous work on this problem. In SeMap, we propose a rule-based method to identify the implicit relationships. Because the matches are identified by various schema matching techniques based on different kinds of semantic evidence, a uniform solution is hard to obtain. Instead, for each type of semantic evidence, we define specific rules to extract the generic relationship embedded inside. In the following, we show how to identify the four relationship types Equivalent, Has-a, Is-a and Associates with the help of semantic evidence.

We classify the semantic evidence into four categories: schema-level (e.g., label, type, structure), semantic-level (e.g., sense), instance-level (i.e., data-level), and ontology. Usually a match is associated with semantic evidence from multiple categories. We describe each category in more detail below.

Schema-level evidence includes label, type, and structure information. Generally speaking, schema-level information alone is insufficient to determine the embedded generic relationship. However, it provides support for the results claimed by other semantic evidence, which can further help deduce implicit relationships. In our implementation, corresponding to each kind of lineage information, we apply the following heuristic rules:

• Label. Two elements with an Equivalent relationship are likely to share similar names, while two elements with Is-a or Has-a relationships tend to have labels with prefix/suffix relationships. For example, from the fact that 'grad-TA' has 'TA' as a substring/suffix, we know that 'grad-TA' is probably a part or a specialization of 'TA'; which of the two holds depends on other types of semantic evidence;
• Type. Two elements with an Equivalent or Is-a relationship probably share the same data type, while if one element has a data type that is a subcomponent of the other's, it is likely that they have a Has-a relationship;
• Structure. Two non-leaf nodes, or two leaf nodes, possibly have an Equivalent or Is-a relationship, while one non-leaf node and one leaf node are likely to be in a Has-a relation.

Semantic-level evidence is the sense of the label (name) of the schema element. Based on the lineage information produced by the sense matcher, one can infer the embedded semantic relationships.
The heuristic rules are as follows:

• If two elements share at least one sense, it is highly possible that they are semantically similar or equivalent;
• If some sense of one element acts as a hypernym of a sense of the other, then they may be in an Is-a relationship;
• If some senses of one element are hyponyms of those of the other, the two elements are probably in an Is-a relation;
• If some sense of one element appears as a holonym or meronym of a sense of the other, they can be in a Has-a relationship.

Instances (i.e., data) give entity-level clues about the relationship between schema elements. Hence instance-level evidence usually characterizes the actual content of schema elements precisely. By studying the subsumption or distributional similarity of the data instances of schema elements, one can discover relationships that are difficult to identify at the schema level due to differences in schema design.

• If two elements have similar data distributions, they are likely to be 'equivalent';
• If the instances of one element x subsume those of another element y, it is likely that x 'has-a' y as its member;
• If the instances of x and y intersect, it is possible that x 'associates' with y.

A domain-specific ontology provides alternative representations of concepts in the domain, and of their possible relationships. Since it directly provides information about the relationships of the identified matches, it is trivial to refine these relationships into generic ones. For example, one can simply follow a mapping between the two relationship classification systems, e.g., the one between ontology modeling and meta-data management, and convert the relationships from the ontology/corpus into our generic relationship types.

Note that a match is usually associated with various kinds of semantic evidence, which may indicate different types of generic relationships. In combining the results suggested by the various semantic evidence, we follow a voting scheme: each type of evidence is associated with a weight, and a unified conclusion is obtained by linearly combining their results, as in the case of the schema matcher (Section 4.1.5). Generally speaking, the weights are set according to the relative plausibility of the semantic evidence. In our system, we rank the types of semantic evidence as follows: ontology > instance > sense > schema information, and set the weights accordingly.

Figure 4.3: Mapping assembly for matches of different types. Each 1-1 equivalence match corresponds to one mapping element, while each element of a complex match is associated with one mapping element.

As an example, in Figure 3.3, for the match of 'professor' and 'faculty', the sense evidence suggests an Is-a relationship, the type evidence indicates Equivalent or Is-a equally, while the label and structure evidence provide no advice. Assuming the four types of evidence have weights 0.4, 0.3, 0.2 and 0.1 respectively, the voting result correctly identifies the match 'professor'/'faculty' as an Is-a relationship.

4.3.3 Assembling mapping

Finally, given that we have discovered the semantic relationships as in Section 4.3.2, we must create the final mapping.
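Before listing the assembly rules, the weighted voting of Section 4.3.2 can be sketched as follows. The weights and per-evidence votes below are one illustrative reading of the 'professor'/'faculty' example (sense, type, label and structure evidence weighted 0.4, 0.3, 0.2 and 0.1); they are placeholders, not SeMap's actual settings.

```python
# Weighted voting over relationship types (Section 4.3.2).
RELATIONSHIPS = ["Equivalent", "Is-a", "Has-a", "Associates"]

def vote(evidence_votes, weights):
    """Each evidence source distributes its belief over the relationship types;
    the winner is the type with the largest weighted sum."""
    totals = {r: 0.0 for r in RELATIONSHIPS}
    for source, dist in evidence_votes.items():
        for rel, belief in dist.items():
            totals[rel] += weights.get(source, 0.0) * belief
    return max(totals, key=totals.get), totals

weights = {"sense": 0.4, "type": 0.3, "label": 0.2, "structure": 0.1}
# Sense evidence suggests Is-a; type evidence is split between Equivalent and Is-a;
# label and structure evidence abstain.
votes = {"sense": {"Is-a": 1.0}, "type": {"Equivalent": 0.5, "Is-a": 0.5}}
print(vote(votes, weights))   # Is-a wins: 0.4 + 0.15 versus 0.15 for Equivalent
```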
In order to ensure that the final mapping meets the specifications in the problem definition (Chapter 3), we apply the following rules to create mapping elements: (1) For 1-1 equivalent match, one mapping element is sufficient to represent them in the mapping; (2) For 1-1 non-equivalent match or complex match, one mapping element is created for each element of the match; (3) For each of other input element not involving in match, one corresponding mapping element is created. All 62 Chapter 4. SeMap System these cases are illustrated in Figure 4.3. Note that it is possible to iden-tify the relationships between the mapping elements belonging to different matches, in order to form a model-like complete mapping, using approach similar to the merge operator [21], though it is not the focus of this thesis. 63 Chapter 5 Experimental Analysis To evaluate the effectiveness of our schema matching system, we applied SeMap to several real-world domains. Specifically, the experiments are per-formed with two main goals: • To evaluate the matching accuracy of our schema matching system. Since our goal is to detect the complex relationship existing between schema elements, the matching accuracy includes not only the corre-spondences as measured in most previous schema matching systems, but also the detected relationship types. • To measure the relative contribution of different system components to the result. Specifically, we are interested in measuring the performance gain from (1) different base matchers (2) match selector including user interaction. This chapter presents the empirical analysis of the SeMap system. We first describe the experimental setting in Section 5.1. This includes (1) the dataset used in experiments (Section 5.1.1); (2) the expert mappings (Sec-tion 5.1.2); (3) the metrics to evaluate matching results (Section 5.1.3) and (4) the experimental methodology (Section 5.1.4). The evaluation results of matching accuracy are then presented in Section 5.2.1. Next we show 64 Chapter 5. Experimental Analysis the relative contribution of each system component to the final matching results in Section 5.2.2. Finally, based on the evaluation results, we analyze the strengths and weaknesses of our approach in Section 5.2.4. 5.1 Experimental Setting 5.1.1 Data Set We evaluated SeMap on both synthetic and real datasets. The synthetic dataset is the example shown in Figure 3.3. The real-life datasets are from two domains, namely Real Estate, and Course Information. A l l the real datasets are imported from the Illinois Semantic Integration Archive [29]. Specifically, the Real Estate dataset is a set of schemas describing the infor-mation of houses for sale; And the Course dataset is a set of schemas on the information of courses offered across different universities. A l l the schemas used in evaluation are included in the Appendix A of this thesis. Some of these schemas are associated with data instances (Real Estate and Course Info). In our experiments, for the schemas with data instances, we exploit the exact data format of the elements by looking at their data instances in addition to the schema-level information. For schemas without data in-stances (Synthetic), we evaluated our approach on schema-level information only. Since all the original real-life schemas are in D T D format, we performed a converting operation to import the schemas into our model representation. We faithfully mirror the structure and terms from source schemas. 
Because in the model representation every edge between two elements has a specific relationship type, while this is not the case in DTD, we set all the relationships to Has-a by default. For example, if the schema in Figure 3.1 is represented as a DTD, which has no relationship type between the element 'course' and any other element, SeMap by default regards all these links as Has-a relationships. In preparing the data, it is necessary to apply some trivial data cleaning operations, such as splitting 'custno' into 'cust' and 'no'. This is not the focus of our SeMap system, so we will not discuss it further in this thesis; a more detailed discussion of data cleaning can be found in [23].

domain         schema         # leaf/non-leaf    # relationship
Real Estate    Homeseekers         25/3                27
               Texas               31/3                33
               Yahoo               23/2                24
Course Info    Reed                12/3                14
               Rice                11/4                14
               UWM                 15/4                18
Synthetic      course               5/1                 5
               class                5/1                 5

Table 5.1: Characteristics of the input schemas.

5.1.2 Expert Mapping

In preparing the real-life datasets, we extracted three schemas from each domain. We chose schemas with complex structure, among which complex relationships can exist. The characteristics of these schemas after the preprocessing step are shown in Table 5.1, including the number of elements (leaf and non-leaf), the number of relationships, the maximum depth of the tree, etc.

For each pair of schemas in the same domain, we created an expert mapping, which acts as the 'standard answer' to the matching problem. The characteristics of the expert mappings are shown in Table 5.2. The table shows the total number of matches and the number of matches of each specific relationship type, as well as the percentage of elements from the source and target schemas that are involved in the mapping. Note that in the expert mappings we do not consider matches with the Associates relationship because of its weak semantics, though during matching SeMap may find the Associates relationship when it lacks sufficient semantic evidence.

                                        # relationship
domain         schema S/T          Equivalent   Is-a   Has-a   total
Real Estate    Homeseekers/Texas       18         6      12      36
               Homeseekers/Yahoo       20         0      11      31
Course Info    Reed/Washington         11         0       7      18
               Reed/WSU                18         1       8      27
Synthetic      course/class             2         4       3       9

Table 5.2: Characteristics of the expert mappings.

5.1.3 Evaluation Metrics

Following previous work on schema matching, we evaluated the performance of our approach on three metrics: recall, precision, and F-measure [25]. Precision P is the percentage of correctly identified matches over all matches identified by the system; recall R is the percentage of correctly identified matches over all correct matches in the given expert mapping. Formally, let #detected be the number of correct matches detected, #mapping the total number of correct matches in the expert mapping, and #result the total number of matches in the results; the recall R and precision P are defined as

\[ R = \frac{\#_{detected}}{\#_{mapping}}, \qquad P = \frac{\#_{detected}}{\#_{result}} \]

Recall and precision typically trade off against each other, so it is desirable to have a single measure that incorporates both. The F-measure F (precisely, the F_1 measure) equally weights recall R and precision P:

\[ F = \frac{2 \cdot P \cdot R}{P + R} \]

As discussed above, we deal not only with correspondences, but also with the implicit relationships. Thus a correct match means that (1) the found correspondence is correct, and (2) the identified relationship is exactly that in the expert mapping.
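For concreteness, these metrics can be computed as in the following sketch. The representation of a match as an (element, element, relationship) triple is an illustrative assumption; a "correct" match is one whose correspondence and relationship type both agree with the expert mapping.

```python
def precision_recall_f1(system_matches, expert_matches):
    """Precision, recall and F1 over sets of (source element, target element, relationship)."""
    detected = len(set(system_matches) & set(expert_matches))   # correct matches found
    precision = detected / len(system_matches) if system_matches else 0.0
    recall = detected / len(expert_matches) if expert_matches else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

expert = {("TA", "grad TA", "Has-a"), ("course", "class", "Equivalent")}
system = {("TA", "grad TA", "Has-a"), ("TA", "professor", "Associates")}
print(precision_recall_f1(system, expert))   # (0.5, 0.5, 0.5)
```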
So we measured the matching accuracy in two-fold: (1) the total number of correct correspondences detected; (2) the number of correct matches for each type of relationships, e.g., Equivalent, Is-a, etc. 5.1.4 Experimental Methodology For each domain, we performed three sets of experiments, i.e., between each possible pair of schemas. We first evaluated the matching accuracy of SeMapior each sets of data, and investigated how sensitive it is against the setting of parameters; We then evaluated the relative contribution of each component and user interaction to the final mapping. Currently, the parameters needed to be tuned in our system are the 68 Chapter 5. Experimental Analysis weights of different semantic evidences as produced by the various match-ers, i.e., label (A;), type (At), sense (A 3 ) , etc. The influence of these weights parameters are referred to Section 4.1.5. The specific setting of parameters for different datasets are as follows: (1) For the Real Estate dataset, the parameters are set as A( = 0.3, At = 0.5, and Xa .= 0.2, reflecting the ob-servation that in the Real Estate dataset, the name, type and sense are all important to identifying the implicit relationship, however due to a number of abbreviation, e.g., 'ac' for 'air conditioner', the type information is espe-cially important to finding the hidden correspondences; (2) For the Course Info dataset, A; = 0.4, At = 0.4, and A s = 0.2, which means the name and type convey equally significant semantic information, and the sense carries less importance weight due to a small quantity of synonyms, hyponyms, etc in this dataset; (3) For the synthetic dataset, A( = 0.4, At = 0.3, and A s = 0.3. That is all types of semantic information carry equally importance weights. 5.2 Experimental Result We present in the following the matching performance of the SeMap system on the synthetic and real datasets. We first show the matching accuracy of the system exploiting all the semantic information, and then analyze the contribution from different types of semantic evidences. Finally we measure the performance gain from incorporating user interaction into the match selector phase. 69 Chapter 5. Experimental Analysis Matching Accuracy EES lyiMi 100 80 60 40 20 0 l l l l l Synthetic Course Info 1 Course Info 2 Figure .5.1: Matching accuracy of SeMap. The three plots show the recall, precision and F-measure of the matching results for the three relationship types Equivalent, Has-a, Is-a and total correct matches respectively. 5.2.1 Matching Accuracy Figure 5.1 shows the matching results of SeMap over the five datasets, mea-sured using the metrics recall, precision and F-measure. As discussed above, we measured both the accuracy of identified matches, and identified rela-tionships. The four bars (from left to right) show the matching accuracy for the three relationship types Equivalent, Has-a, Is-a (correct match and correct relationship) and total number of correct matches (not including the relationship) respectively. The result shows that SeMap achieved a high average matching accuracy not only in detecting the correct correspondences (total number of correct 70 Chapter 5. Experimental Analysis matches), but also in detecting the implicit relationships. Take F-measure as an example, the percentage of correct correspondences range from 70% to 100%, and the average accuracy of detected relationships are 79%, 69% and 64% for the three relationships respectively. 
The accuracy of correspon-dences detected is comparable to the results claimed in i M A P [2] (Tested on Real Estate dataset, 77-100% for 1-to-l matches and 50-86% for 1-to-n matches), and that produced by [5] (Tested on Real Estate dataset and Course Info dataset, 73% recall and 67% precision without domain ontology, and 94% recall and 90% precision with domain ontology). Note that our system did not resort to any domain ontology or data frames as used in [5]. Moreover, the problem we are tackling is more dif-ficult than these works in the sense that we have to not only identify the correct correspondences, but also extract the generic semantic relationships implicit in the correspondences. In Section 5.2.4, we wil l identify the reasons that prevent SeMap from achieving higher accuracy in identifying correspon-dences and extracting implicit relationships. Looking at these results in further detail, we note that in the Real Es-tate 1 and Course Info 2 datasets however, SeMap has low precision (about 40% for Real Estate 1 and 20% for Course Info 2) in identifying the Has-a relationship. To further explore why this is the case, we provide more de-tailed analysis of detailed composition of the matches identified by SeMap, i.e., of the matches identified as each type, how many are (1) correct corre-spondence and correct relationship (2) correct correspondence but incorrect relationship (3) incorrect correspondence (non-match). The result is shown in Figure 5.2, where for each type of relationship, we look into the compo-71 Chapter 5. Experimental Analysis sition of matches as identified by SeMap. It can be noticed in both case of Real Estate 1 and Course Info 2, incorrect correspondences are the main cause of the low precision in identifying matches of Has-a relationship. B y barring those incorrect correspondences, the precision of SeMap in identi-fying semantic relationships is much more higher. The accuracy of SeMap after barring incorrect correspondences is shown in Figure fig:impaccuracy, where the precision reaches near 100% over most datasets. Note that the synthetic dataset is very small, despite its low precision in this case, SeMap misclassified only one match. We would like to argue that finding correspon-dences is the focus of schema matching, and our SeMap system is built on a moderate set of current schema matching techniques, it can be expected that with more powerful schema matching techniques or domain knowledge, the number of incorrect matches can be significantly reduced. In the de-tailed error analysis, one can also notice that on average SeMap has higher accuracy in identifying Is-a relationship than Has-a relationship. This can be explained by the fact that the thesaurus employed in SeMap (WordNet) returns comprehensive information of meronym/holonym relationships be-tween two words, even though the senses may not be exactly that appearing in the context of schemas. The way to improve the accuracy of identifying Has-a relationship can be (1) prune the noise senses in using the thesaurus, which will be explored in our future work; (2) lowering the weight of the sense matcher, which however may affect finding other types of relationships. 72 Chapter 5. 
                                       composition
schemas          relationship    Equivalent   Is-a   Has-a   non-match
Real Estate 1    Equivalent          13         0      0         4
                 Is-a                 0         2      0         0
                 Has-a                4         2      8         7
Real Estate 2    Equivalent           9         0      0         3
                 Is-a                 0         5      0         3
                 Has-a                3         0      7         4
Synthetic        Equivalent           2         1      0         0
                 Is-a                 0         1      0         0
                 Has-a                0         0      4         0
Course Info 1    Equivalent          11         1      0         1
                 Is-a                 0         2      0         0
                 Has-a                0         0      0         0
Course Info 2    Equivalent          12         0      0         0
                 Is-a                 0         4      0         1
                 Has-a                0         0      1         4

Figure 5.2: Error analysis of the resulting mappings.

Figure 5.3: The precision of SeMap after pruning incorrect matches. The bars from left to right show the matching results for the three relationship types Equivalent, Has-a, Is-a respectively.

5.2.2 Component Contribution

In this section, we studied the relative contribution of different types of semantic evidence to the matching results. Specifically, we tested three types of semantic evidence, element label (name), element type and element sense, which are available in most schemas. In each test, we left out one type of semantic evidence and used the remaining two to identify the correspondences and the associated relationship types. The weights of the two remaining types of semantic evidence were both set to 0.5, reflecting the neutral assumption that both are equally important.

The two plots of Figure 5.4 show the F-measure of the identified matches (correspondences) and of the identified relationships respectively. Within each plot, the bars from left to right represent the results produced by SeMap without label evidence, without type evidence, without sense evidence, and by the complete SeMap respectively. It is observed that in almost all cases each type of semantic evidence contributes to the overall performance, except for the Real Estate 2 dataset, where the F-measure of the relationships identified by the complete system is worse than that of a system without type evidence. This can be explained by the fact that in some cases different kinds of semantic evidence conflict in determining the implicit relationships, and some kinds of evidence may dilute the correct diagnosis.

Figure 5.4: Relative contribution of different types of semantic evidence to the matching results of SeMap. The two plots (from top to bottom) show the F-measure of identified matches (correspondences) and identified relationships respectively.

5.2.3 Incorporating User Feedback

We also studied the performance gain from incorporating user feedback into the SeMap system. We applied the user interaction mechanism discussed in Chapter 4: each candidate match of a schema element is associated with an uncertainty estimate, based on its similarity score and the domain constraints. At each iteration, the schema element whose candidate matches have the largest mutual information with the other elements is selected, and the user is asked to provide the correct answer for this element. The system then updates the uncertainty estimates, and the procedure repeats until a threshold is reached. In this test, we measured the number of correct matches that need to be provided before a perfect set of matches is reached. Note that we only tested the accuracy of identifying the correspondences, not of extracting the implicit relationships.
The F-measure of correct correspondences versus the amount of user interaction needed (the percentage of expert matches provided over the total number of matches) is shown in Figure 5.5 (the synthetic dataset is skipped, since 100% correct correspondences are reached without any user interaction). Over the datasets Real Estate 1, Real Estate 2 and Course Info 2, the F-measure reaches its maximum value when about 20% of the expert matches are provided, while over the Course Info 1 dataset about 10% of the expert matches lead to a perfect set of matches. This result suggests that SeMap can effectively incorporate user interaction, i.e., it needs only a few equality constraints provided by users to achieve high-accuracy matches. Note that the maximum possible F-measure is not necessarily 100%; this is explained by the fact that the F-measure incorporates both precision and recall, and it is not always possible to achieve 100% recall (all correct matches detected), due to the limits of the current matching techniques that SeMap takes as input.

Figure 5.5: F-measure of correct correspondences versus the amount of user interaction (percentage of expert matches provided over the total number of matches). The curves for four datasets (Real Estate 1/2, Course Info 1/2) are shown.

5.2.4 Discussion

In this section, we discuss the difficulties that prevent SeMap from achieving better performance in identifying both correspondences and implicit relationships. Specifically, the current SeMap system faces three issues:

• More types of base matchers are needed in order to fully exploit the available schema-level information. In our current implementation, only four types of base matchers are employed, which sometimes leads to low precision, as discussed in Section 5.2.1; it can be expected that more types of evidence will lead to finding more hidden information;
• Data instances should be taken into consideration. It is only at the data-instance level that one can understand the exact format and semantics of schema elements;
• Parameter tuning is critical for schema matching. In the SeMap system, there are five parameters (three weights of evidence and two thresholds), which interact in a complicated way. A systematic tuning scheme (e.g., a learning approach [27]) could significantly improve the results.

Chapter 6 Conclusion & Future Work

In this thesis we presented an approach for identifying generic, semantic relationships between the elements of two models (e.g., database schemas, ontologies, web interfaces, etc.) based on initial match information provided by current schema matching techniques.
Our main contributions include (1) we point out the importance of the problem of identifying generic semantic relationships between schema elements; (2) we designed an architecture for semi-automatically constructing generic semantic mappings based on initial correspondence information; (3) we created a novel probabilistic framework that transforms match selection to a well-defined mathematical optimization problem; (4) we effectively modeled the user interaction to help focus user at-tention and minimize user effort, by detecting critical points where feedback is maximally useful; (5) we proposed effective solution to extracting rela-tionship implicit in matches based on various types of semantic evidences; (6) we implemented a prototype system embodying the innovations above and a set of experiments to illustrate the effectiveness of our approach. We envision several future directions. The first is to incorporate our system into a model management system [17], and explore the new possi-bility in meta-data management brought by generic, semantically rich map-78 Chapter 6. Conclusion