Supporting Domain Heterogeneous Data Sources for Semantic Integration by Jian Xu B. Computer Science, Nanjing University of Aeronautics and Astronautics, China 2002 M. Computer Science & Engineering, University of New South Wales, Australia 2004 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) August 2011 c Jian Xu, 2011 Abstract A SEMantic Integration System (SemIS) allows a query over one database to be answered using the knowledge managed in multiple databases in the system. It does so by translating a query across the collaborative databases in which data is autonomously managed in heterogeneous schemas. In this thesis, we investigate the challenges that arise in enabling domain heterogeneous (DH) databases to collaborate in a SemIS. In such a setting, distributed databases modeled as independent data sources are pairwise mapped to form the semantic overlay network (SON) of the SemIS. We study two problems we believe are foremost to allow a SemIS to integrate DH data sources. The first problem tackled in this thesis is to efficiently organize data sources so that query answering is efficient despite the increased level of source heterogeneity. This problem is modeled as an “Acquaintance Selection” problem and our solution helps data sources to choose appropriate acquaintances to create schema mappings with and therefore allows a SemIS to have a single-layered and flexible SON. The second problem tackled in this thesis is to allow aggregate queries to be translated across domain heterogeneous (DH) data sources where objects are usually represented and managed at different granularity. We focus our study on relational databases and propose novel techniques that allow a (non-aggregate) query to be answered by aggregations over objects at a finer granularity. The new query answering framework, named “decomposition aggregation query (DAQ)” processing, integrates data sources holding information in different domains and different granularity. New challenges are identified and tackled in a systematic way. We studied query optimizations for DAQ to provide efficient and scalable query processing. ii The solutions for both problems are evaluated empirically using real-life data and synthetic data sets. The empirical studies verified our theoretical claims and showed the feasibility, applicability (for real-life applications) and scalability of the techniques and solutions. iii Preface Materials in this thesis represent the research work done during the PhD study of the author at UBC. Part of the materials are published with collaboration of other researchers. At the time of this thesis writing, the following article is accepted to publish. Jian Xu, Rachel Pottinger. Optimizing Acquaintance Selection in a PDMS. International Journal of Cooperative Information Systems vol 20, No. 1 (2011) pages 39-81 Research work in the above publication was carried out by me under the supervision of Rachel Pottinger. I worked on all the technical parts and empirical studies and discussed intensively with Rachel. We collaborated in the writing of the article. iv Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Domain Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 The JIIRP case study . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Characteristics of domain heterogeneity . . . . . . . . . . 5 1.1.3 Domain heterogeneity in other applications . . . . . . . . 8 1.2 1.2.1 Semantic integration architectures . . . . . . . . . . . . . 11 1.2.2 Two questions from domain heterogeneity on PDMSs . . 14 Overview to Thesis Contribution . . . . . . . . . . . . . . . . . . 15 PRELIMINARIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 PDMS Query Answering . . . . . . . . . . . . . . . . . . . . . . 17 2.1.1 18 1.3 2 Semantic Integration and Peer Data Management Systems (PDMSs) 11 Mapping and query translation . . . . . . . . . . . . . . . v 2.1.2 Relational data model and Datalog language . . . . . . . . 19 Cost of Operations . . . . . . . . . . . . . . . . . . . . . . . . . 21 ACQUAINTANCE SELECTION . . . . . . . . . . . . . . . . . . . . 24 3.1 The Task of Acquaintance Selection . . . . . . . . . . . . . . . . 24 3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.1 Graphical model . . . . . . . . . . . . . . . . . . . . . . 27 3.2.2 MAP estimates . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.3 Expectation maximization (EM) . . . . . . . . . . . . . . 29 3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4 The Acquaintance Selection Framework . . . . . . . . . . . . . . 31 3.4.1 Acquaintance selection operations . . . . . . . . . . . . . 31 3.4.2 The mapping effectiveness measure . . . . . . . . . . . . 32 3.4.3 The acquaintance selection criteria . . . . . . . . . . . . . 38 Acquaintance Selection Schemes . . . . . . . . . . . . . . . . . . 38 3.5.1 Chordless paths for all candidates . . . . . . . . . . . . . 39 3.5.2 Max-min approximation for S(CM) . . . . . . . . . . . . 43 3.5.3 The one-shot selection scheme . . . . . . . . . . . . . . . 44 3.5.4 The two-hop selection scheme . . . . . . . . . . . . . . . 54 3.5.5 Choosing between one-shot and two-hop . . . . . . . . . 61 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.6.1 Experimental settings . . . . . . . . . . . . . . . . . . . . 62 3.6.2 Evaluation of chordless path finding algorithm . . . . . . 65 3.6.3 Evaluation of the one-shot selection scheme . . . . . . . . 66 3.6.4 Evaluation of the two-hop selection scheme . . . . . . . . 71 THE DECOMPOSITION AGGREGATION QUERY . . . . . . . . 76 4.1 Decomposition Aggregation Queries . . . . . . . . . . . . . . . . 77 4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.2.1 Mapping tables . . . . . . . . . . . . . . . . . . . . . . . 79 4.2.2 Open world and closed world assumptions . . . . . . . . 80 4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4 DAQ Formalization . . . . . . . . . . . . . . . . . . . . . . . . . 83 2.2 3 3.5 3.6 4 vi 4.5 4.6 4.7 5 4.4.1 The semantics of DAQ answers . . . . . . . . . . . . . . 83 4.4.2 The decomposition mapping . . . . . . . . . . . . . . . . 84 4.4.3 The aggregate binding . . . . . . . . . . . . . . . . . . . 86 4.4.4 The aggregation rewriting . . . . . . . . . . . . . . . . . 87 DAQ Processing with 3-role Structure . . . . . . . . . . . . . . . 90 4.5.1 Overview of DAQ answering . . . . . . . . . . . . . . . . 90 4.5.2 Planning the aggregation rewriting . . . . . . . . . . . . . 92 4.5.3 The basic aggregation rewriting algorithm . . . . . . . . . 93 4.5.4 Analysis 98 . . . . . . . . . . . . . . . . . . . . . . . . . . Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.6.1 Collaboration of multiple a-nodes . . . . . . . . . . . . . 101 4.6.2 Preprocessing for aggregation rewriting . . . . . . . . . . 103 Evaluate the Query Rewriting Algorithms . . . . . . . . . . . . . 105 4.7.1 Evaluation of the aggregation rewriting . . . . . . . . . . 106 4.7.2 Impact of mapping rule sizes . . . . . . . . . . . . . . . . 106 4.7.3 The benefit of “fetch on demand” . . . . . . . . . . . . . 107 DAQ QUERY OPTIMIZATION . . . . . . . . . . . . . . . . . . . . 109 5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.1.1 Hierarchical aggregate processing . . . . . . . . . . . . . 110 5.1.2 Bootstrap sampling . . . . . . . . . . . . . . . . . . . . . 110 5.1.3 Kernel density estimation . . . . . . . . . . . . . . . . . 112 5.1.4 Distance measures for probability distributions . . . . . . 113 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.3 Query Optimization for Phase 1 Answering . . . . . . . . . . . . 115 5.4 5.3.1 Motivating source selection . . . . . . . . . . . . . . . . 115 5.3.2 The source selection optimizations . . . . . . . . . . . . . 118 5.3.3 Mapping direction and MCI . . . . . . . . . . . . . . . . 120 5.3.4 An implementation to MCI . . . . . . . . . . . . . . . . . 122 5.3.5 Optimize the cover-finding step . . . . . . . . . . . . . . 123 5.3.6 Modifications to the IGJB algorithm . . . . . . . . . . . . 126 5.3.7 Optimize the partition step . . . . . . . . . . . . . . . . . 127 Query Optimization for Phase 2 Answering . . . . . . . . . . . . 130 vii 5.5 6 5.4.1 Motivating phase 2 answering . . . . . . . . . . . . . . . 130 5.4.2 Overview of phase 2 answering . . . . . . . . . . . . . . 133 5.4.3 Sampling and kernel density estimation . . . . . . . . . . 134 5.4.4 High coverage intervals and optimization . . . . . . . . . 137 5.4.5 The stability score for query answers . . . . . . . . . . . 141 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 5.5.1 Empirical study for phase 1 answering . . . . . . . . . . . 147 5.5.2 Empirical study for phase 2 answering . . . . . . . . . . . 154 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 viii List of Tables 1.1 The cell and seismic damage schemas . . . . . . . . . . . . . . . 3 3.1 Notations in empirical study for acquaintance selection . . . . . . 62 3.2 Default values for one-shot selection . . . . . . . . . . . . . . . . 66 3.3 Required D to keep misclassification below 10% . . . . . . . . . 69 3.4 Estimation of mean and variance using the one-shot estimator . . . 70 4.1 Aggregate bindings for relations in Table 1.1 . . . . . . . . . . . . 86 4.2 Two valid partitions favoring S3 and S4 . . . . . . . . . . . . . . 103 5.1 The 4 data sets used in empirical study . . . . . . . . . . . . . . . 149 5.2 Parameters in experiments . . . . . . . . . . . . . . . . . . . . . 149 5.3 Data set details . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 5.4 Parameters in empirical study . . . . . . . . . . . . . . . . . . . . 155 5.5 Bootstrap improves confidence interval . . . . . . . . . . . . . . . 155 5.6 Savings on required sample size . . . . . . . . . . . . . . . . . . 156 5.7 Approximation ratio of the greedy CIO algorithm . . . . . . . . . 158 ix List of Figures 1.1 A room design and price quote database . . . . . . . . . . . . . . 8 1.2 An example of the impact of acquaintance to query answering . . 10 1.3 The three architectures: PDMS. . . . . . . . . . . . . . . . . . . 12 3.1 An example of graphical model . . . . . . . . . . . . . . . . . . . 28 3.2 An example of chordless paths:. . . . . . . . . . . . . . . . . . . 36 3.3 The process of adding in a new vertex. . . . . . . . . . . . . . . . 40 3.4 Variables and dependencies in one-shot estimator. . . . . . . . . . 44 3.5 An example for two steps in a SEG:. . . . . . . . . . . . . . . . . 47 3.6 An example of reordering 20 peers . . . . . . . . . . . . . . . . . 49 3.7 The automata that generates non-equivalent assignment series. . . 51 3.8 A motivating example for heuristics in two-hop. . . . . . . . . . . 59 3.9 Testing the CP finding algorithm. . . . . . . . . . . . . . . . . . . 65 3.10 OSME accuracy on the D-regular topology. . . . . . . . . . . . . 67 3.11 OSME time on the D-regular topology. . . . . . . . . . . . . . . . 67 3.12 OSME on the scale-free topology. . . . . . . . . . . . . . . . . . 68 3.13 Requirements on initial information for the one-shot estimator. . . 69 3.14 Maximal pairwise shortest distance . . . . . . . . . . . . . . . . . 71 3.15 Hit ratio for two-hop on the D-regular topology where t = 10%. . 72 3.16 Hit ratio for two-hop on the scale-free topology where t = 10%. . 73 3.17 Two-hop vs. random on the D-regular topology. . . . . . . . . . . 73 3.18 THME time on the D-regular topology. . . . . . . . . . . . . . . . 74 4.1 A 3-role structure for Example 1.1.1. . . . . . . . . . . . . . . . . 77 4.2 Two major operations in DAQ processing:... . . . . . . . . . . . . 85 x 4.3 General steps to process a DAQ . . . . . . . . . . . . . . . . . . 90 4.4 A 5-source example 3-role structure . . . . . . . . . . . . . . . . 92 4.5 Query translation time v.s. mapping size . . . . . . . . . . . . . . 106 4.6 Testing how the size of the q-rule, d-rule, and a-rule affect the rewriting time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 4.7 Fetch on demand’s smoothing effect . . . . . . . . . . . . . . . . 108 5.1 a hierarchical aggregation scheme with 5 nodes: nodes B, D, E sends partial sum aggregates up the hierarchy; the full aggregate (for ids 1-5) is computed at root node A. . . . . . . . . . . . . . . 111 5.2 The workflow of estimating viable answer distribution . . . . . . 112 5.3 A DAQ processing scenario . . . . . . . . . . . . . . . . . . . . . 116 5.4 Data instances for Example 5.3.1 . . . . . . . . . . . . . . . . . . 116 5.5 Overview of the source selection flow. The source selection is positioned between decomposition and query rewriting in DAQ processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.6 Two different-direction mappings . . . . . . . . . . . . . . . . . . 120 5.7 An example distribution and outputs of answer estimates . . . . . 133 5.8 UniS sampling results for the selection paths of {A, B, E} and {B, A, E} respectively. Mi shows which components still need to be covered, Li shows which sources have been used so far, and Ai shows which components are chosen from the given node/data source. The arrows show the different selection paths for Mi and Li . 136 5.9 Two distributions with the same mean (5.0) and variance (5.0), but having very different shapes. . . . . . . . . . . . . . . . . . . . . 137 5.10 Aggregation over 800 and 1000 buildings . . . . . . . . . . . . . 149 5.11 Number of sources needed in cover-finding... . . . . . . . . . . . 150 5.12 Total duplication observed in cover-finding... . . . . . . . . . . . 151 5.13 Cover-finding performance . . . . . . . . . . . . . . . . . . . . . 151 5.14 Partition: ratio of adjusted sources . . . . . . . . . . . . . . . . . 152 5.15 Time used in cover-finding . . . . . . . . . . . . . . . . . . . . . 153 5.16 Multi-mode distributions and high coverage intervals . . . . . . . 157 xi 5.17 Deviations of empirical means (of sample size 400) in simulations taking out 1 data source: the four deviation figures correspond to the four distributions in Figure 5.16. Numbers indicate the relative changes (0.02 = 2%) from the distribution mean when no data source is removed. . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.18 Time breakdown of operations, showing the total time needed for repeating each operation 50 times. . . . . . . . . . . . . . . . . . 160 xii Glossary SemIS SEMantic Integration System DH domain heterogeneous SON semantic overlay network AcqSel acquaintance selection DAQ decomposition aggregation query PDMS peer data management system JIIRP Joint Interdependent Infrastructure Research Program SPJ Select-Project-Join xiii Acknowledgements I would like to first acknowledge my supervisor Professor Rachel Pottinger for her consistent support, encouragement and patient supervision to my research work in UBC. Without her hard work, all the guidance and instructions, it is impossible for me to make all these achievements. Also I would like to thank the whole JIIRP research group for providing the opportunities for me to motivate and discover the main research problems in this thesis. Special thanks to Prof. Jose Marti the principal investigator of the JIIRP project and the principle designer of the I2SIM system; the I2SIM team (Lucy Liu, Quanhong Han, Hafiz, Detao Mao et,al.); the civil engineering team (Dr. Jorge A. Hollman, Hugon Juarez, Kate Thibert et,al.) for contributing the seismic and damage assessment data set; the geographic information team (Prof. Brian Klinkenberg and Alejandro Cervantes) for the collaboration on GIS system and data sets. Also I would like to thank UBC’s Campus Planning/Operation for providing access to information of infrastructures on campus. Besides direct support of data sets I received from the research teams and people mentioned above, the meetings and discussions we had together provided me with valuable first-hand experience on integration of domain heterogeneous data sets and knowledge. It is through those meetings and many iterations of manual integration of data that I realized the importance and challenges in integrating domain heterogeneous databases. Special thanks are owed to my family members: my grandmother, my parents and my brother. The love from family members and my beloved fiancee Xin has always been my strongest support. The PhD study is a long journey in my life. I received a lot of help from professors and my colleague students in the Computer Science Department. I thank Prof. xiv Laks V.S. Lakshmanan and Prof. Raymond Ng from the DMM research group, Prof. David Kirkpatrick and Prof. Holger Hoos from the β -lab, the Graduate Program Administrator Ms. Joyce Poon, my colleagues Lin Xu, Shuang Hao, Xiaofei Wang, Hongrae Lee, Chao Yan, Michael Lawrence and all other friends for their help. I thank the Computer Science Department of UBC for providing me the opportunity to study here and the financial support I received through teaching assistantship. The Natural Sciences and Engineering Research Council of Canada (NSERC) provided research funding. My research also received financial support from Joint Infrastructure Interdependencies Research Program (JIIRP) by Public Safety Canada and NSERC. xv 1 INTRODUCTION 1.1 Domain Heterogeneity Many applications require querying multiple databases with heterogeneous schemas. We refer to systems that translate queries over databases and thus support querying multiple, heterogeneous, independently maintained databases as SEMantic Integration System (SemIS). Many semantic integration architectures exist, including data integration (e.g., [51, 52]), peer data management systems (PDMS, e.g., [7, 53]), and dataspaces [45]. Existing semantic integration approaches generally focus on cases where the schemas are heterogeneous, but the entities in the sources are from the same domain — the data sources are domain homogeneous. For example, querying independently maintained bibliography databases is a widely used example for integrating domain homogeneous data sources. Although bibliographic records may have different schemas in different data sources, the schemas represent the same kind of objects. Semantic integration must also manage domain heterogeneity. For example, the 50+ Amazon public data sets [5] are categorized into 8 domains and the 7, 144 databases on Freebase [46] (as of April 2010) belong to 86 domains. Integrating domain heterogeneous schemas is more challenging than working with schemas in the same domain since it requires semantically connecting objects in different domains, where the objects and their attributes often are defined differently and serve different purposes. Although existing semantic integration systems 1 support simple transformations such as concatenating first name and last name in one schema to form a full name in another schema, integrating domain heterogeneous data requires additional support. We first present a case study which has roots from the Joint Interdependent Infrastructure Research Program (JIIRP) where people make preparations for possible natural disasters. We can see that even for a small scale study, the problem of integrating DH databases surfaces. 1.1.1 The JIIRP case study JIIRP [58, 64, 65] aims to provide decision support information upon the occurrence of natural disasters such as earthquakes and tsunamis. The researchers built a simulation system (named “I2SIM”) [85, 86] which models the interdependencies between the crucial lifeline infrastructures in an area and up to city and nation wide. For example, JIIRP combines crucial information to simulate such events as “how much functionality remains in a hospital within 24, 48 and 96 hours after the city of Vancouver is hit by an earthquake of intensity 8.0?”. To this end, information from outside of the JIIRP domain such as geographical information, and building damage assessment information are collected. While in JIIRP’s domain of interdependency research, objects that abstract the infrastructure functionality and interdependencies are created and used in the simulation model as well as the simulator software. For example, a type of object called “Cell” is widely used in I2SIM to represent the functional infrastructures that provides lifeline resources and services [75]. The cells are defined by the JIIRP researchers to focus on specific resources or services. For example, the hospital at the University of British Columbia is modeled as a “cell” that provides medical services. While cell objects in the JIIRP model are created by researchers who identify its functionality (for hospital, providing medical services), the physical properties of the cells need to be retrieved from the aforementioned geographical and damage assessment databases belonging to separate domains, as demonstrated in Example 1.1.1: Example 1.1.1. [Integrate Cells and Buildings] Table 1.1 shows a snapshot of cells and building damage assessment. A cell’s damage (Cell.damage) is estimated 2 by averaging the damage to the constituent buildings (avg(BdnDmg.Damage)). Monetary loss (Cell.loss) is estimated by summing the buildings’ losses (sum(BdnDmg.Loss)). Lacking systematic integration, the Cell table is manually populated by aggregating records in the BdnDmg table for each cell (e.g., C1 and C2). This is challenging because domain experts in infrastructure interdependency are unfamiliar with seismic damage assessment [123]. Additionally, the laborious process to populate one cell (e.g., a residence block consisting of about 20 buildings) discourages users from defining new cells or updating cell attributes. Our goal is to systematically transform queries on cells into aggregate queries over building seismic assessments and process the translated aggregate queries in a SEMantic Integration System (SemIS) (i.e., to compute and return the “unknown”s in the Cell table in Table 1.1). Domain A: Infrastructure Interdependency. Relation: Cell cellid cellname shape damage loss C1 Hospital s1 “unknown” “unknown” C2 Power House s2 “unknown” “unknown” Domain B: Seismic Assessment. Relation: BdnDmg DmgBid Name Intensity Damage Loss (dollars) 3 Koerner VIII 0.4 9,329,501 4 Purdy VIII 0.35 3,574,677 ... ... ... ... ... 158 Power House VIII 0.65 545,833 159 Meter Station VIII 0.4 1,324,292 ... ... ... ... ... Table 1.1: The cell and seismic damage schemas In Example 1.1.1, buildings and cells are defined independently of concepts belonging to two different domains. Researchers working for JIIRP focus on the functionality of the physical infrastructures and do not necessary know how physical damage to building structures are assessed by seismic scientists. The hospital cell is defined to include the buildings where medical services are provided as well as the surrounding supporting facilities such as the hospital’s parking lot and 3 backup generators. The damage assessment is done having no prior knowledge of how in the future a “hospital cell” will be defined over the buildings in the assessment; and damage to parking lots and backup generators are assessed in a separate assessment than that for the buildings. The cell and the buildings are connected via the cell definition. In the JIIRP case, the hospital cell refers to a GIS system1 for the buildings and other facilities to include in it. The requested damage estimates to cell functionalities and monetary loss (e.g., C1, the hospital cell) can only be manually computed by running the corresponding aggregates over the constituent buildings in the building damage database. We want to investigate how this process can be automated so that for any cell defined in the domain of infrastructure interdependency, the required attributes such as “damage” and “loss” can be automatically retrieved from databases in, and not limited to the domain of seismic assessment. The challenges here are twofold. First, an aggregation that answers the above questions needs to make use of knowledge belonging to different databases in separate domains; the current semantic integration models do not maintain all the necessary meta-information for the completion of rewriting and processing an aggregate query. For example, users of the Cell relation in Table 1.1 only know about the cells themselves; the users may not know or care that the cells are defined as a set of buildings or there are mappings that involve multiple databases. Additionally, the information describing how damage of individual buildings is aggregated will need to be maintained by the SemIS, instead of relying on users’ prior knowledge. Second, for a user to fully specify such aggregations, not only it is unreasonable to assume that the users know the schema and availability of all participating databases of an SemIS, they are unable to make informed choices of which databases to use especially when the SemIS grows large. Continuing our running example, damage assessment data for buildings is managed in multiple databases with different schemas. In this case, completing the aggregation to answer the query for the damage of cell “C1” in Table 1.1 requires more than one database to form an aggregate network for the task. This kind of domain heterogeneity is widely observed in JIIRP. The I2SIM object “channel” is defined as a resource transmitting paths among cells, which a 1 ArcGIS is used to manage geographical information in JIIRP. 4 channel has its physical correspondence similar to what cells have. Another simulation object “distributor” needs to care about both the physical ability of distributing resources and the disaster responding decisions — a need of integration of even more abstraction levels. Query answering with domain heterogeneous data sources more frequently integrates data objects at different granularities, thus more complicated semantic connections are required to be maintained by the SemIS. The SemIS must be able to use the extended cross-domain, cross- granularity knowledge to answer queries. We as the SemIS designers need to answer “how to integrate objects from different domains and different granularities by generating proper aggregations to answer queries?”. Furthermore, we need to explore ways to efficiently answer such queries. We need to provide answers to “how hard is the problem of processing aggregations in semantic integration?” 1.1.2 Characteristics of domain heterogeneity It is surprisingly difficult to apply existing data integration techniques to answer the seemingly straight-forward question in Section 1.1.1. To compute the cell damage and monetary loss, a query over cells needs to be rewritten into a number of aggregate queries over buildings. As will be analyzed in more detail in Chapter 4, query translation of this kind requires the SemIS to support the new type of heterogeneity: the query translation needs to handle the gap between objects in two domains and of different granularities. Current semantic integration solutions neither are designed to answer questions of this kind nor have the ability to do so. Therefore we give a name “domain heterogeneity” to the semantic integration problems where there is integration carried over multiple domains such as the one revealed in the JIIRP study. This thesis is devoted to solve some of the imminent problems rooted in domain heterogeneity. The JIIRP case study in Section 1.1.1 motivates our interest to study the impact of this extended heterogeneity in semantic integration. We now compare the “domain heterogeneity” we learned from the JIIRP case study with the “schema heterogeneity” that has been studied and problems solved for many years (and new problems still being studied now). Schema heterogeneity refers to the way that two databases describe the same 5 type of objects. This includes the different representations of an object and the different associations between objects as made by the two databases. Domain heterogeneity, on the other hand, refers to the fact that different sets of types are maintained in two databases. Databases which are domain homogeneous but schema heterogeneous are discussed intensively in existing data integration applications e.g., the cases of integrating bibliography records in different online publication databases. There are many fewer systems that focus on applications which require integration of domain heterogeneous data sources. JIIRP is one typical case that requires this level of integration. In JIIRP, we need data that describe physical properties of land, buildings and critical facilities; we need damage assessment databases for various lifeline infrastructure including electricity, water, gas, communication; we need data that describe service information for security, public health and emergency response policies. All the data needs to be integrated in order to be useful for JIIRP, e.g., to run an earthquake simulation. A query (e.g., from the JIIRP simulator) usually involves integrating knowledge from cross domains like the example in the case study. In our experience, schemas from different domains are used in different areas of human endeavour and they act as representations of different specialized disciplines. While many concepts are still reflected or derived from the same physical objects in the real world, their definition, representation, exact semantics and their semantic connections to other concepts are different. The challenges brought by the heterogeneity between cross-domain schemas are directly observed in semantic integration: 1. concepts using the same name carry different semantics in different schemas; 2. the same real-world concept is modeled differently in different schemas; 3. many concepts have no direct equivalents across domains since they do not appear in both domains; even if different domains represent similar concepts, the concepts may be at different levels of granularity; 4. one-to-one mappings between schemas are often insufficient; 5. aggregations and transformations are needed to map schema elements; 6 6. knowledge from multiple domains are needed to establish mappings and to obtain other meta-data information; 7. queries more often target attributes of pre-defined objects therefore mappings that connect data instances are desired. Objects from different domains usually do not very often need exchange information; and this may explain why the data integration community has not paid enough attention to integrating domain heterogeneous data sources. The field of infrastructure functionality study did not intersect with the field of seismic study until experts from JIIRP identified the need of connecting them to perform interdependency study for disaster management. In such a case, efforts are made to establish semantic connections between domain heterogeneous data objects. We need to answer queries on objects from one domain by extracting information in objects from another domain using the knowledge embedded in the connections established by experts from both seismic research and interdependency research. In addition to the query answers to be computed for cells and channels, we are also interested in the connections between domain heterogeneous objects, which carry more information than schema mappings between domain homogeneous databases. In this thesis, we identify and resolve challenges for integrating domain heterogeneous data sources such as those in JIIRP. We believe the more problems we study for domain heterogeneity, the closer we are to the true meaning of this higher level of heterogeneity. Therefore, instead of trying to give a definition to domain heterogeneity, we study newly emerged problems and leave the definition open. This does not prevent us from solving the newly identified challenges. When we eventually gain enough knowledge to clearly define this higher level of heterogeneity, many problems in integrating domain heterogeneous data sources can already be answered. In this thesis, we answer the following two questions: 1. How do domain heterogeneous data sources affect query answering in a SemIS? If there are negative impacts to query answering, how can we identify, avoid or reduce the negative impacts? 7 2. How can we answer queries using objects from different domains, assuming they are modeled in different granularities? We noticed that the above two questions also apply to a SemIS that has only domain homogeneous data sources. For example, the acquaintance selection technique in Chapter 3 also helps to better organize data sources in the same domain; the DAQ query processing framework in Chapter 4 can also help answer queries among databases in the same domain as long as objects at different granularities are to be integrated. We emphasize the domain heterogeneity mainly because the above problems are more pervading when databases from different domains are put together. 1.1.3 Domain heterogeneity in other applications Domain heterogeneity is found in many other applications besides JIIRP; integrating this higher level of heterogeneity is useful in general. In another research project that involves estimating building costs, we find the following scenario: External Wall Window CSIcode Description Unit Cost 81000010 Door, Hollow Steel, Single each 1500.00 82000010 Door, Hollow Core, Single each 1000.00 82000040 Door, Solid Wood, Double each 2500.00 87000010 Window, 2x3 glazed, Single each 400.00 99000220 Walls, Dry Wall sqft 0.75 ... ... ... ... Internal Wall Single Door Double Door Figure 1.1: A room design and price quote database Example 1.1.2. [Room-Component Example] An architect needs to know how much it will cost to build a room according to its design blueprint. Figure 1.1 shows a blueprint of a room and a component price database from a provider. The cost of building a room is estimated by summing up the cost of the components used for the room. Currently this is done manually although both the blueprint and the price information are managed by databases. In current practice, 8 an architect needs to submit the blueprint to the general contractor and wait for the cost to be estimated. It not only takes a long time and a lot of labor to estimate the cost for a complex building, but is also very inconvenient when the architects modify the design and need to have the cost re-estimated. In this example, the architect works in the domain of architecture which focuses on constructing a room and choosing the right combination of materials to meet the functionality and design requirements; the general contractor just offers services to quote the cost of materials chosen by the architects. The two domains are apart but need integration when we want to automate the cost estimation process. Again, helping the architect with the cost estimation task is non-trivial. Although in this case it is easy to choose sum as the aggregate function and the components of the room are clear — they are explicitly marked in the architect’s design. The aggregation this time is not computable from one single contractor’s database: an individual contractor usually does not provide all the components in a room. Further, as the prices of the same component are different when quoting from different contractors’ databases, the semantics of the final value to use as the “cost estimate” require more precise definitions. Both the scenarios in Example 1.1.1 and Example 1.1.2 demonstrate the need to answer queries over an object using its constituent components. This is very common in domain heterogeneous situations as objects are not created in a unique taxonomy and they in many cases are different in terms of granularity. Therefore in a domain heterogeneous environment, it is common to find cases where information for objects in one granularity needs to be computed by composing information from another type of objects in a finer granularity. Next we give another example showing another challenge for a system to support DH data sources. In this example, we see that domain heterogeneity does not always bring new integration opportunities, and special care needs to be taken to ensure that the SemIS is still able to answer queries as it does for domain homogeneous data sources. Example 1.1.3. [DH Sources] Consider the SemIS in Figure 1.2. In response to a recent earthquake, four cell phone companies (A LPINE ( C .A), B OLD ( C .B), C HEAP ( C .C) and F OREVER ( C .F)), a landline phone company (D ECENT ( P.D)), 9 a home electric company (E LECTRIC ( E .E)) and a TV cable company (G REAT ( TV.G)) have created a small number of mappings (shown as edges between nodes in Figure 1.2; the edge thickness represents the amount of schema elements mapped between the two sources). Peers that are directly connected to each other through such semantic mappings are called acquaintances; e.g., TV.G’s only acquaintance is C .F. Peers may vary substantially w.r.t. the schema and data they hold. For example, C .C has a lot in common with C .B, and can also be well mapped with P.D. The peers E .E and P.D are mapped on their shared utility poles. Queries are translated over the established mappings by the data sources. For example, a query from TV.G is translated through C .F before it can be processed at E .E. Electric Home (e.E) Forever Cell (c.F) ? Decent Phone (p.D) Cheap Cell (c.C) Bold Cell (c.B) Great Cable (tv.G) Alpine Cell (c.A) Figure 1.2: An example of the impact of acquaintance to query answering Figure 1.2 shows the importance of selecting acquaintances and its impact on query answering. While in a domain homogeneous scenario different choices of acquaintance also affects query answering, in a domain heterogeneous setting the query answering ability of the SemIS is much more easily affected by the selection of acquaintances. Carelessly choosing acquaintances may greatly retard query answering even though data sources are still seemingly well mapped to other sources (e.g., the situation of C .F). We thus need a mechanism to optimize the mapping “topology” so that the overall query answering ability of a SemIS can be optimized. Next we briefly survey the existing SemIS architectures and we will see that the PDMS we choose to integrate domain heterogeneous data sources does need to consider the acquaintance selection problem as shown in Example 1.1.3. 10 1.2 Semantic Integration and Peer Data Management Systems (PDMSs) In this section, we briefly review the existing systems and architectures for semantic integration. While Chapter 3 and Chapter 4 have their own sections of closely related work surveyed, this section provides a better understanding of the new challenges domain heterogeneity brings and its impact on different architectures. 1.2.1 Semantic integration architectures People have proposed a number of different architectures for semantic integration. We can roughly classify the architectures into three categories based on their different treatment of mappings and query answering. The first category is an architecture that uses a mediated (i.e., globally agreed upon) schema or a hierarchy of schema mediations. All the databases map their schemas to the mediated schema so that a query to one database can be translated to queries on other schemas with the help of the schema mediation. The benefit of such a system is that a query translation always takes 2-hops from one database to any another database in the system. The difficulty of this architecture is the creation, maintenance and synchronization of mediated schemas among distributed data sources. Although in theory it is possible to extend this architecture to organize data sources with a hierarchy of mediated schemas (e.g., the structure in Figure 1.3), the high maintenance cost and inflexibility in system structure often limits the scalability and usability of such an architecture. The second category is the architecture that is built on pairwise mapped data sources, called a peer data management system (PDMS). The Piazza [53] peer data management system was initially designed to organize data sources in a freeform, pairwise-mapped network2 to achieve high flexibility and scalability. In this architecture, data sources are more freely mapped together and form a graph topology and a query is translated along the pairwise schema mappings between data sources and usually it takes multiple hops to translate a query from its originating data source to a “faraway” data source where answers are retrieved. Although query translation seems to be more complicated than in a mediated approach, a 2 Later it also adopted a certain level of schema mediation. 11 PDMS allows data sources to form an arbitrary graph topology and is free from schema mediation overhead. Hyperion [7] is a PDMS system that allows mappings to be created both on the schema level and on the data instance level. This is the architecture we have chosen to solve the domain heterogeneity problems. The third architecture is rooted from data exchange where the goal is to synchronize data between data sources. Orchestra [57] is a system adopting this approach. In Orchestra, queries are translated among data sources to update and keep data synchronized. Figure 1.3 illustrates the three architectures and their query answering mechanisms. Data updates Query translation routes data source wrapper database PDMS Federated Data Exchange Query Node Answer Node Figure 1.3: The three architectures: PDMS, federated hierarchy and data exchange compared: the PDMS supports flexible topology and answers query in multi-hop query translations with the help of other data sources; the federated approach uses a hierarchy of mediated schema to help query answering; the data exchange approach focus on keeping data sources synchronized on their contents. We choose the PDMS architecture to start integrating domain heterogeneous 12 data sources. Comparing the 3 architectures, we believe the PDMS architecture is a good choice to host domain heterogeneous data sources, for the following reasons: 1. Domain heterogeneity makes it hard to perform effective schema mediation, therefore decentralization brings convenience. Domain heterogeneity enlarges the semantic gap between schemas in different databases, thus making it more difficult to create or maintain a mediated schema that satisfies the needs of all data sources. For example, the same attribute name may carry very different meanings in a domain heterogeneous environment3 , thus a mediated schema will have to rename the conflicting names but new names for the purpose of mediation are difficult to be accepted. A cross-domain mediated schema will need to combine domain knowledge of all the domains, which is very hard to create in practice. Moreover, as a data source can often be better mapped4 to data sources in the same domain, and have only a small fraction of schema elements to map to schemas in other domains, synchronizing a big, cross-domain mediated schema is infeasible or inefficient in practice. A PDMS architecture that does not rely on a mediated schema, on the other hand, is much more flexible to deal with cross-domain knowledge. A PDMS does not require mappings to exist between all pairs of domain heterogeneous data sources, instead, only a small number of such mappings are created and maintained between such “bridging” sources. 2. The flexible, incremental, “pay-as-you-go” mapping mechanism is easy to implement on the PDMS architecture. The authors describe a pay-as-you-go data integration mechanism in [108] to reduce the cost of deploying a semantic integration by amortizing the high cost mapping operations to the whole lifespan of the integration system. In this model, mappings are incrementally added between data sources when needed, or when there are resources to do so. In a PDMS, each individual data source can independently decide when and what mapping to create, thus is very suitable for implementing the 3 For 4 For example, “K” stands for 1000 in physics but 1024 in computer science. example, measure the effectiveness in term of the fraction of schema elements get mapped. 13 pay-as-you-go mechanism. Also, we observed that creating mappings between schemas belonging to different domains is more difficult and usually takes more time and effort than for schemas in the same domain. Therefore, the PDMS is a good architecture to allow mappings between domain heterogeneous data sources be incrementally created and maintained. 3. Applications like disaster response require fast collaboration of data sources assuming no pre-determined hierarchy or internal structures. In JIIRP where data is integrated for decision making during or after disasters, we cannot always assume that the networking and communication infrastructures are fully functional or stable connectivity is present. In this case, the data sources may have to quickly organize into an ad-hoc peer to peer network at the networking level. The PDMS architecture is a natural fit for scenarios such as these. Based on the analysis above, the PDMS architecture is chosen as the platform to study the integration of domain heterogeneous data sources. 1.2.2 Two questions from domain heterogeneity on PDMSs An immediate task is to identify the new challenges that domain heterogeneity brings to semantic integration. We find that the following two questions require immediate answers: 1. When a PDMS has domain heterogeneous data sources and mappings are created between data sources in the same domain as well as in different domains, will the topology formed by the data sources and mappings affect the query answering ability of a PDMS? And if it does, how can we obtain a “good topology” and avoid “bad” ones? 2. Are the queries supported in the current PDMS sufficient to integrate domain heterogeneous data sources? If not, what new query types should be supported, and how should we process them under the PDMS architecture? The first question comes from the observation that when data sources come from different domains, the mappings and the process of creating mappings become considerably different from the case that only domain homogeneous data 14 sources are present. For example, the fraction of schema elements that can be mapped between two data sources from the same domain is considerably larger than that between two domain heterogeneous data sources. The increased “variance” of mappings’ effectiveness makes it more important to maintain a SON topology that promotes the overall query answering ability of the SemIS. The second observation obtained from the case studies we have seen in Section 1.1.1 is that we need to support a new type of query which requires integration of objects from different domains and in different granularities. The work presented in this thesis provides answers to the above two questions. We first give a quick overview of the exact problems we investigate and a sketch of the solutions. Then we discuss in the following chapters in greater detail. 1.3 Overview to Thesis Contribution In this section, we give an overview to the solutions we give for the questions raised in Section 1.2.2. We provide answers to the two questions via the study in two projects. The first project is called “Acquaintance Selection” in which we investigate the problem of how to guide domain heterogeneous data sources in selecting acquaintances to create schema mappings with, so that the overall query answering ability of the PDMS can be improved in two ways: 1. providing good suggestions of acquaintances to new data sources joining the PDMS and 2. helping a data source already in the PDMS to select or switch to better acquaintances than it currently has. We propose two algorithms that help to predict the benefit query answering will receive if two data sources are mapped, before the actual efforts are spent to map the data sources. The acquaintance selection operation essentially adjusts the topology of the data sources in a PDMS so that the negative effect that domain heterogeneity has on query answering is minimized. Chapter 3 is devoted to the techniques developed in this project. The second project is called the “decomposition aggregation query (DAQ),” which refers to the introduction and processing of a new type of query called the decomposition aggregation query (DAQ) which helps integrate domain heterogeneous data sources. We investigate, through answering queries by using aggregates in a PDMS integration system, the possibility and challenges of integrating objects 15 defined in different granularities. We developed query rewriting algorithms to establish the semantic equivalence between the attributes of objects and aggregations over their constituent components. Chapter 4 and Chapter 5 are devoted respectively to the query rewriting techniques and query optimization techniques we have developed. The techniques described in this thesis focus on the relational data model, i.e., the databases being integrated are relational databases. We choose to focus on relational data model for the following reasons. First, large scale data is more often managed in relational databases, and aggregates are more desired for analytical processing (e.g., in applications requiring OLAP) over the data. Second, Datalog is widely used to formalize and represent queries and mappings. The Datalog language has an inherent connection with the relational model, i.e., the syntax of Datalog directly and naturally connects with the relational model. Hence our query rewriting algorithm is also developed using Datalog and operations over Datalog rules. Third, the data in the JIIRP project where the research problems are motivated is managed in relational databases and spatial data (e.g., the maps of the buildings) are also managed in a relational fashion by the GIS. We acknowledge that domain heterogeneity exists beyond the relational model and relational data and our techniques, in fact, can be adapted to handle other types of data. For example it is possible to create wrappers for data sources that manage data in XML and allow relational queries to make use of these sources. This thesis is organized as follows. In Chapter 2, we describe preliminaries, including terminology and existing technologies, models and existing results used or related to our later development. Chapter 3 describes in detail the “Acquaintance Selection” project. Chapter 4 describes the novel decomposition aggregation query framework and the query rewriting techniques and Chapter 5 describes the query optimization work. Chapter 4 and Chapter 5 together reports our research for the “Decomposition Aggregation Query” project. We embed empirical studies and summarization of related work inline with corresponding chapters that present details of our techniques developed. And finally in Chapter 6 we conclude the thesis and discuss some possible future work. 16 2 PRELIMINARIES In this section, we cover the preliminaries for the fundamentals of query answering in PDMSs. Readers who are familiar with query answering in PDMSs can safely skip Section 2.1. In Section 2.2, we summarize the cost of operations in a PDMS with domain heterogeneous data sources. This section can be used to obtain an overview idea of why and how our efforts are spent to improve the performance of semantic integration. 2.1 PDMS Query Answering We start the introduction of PDMS query answering by defining the fundamental concept “data source”: Definition 2.1.1. [data source] A data source (ds) is a tuple (S, D, M) where S is a relational schema; D is a database in schema S that contains a set of relations D = {R1 ..Rm }; and M is a set of pairwise mappings (Definition 2.1.4) from ds to other data sources. ds has the ability to (1) process relational queries written in schema S using database D; and (2) translate queries written in schema S to queries in an acquaintance’s schema using M. A data source is an abstraction of the general term “database”; we also assign it the ability (and the responsibility) of performing query answering operations such as query translation and query processing with local databases. With data sources taking the definition above, the entire SemIS is decentralized so that query 17 answering does not depend on any other entity or a particular one or a set of data sources. 2.1.1 Mapping and query translation Different data sources have different schemas; processing queries requires translating them from one schema to another. Semantic mappings relate elements in one schema to elements in another. Some semantic integration architectures use schema mappings [69, 73], while others data mappings [7]. To be as general as possible, we allow both. The concepts are formally defined as follows: Definition 2.1.2. [schema mapping] A schema mapping from a source schema σ to a target schema τ, denoted as sm(σ , τ) is a first order formula in the form: ∀x∀yϕσ (x, y) → ∃zψτ (x, z) where ϕ and ψ are first order formulas and x, y, z are sets of variables and constants [69, 73]. Definition 2.1.3. [data mapping] A data mapping from schema σ to schema τ, denoted as dm(σ , τ), is a mapping table in the form of a relation M(α, β ) where α and β are sets of attributes from σ and τ respectively [7]. Definition 2.1.4. [(pairwise) semantic mapping] A semantic mapping from (source) schema σ to (target) schema τ, denoted as m(σ , τ) is either a schema mapping sm(σ , τ) (Definition 2.1.2), or a data mapping dm(σ , τ) (Definition 2.1.3). We also adopt the definition of mapping composition in [80]: a composition of MA B and MB C , denoted as MA∗ C = Comp(MA B , MB C ) is a mapping to directly translate a query written in schema A to schema C without the need of intermediate schema B. Note that a composed mapping is not equivalent to a direct mapping (e.g., mapping MA C ) — the effectiveness of the composed mapping is affected both by the base mappings (e.g., MA B and MB C ) and the mapping composition algorithm. We assume that the PDMS is unstructured (i.e., the network is not required to conform to a particular topology). A peer queries the PDMS by flooding its query to the network. Queries are re-written along established schema mappings in the network and are processed on every peer. For example in Figure 1.2 (Section 1.1.3), a query q from C .C will be re-written to q in B’s schema using the 18 mapping between them. Then C .B will both answer q and forward q to C .A by re-writing it into q using its other mapping. 2.1.2 Relational data model and Datalog language Although SQL is by far the most commonly used relational query language, Datalog is a more convenient choice to perform inference and describe query translation algorithms. Therefore, in semantic integration, especially for relational databases, query and mapping rules are often expressed in Datalog. We briefly introduce this language here. A Datalog rule in the form of head :−body is named after its head. For example, in a rule Q: Q(x, y) :−R(x, y), x > 3, Q(x, y) is the head and we use Q.head to denote it and use Q.body to refer to its body (R(x, y), x > 3). Additionally, the body is a “condition” if it is “a conjunction of relational atoms and built-in predicates” [28]. A rule defines how we can compute tuples in the format of the head from the body expression. Continuing the above example, we can view R as a relation (or a table in a database) in a schema having two attributes. Rule Q retrieves the tuples of R in which the first attribute is larger than 3 and outputs both attributes of the tuples satisfying the condition. SQL queries can be written in an equivalent Datalog rule which often has a more compact representation. For example, the following SQL query: SELECT cellname,damage FROM Cell WHERE cellid = ’1001’ AND damage > 0.6 on relation Cell(cellid, cellname, shape, damage, loss) (Table 1.1) can be written as Q(x, y) :− Cell( 1001 , x, a, y, b), y > 0.6 The join operation in the relational model is represented in Datalog by using the same variable in the relations (called “atoms”) appearing in the body of a Datalog rule. For example, the following two rules generates tuples in the format of the 19 head by joining the two body relations R1 and R2 : Q1 (x, y) :− R1 (x, y), R2 (x, z) Q2 (x, y) :− R1 (x, y), R2 (u, v), x > u Q1 joins R1 and R2 using a equi-join on the first attributes of both relations and Q2 specifies the join condition (x > u) in the predicate part of the rule body. We use the Datalog extension in [28] for aggregate queries. An aggregation is specified in the head of the query, non-aggregated variables are by definition group-by conditions, and having clauses are expressed as predicates in the body. For example, q(sum(x), y) :−R(x, y), sum(x) > 5 is equivalent to the SQL: SELECT sum(x), y FROM R GROUP BY y HAVING sum(x) > 5 Our mapping rule set is also defined in Datalog. Although the following definition of mapping is not required by the technique in Chapter 3, we give the definition here. Definition 2.1.5. [mapping rule set] A mapping rule based schema mappings (Definition 2.1.2) is represented as a set of three Datalog rules. The general form is: Q1 (x) :− A1 (r1 ) Q2 (y) :− A2 (r2 ) M(x, y) :− Q1 (x), Q2 (y), pred(x, y) (r3 ) where A1 and A2 are conditions, x and y may contain variables and constants. pred is a conjunctive boolean expression where each clause is of the form ar1 (x) op ar2 (y) where arh1 and ar2 are arithmetic expressions, and op ∈ {<, >, ≤, ≥}. All queries in a three rule mapping are safe: variables in the arithmetic comparisons and the head of the query appear in subgoals [51] of the body. For example, the following mapping Mapping 2.1.1 expresses the idea that a cell is defined as the buildings in the specified area of the cell, assuming the relation BdnGIS is 20 BdnGIS(bulidingid, position), and in area(p, s) is a built-in predicate that holds when a position p is inside a shape s. Mapping 2.1.1. [Cell to Buildings] Q1 (x, s) :− Cell(x, s, , ) (r1 ) Q2 (y, p) :− BdnGIS(y, p), y < 1000 (r2 ) M(x, y) :− Q1 (x, s), Q2 (y, p), in area(p, s) (r3 ) The unification(∧) operation [122] on Datalog rules is required during translation of queries. The unification of two Datalog rules r1 and r2 , denoted as r1 ∧ r2 , has the following property: given any database D, the records in D that satisfy r1 ∧ r2 are the records that satisfy both r1 and r2 . Formally, we define the unification function uni f ication(r1 , r2 , B) where r1 and r2 are two Datalog rules, B is a set of substitutions also called the “unifier” and the function returns the body of a rule that is equivalent to r1 ∧ r2 . For more details on unification operator, see [122]. 2.2 Cost of Operations Query answering in a PDMS involves a number of operations; and each operation implies some cost to the system. While we always assume data sources are already mapped when processing a query, establishing mappings between data sources should be also considered as a cost of building a SemIS. We list below the cost considerations and we believe they would help to understand our optimization goals and quality/speed trade-offs later defined in the course of investigate individual problems. 1. Creating new mappings between data sources. Creating mappings is considered as a high cost task as it usually involves human effort in creating and/or verifying mapping rules for two schemas. Therefore, it is useful optimization to reduce the number of necessary mappings, or to maximize the query efficiency with a limited number of mappings. 2. Translating a query to a mapped data source. We first value the scope and accuracy in terms of the completeness. Is the translation equivalent or a max21 imal contained rewriting [100]? Does the algorithm support a wide class of queries? Then we seek query translations that run fast, e.g., query translation that has time complexity in P (w.r.t. the size of the query). 3. Translating a query several times via a long path of mappings. In a PDMS, a query is translated in multiple hops along the mappings between data sources to reach a source that is far away from the original querying source. Translating queries in multiple hops is an information lossy process. We wish to shorten the hop length between the querying source and the data sources that potentially answer the query. Example 1.1.3 shows such an case that requires optimization. 4. Sending relations in the network. Consider a disaster management case where data sources are inter-connected via a multi-hop, wireless ad-hoc network. Sending large relations across the network almost always costs significantly more than sending just queries. The operations of de-duplication and aggregation should be performed as locally as possible. We will see in Chapter 4 how the communication overhead of transmitting relations is reduced. 5. Performing joins on data sources. Join operations are usually the most expensive relational operation. With the help of indexes, the join cost can be considerably reduced. For relations sent over the network during query processing, data sources usually do not create indexes for them. In this case, we should try to reduce the number of joins, especially joins over the relations sent over the network. In Chapter 5 we show how to prevent unnecessary joins during query processing. 6. Using multiple data sources to answer a query. As data sources process translated queries in parallel, the total time needed for processing a query is not a big concern when multiple data sources collaborate to process a query, i.e., a query processed using more data sources does not take significantly more time than using few data sources. However, using more data sources will occupy the computational resources on these data sources and potentially also increase the networking overhead. Therefore, from the SemIS 22 point of view, a query is preferred to be processed using as few data sources as possible. In Chapter 5, we show how the number of data sources used to process a query can be reduced. 23 3 ACQUAINTANCE SELECTION In this chapter, we describe in detail our approach to address the first question asked in Section 1.2.2. Our research on optimizing the PDMS semantic overlay network (SON) shows that with domain heterogeneous data sources present, it is very necessary that a data source carefully chooses the acquaintances (the sources which a data source creates schema mappings to) to boost the query answering efficiency. This work was published in the International Journal of Cooperative Information Systems [127]. This chapter is organized as follows. Section 3.2 and Section 3.3 describe background and related work respectively. Section 3.4 describes the acquaintance selection framework including the primitive operations and effectiveness measures used in selection. Section 3.5 describes two selection schemes and analyzes them in detail. The empirical evaluation is presented in Section 3.6. 3.1 The Task of Acquaintance Selection As we have already introduced in Section 1.2.1 and Section 2.1, query answering in a PDMS involves translating queries along semantic mappings and there has been considerable work on decreasing the cost of constructing direct schema mappings between two peers in a PDMS. Since the mappings are inherently difficult to create fully automatically, they are typically created semi-automatically. While semi-automatic schema matching techniques decrease the costs (see [36] for a re- 24 cent survey), it still remains too expensive to create mappings between one peer and all other peers in a PDMS. In Example 1.1.3, though creating a mapping between TV.G and all other peers will yield the optimal query answering for TV.G, TV.G may not have the resources to do so — particularly in a disaster management situation, where time is critical. Therefore, the goal of acquaintance selection, motivated by the difficulty in creating excessive pairwise mappings either by mapping composition or full/semi-automatic schema matching, is to carefully choose a limited number of acquaintances so that peers’ ability/usage of the PDMS can be maximized. In our continuing example, C .F may have trouble answering queries because it lacks a cell phone company as an acquaintance. Although queries from C .F can still be translated along the established mappings through E .E and P.D to reach C .C,query answering is limited, e.g., cell-phone specific queries on route to C .C may have to be dropped at peers E .E and P.D which lack cell-phone specific schema elements. The best new acquaintance for a peer may not be the candidate that has the best potential mapping effectiveness, where “effectiveness” describes the additional gain in the ability of the system to answer queries from the peer if it establishes schema mappings that are under consideration. Consider the example in Example 1.1.3; assume that the mapping between C .A and C .C is more effective than the mapping between C .A and C .F. If C .A is considering creating a new mapping, its best choice may be C .F instead of C .C because it already has a highly effective semantic path to C .C via C .B while there currently is no effective route to C .F. This example suggests that a peer should choose acquaintances that makes query answering more effective in the PDMS, but not merely choosing peers that maintains similar schema or contents. The benefits can come from a high-effectiveness mapping that let queries to be translated to the acquaintance; they can be gained from newly formed query translation paths that enable more peers in the PDMS to contribute to the query answering; additionally, the benefits on query answering can be obtained by short-cutting query translation paths so a query can be more efficiently translated to target peers. Thus, a good selection criteria must consider the effectiveness gain when a candidate is chosen as an acquaintance. Because the queries that can be translated across peers may vary greatly de25 pending on the acquaintances selected, it is imperative that a peer can find potential acquaintances that are likely to be of the greatest help on its query answering without fully creating the mappings involved. We define this as the acquaintance selection problem: given a peer i which may already have some acquaintances, how can i choose new acquaintance(s) to maximize its querying effectiveness in a PDMS? As shown above, the following two estimations need to be made for acquaintance selection : (1) estimating the effectiveness of the new acquaintance to help answer queries (from the host peer) and (2) estimating query effectiveness without the proposed acquaintance. As the above example has shown, the best choice may not be the data source that has the highest estimated mapping effectiveness. In our continuing example, C .A may choose E .E, which has a lower mapping effectiveness over C .C because the gain of creating a direct mapping with C .C is low. While in general an acquaintance selection process is carried out at new peers when they join a PDMS and is performed periodically by peers already in the PDMS to improve their query answering, the two solutions proposed in this chapter suit different situations respectively: the “one-shot” scheme pre-processes all peers in the PDMS and thus the selection process afterwards virtually takes no extra time. The second solution, namely the “two-hop” scheme, explores the network in multiple rounds and performs acquaintance selection using the information available locally at each round. Whereas one-shot is quite efficient when we know roughly the number of clusters that the peers form, two-hop can be used when this information is unavailable and it scales better with an growing PDMS. Additionally, two-hop is more adaptive; it refines estimations more easily when new information becomes available. A theoretical foundation of query translation is mapping composition [44, 80, 117], which has been shown, for general tuple generating dependencies (TGDs), to be NP-complete for many cases and even not first-order definable for some cases. Hence the scale of a PDMS — regardless of how acquaintances are selected — is limited; unlike a typical file-exchange peer-to-peer network, which can easily scale up to thousands of peers, PDMSs are usually formed by no more than several hundred peers. Our empirical study shows that both our schemes — which do not rely on any specific mapping composition algorithm — effectively help acquaintance 26 selection and scale to hundreds of peers. We make the following specific contributions for acquaintance selection: • We introduce acquaintance selection in peer data management systems and propose a first set of solutions to the acquaintance selection problem. • We propose an acquaintance selection framework to allow multiple selection schemes to co-exist in one PDMS and flexible selection criteria be defined. • We propose the “one-shot” and “two-hop” schemes; both solve the acquaintance selection problem and they suit different scenarios. • We empirically evaluate the effectiveness and efficiency of the two acquaintance selection schemes. 3.2 Preliminaries In this section, we cover preliminaries that are only used in solving the acquaintance selection, the materials below is not listed with topics in Chapter 2. 3.2.1 Graphical model The graphical model is widely used in machine learning to represent the dependencies of random variables and hyper parameters in a system. A graphical model uses nodes to represent random variables and an arrow between them to denote a dependency relationship. If a variable in the system can be observed, which means its value is determined, we shade it (e.g., X1 and X2 in Fig. 3.1). An observed variable is also called “evidence”. Latent variables, which have values that are never observed, are shown as unfilled nodes (e.g., Z1 and Z2 in Fig. 3.1). Typically, the observed variables depend on the latent variables. The goal of the graphical model is to use the observed variables to find the values of the latent variables. Figure 3.1 shows a graphical model representation for part of the system in Example 1.1.3. The top three variables Z1 , Z2 and Z3 represent the schemas of A, B and C respectively. Variable X1 is the quality of the mapping between A and B. This quality depends on the latent variables Z1 and Z2 which indicates their schema characteristics. In most cases, unobserved variables follow a distribution which is controlled 27 Bold cell α Z2 Z1 Alpine cell β Z3 Cheap cell 111 1111 000 0000 000 111 0000 1111 000 111 X2 0000 X1 1111 000 111 0000 000 1111 111 0000 1111 000 111 Mapping effectiveness of established mappings Figure 3.1: An example of graphical model by a set of parameters. These parameters are called “hyper parameters” and are represented by Greek letters pointing to the variables. In Fig. 3.1, Z1 follows Poisson distribution with parameter α and Z2 ’s distribution parameter is β . Given a graphical model of a system, we know the dependencies among the variables. In Fig. 3.1, we know that X1 depends on Z1 and Z2 but is independent to Z3 . For a detailed description of graphical models, see [15]. 3.2.2 MAP estimates Maximum a posteriori (MAP) estimation is widely used in statistical analysis to estimate an un-observed variable based on some observed data. [15] Consider the example in Fig. 3.1; the MAP estimate for the values of latent variables Z1 and Z2 can be written as (Z1 , Z2 )MAP = arg max Pr(X1 , X2 |Z1 , Z2 )Pr(Z1 , Z2 ) Z1 ,Z2 where P(Z1 , Z2 ) is the prior knowledge on the (joint) distribution of Z1 and Z2 . Using the mode of the distribution instead of the mean (as in a maximal likelihood estimator (MLE)) frees the estimator from computing the full distribution of the 28 variable being inferred. As we will see in our one-shot estimator, MAP estimates open the opportunity for scalable local search. 3.2.3 Expectation maximization (EM) Expectation maximization (EM) is a family of algorithms used to infer parameters in a system [15]. Consider the system shown in Fig. 3.1. Given the value of X1 and X2 , we want to infer the values of α and β . The EM algorithm consists of two steps. The “E-step” computes P(Z1 , Z2 |X1 , α0 , β0 ) and P(Z2 |X1 , X2 , β0 ) using a initial parameter α0 and β0 . Then the “M-step” updates values of α and β according to some update function. (In most cases to maximize the expected (log)likelihood value). The EM algorithm repeats the above E-step and M-step until the parameter(s) converge. As we will see in Section 3.5.3, the one-shot estimator uses a modified EM algorithm to obtain the required properties of the clusters. 3.3 Related Work People have long noted that selecting good neighbors can reduce networking costs in structured (e.g., Chord [116]) and unstructured (e.g., Napster, Gnutella) P2P networks. In [104] and [120], the authors discuss peer selection and grouping strategies to lower networking costs. Selecting peers to form groups and exploiting the locality opportunity (by getting as much as possible from nearby neighbors) has also been studied in works [21, 90, 114, 119, 126]. In [77], clustering information is used to reduce large scale flooding in the network. Some structured P2P works [19, 22, 31] study neighbor selection based on proximity information to enable efficient routing. In [68], the authors describe an approach to improve query routing quality for information retrieval on a clustering-based architecture. The WISDOM project considers efficient query routing for PDMSs. [82] Queries are passed between peers following semantic routing indices to achieve a good balance between query answering quality and networking cost. In [50] ontology information is used for routing queries. In [111], the authors discussed routed query answering and useful indexing services. While the above also focus on peer selection, they do not address the problem 29 we have here for acquaintance selection. The previous neighbor selection techniques improve networking costs/efficiency or to better route queries, while our focus is maximizing the semantic querying ability of a peer (i.e., number of answers). More recent works that discuss (re)arranging peer topology on the semantic overlay network (SON) can be found in [98, 101–103]. In [98] the authors adopted a multi-layer SON where each SON contains peers whose contents are considered to be in the same domain. In [101], peers in a P2P network are clustered by the “similarity” of their interest. Although the idea of grouping peers that have similar interest in their contents together is similar to what we have for the one-shot estimator, the use of clusters is very different. In these work, when two peers are considered to have similar interest, they are immediately grouped but in the oneshot estimator, the clustering method is used to discover separations of mapping effectiveness among peers in the PDMS— the clustering does not imply the eventual selection of acquaintances. In [102, 103], the authors investigated the re-wiring operation that also targets on connecting to peers that maximizes information retrieval efficiency. In a rewiring, when new connections are established old ones are abandoned. However, in PDMS query answering, this is not the case. As we will show , all cordless paths potentially contribute to query answering. There are also a number of related works in machine learning and pattern recognition. Pairwise clustering algorithms [42, 97, 109] learn the pattern of a data set given pairwise distance information, which is directly related to our one-shot acquaintance selection scheme. Another pairwise clustering approach reported in [106] also uses EM in clustering. Our approach differs substantially from theirs on the basic assumptions of variable distributions, the function to optimize and the detailed optimizing method, which theirs was developed for motion-segmentation applications. Other works [4, 109] suggest looking into longer paths in the network than only the pairwise relations. This motivates our development of the two-hop acquaintance selection scheme. 30 3.4 The Acquaintance Selection Framework In Chapter 1 we informally introduced the general acquaintance selection problem and in Section 3.1, more detailed motivations have been presented. We now give a formal definition to the problem. Definition 3.4.1. [Acquaintance Selection] Given a PDMS, a mapping effectiveness measure S and a host peer p which already has some acquaintances in the PDMS, choose for p a new acquaintance which satisfies a selection criteria c. We give formal definitions to the effectiveness measure and selection criteria in Section 3.4.2 and Section 3.4.3 respectively. The framework for acquaintance selection allows various selection schemes to be applied. It is also flexible on different effectiveness measures (S) and selection criteria (c) for different optimization goals in acquaintance selection. In Section 3.5, we discuss two selection schemes that suit different semantic integration settings. 3.4.1 Acquaintance selection operations An acquaintance selection is carried out in the following two scenarios : (S1) when a data source joins the integration network and (S2) after the integration network has run for a period of time. When a data source first joins a PDMS, it first (randomly) choose some acquaintances. With these initial mappings, it performs the selection to pick new acquaintances. For a data source already in the PDMS, it tries to discover new acquaintances which help to improve query answering. In both scenarios, it is possible to use information from already established mappings between data sources to suggest acquaintances for the new incoming data source and sources already in the network (e.g., the MF measure introduced later). For scenario S2, statistics on historic queries and other runtime information could be used in mapping effectiveness measures (e.g., the QF and DC measures described later). The acquaintance selection process is independent from query answering in the network and it fits any semantic integration architecture that relies on pairwise semantic mappings between distributed data sources. An acquaintance selection operation is always carried out by a host data source that wishes to find an acquaintance. The process generally consists of the following 3 steps. 31 1. Probe : The host data source collects information about other data sources in the PDMS and mappings among them. 2. Estimate : The host data source estimates the query answering effectiveness gain of mapping to candidate data sources. This step uses estimators to rank candidates according to the selection criteria. 3. Pick Acquaintance: The host data source picks the top ranked candidate as an acquaintance and establishes a mapping with it. The host data source then computes the true effectiveness of the newly established mapping to replace the estimation made in the previous step. The above three steps are enforced in the framework and every selection scheme implements them. However, the implementation can be different. For example, consider the two schemes we will discuss in Section 3.5: the one-shot scheme probes the whole PDMS and the two-hop scheme only explores a local set of “neighboring” data sources. From a system point of view, it is necessary to fit selection schemes into such a selection framework to obtain the flexibility of switching from one scheme to another. The term “data source” is more suitable when talking about mappings and query answering and the term “peer” more often refers to a physically distributed entity which hosts databases and semantic integration programs. In this chapter, we use “data source” and “peer” interchangeably. 3.4.2 The mapping effectiveness measure To decide which peer to pick as an acquaintance, the host peer must estimate the effectiveness of a mapping, if established, from itself to a candidate peer. We define such effectiveness with a numerical mapping effectiveness measure. In general, a mapping effectiveness measure is a function S : M → R that measures a mapping with a numerical value. We require a selection scheme to be able to use virtually any kind of effectiveness measure, while at the same time a “good” effectiveness measure should find balance between accuracy and time/space/communication complexity. For example, taking the fraction of attributes that are mapped between two schemas (e.g., 32 the mappings studied in [39]), the following “mapped fraction measure (MF)” can be used : Definition 3.4.2. [MF] Let sch(i) denote peer i’s schema and |sch(i)| be the number of attributes in sch(i). Let xi j be the distinct number of attributes in sch(i) that appear in mapping Mi j . Then the mapping effectiveness S(Mi j ) is defined as S(Mi j ) = xi j |sch(i)| For example, schema i has 50 attributes in its schema and 35 of them are mapped in M (we count the distinct number of attributes that appear in the head of all the mapping rules), then the mapping effectiveness is 0.7. The MF measure is intuitive since the more attributes that are mapped between the two schemas, the more queries that are likely to be answered with the mapping. Effectiveness measures considering other factors can also be used. For example, historic query statistics may give information on how mappings help translate queries, thus a measure may assign higher value to those that help translate more queries. E.g., the “QF” (query frequency) measure Definition 3.4.3. [QF] S(Mi j ) = #Queries that use Mi j Total #queries f rom i For example, if statistics show that 50% of the queries from peer i are translated using mapping M, then the QF measure for this mapping w.r.t. peer i is 0.5. Note that more than one mappings could be used when translating a query. Similarly, historic contributions of mapping on the size of the results retrieved can be considered: e.g., in the “DC” (data contribution) measure, the more tuples returned from queries using a mapping, the higher the mapping effectiveness is. Definition 3.4.4. [DC] S(Mi j ) = # Answer tuples returned using Mi Total # answers received by i 33 j For example, among all the query answers, if 20% of the answer tuples are contributed from query processing paths involving mapping M, then the effectiveness of M w.r.t peer i is 0.2. Different effectiveness measures favor different features in query answering. The MF depends only on the mapping itself and is easy to compute. QF and DC consider other aspects in query answering but require additional information to be maintained. The acquaintance selection framework takes effectiveness measures as an input so users are allowed to choose appropriate measures to fit their own needs. The only hard constraint for a measure to be used in acquaintance selection is that it should maintain the property of weakly transitive monotonicity (WTM): Definition 3.4.5. [Weakly Transitive Monotonicity] Given six peers i, j, k, i , j , k , and a measure S over mappings M, if S(Mi j ) > S(Mi Pr(S(Mi k ) ≥ S(Mi k )) > 1 − ε for small ε (ε j ), S(M j k ) > S(M j k )→ 0.5), where Pr is for probability, then we say S is weakly transitively monotonic. In essence, an effectiveness measure satisfying WTM allows inferring the effectiveness of yet to be established mappings between two peers by examining the effectiveness of other mappings. As we take advantage of statistical inference, this probability is required “weakly”, so exceptions are allowed as long as the WTM property is satisfied by the mappings from a statistical point of view. That is, the monotonicity does not need to strictly hold on every mapping in the PDMS. It’s easy to see that the MF measure is WTM. Given the above 6 peers, if there is a higher fraction of attributes in i, k mapped to j than i , k mapped to j , then the mappings between i and k are likely to map more attributes than mappings between i and k . Moreover, if we set j and j to be the same peer, then MF asserts that i, j, k are likely to be better mapped to each other. The WTM property is implied by the schema heterogeneity of the peers and can be reflected by various such numerical effectiveness measures such as MF, QF, and DC. For QF, peers within the same domain are more likely to help query answering, thus giving a higher QF measure; for DC, peers are encouraged to map to data sources with a larger data repository. Because the number of acquaintances a peer has, the query pattern of the peer, and the database instances on a peer could all affect the absolute 34 value of a measure, the effectiveness value itself does not tell if a mapping is truly helpful in query answering — we have to compare the effectiveness of one mapping to other mappings on the peer to tell how effective the mapping is. Note that a mapping’s effectiveness is not comparable across different effectiveness measures; for example, 0.7 for MF suggests that two schemas have many attributes mapped but a mapping with DC valued 0.2 could also make a good contribution to query answering when the peer has a large number of acquaintances. In order to make our algorithms concrete, throughout this chapter we use MF (Definition 3.4.2) as our effectiveness measure. This does not mean that MF is the “best” of all effectiveness measures, or even that we have explored all possible effectiveness measures. The choice of effectiveness measure in a selection scheme largely depends on the semantic integration scenarios and may vary from one scenario to another. We chose MF because it is easy to compute — especially when the peers have just formed the PDMS and no further statistics (e.g., query history) are available. The definition of an effectiveness measure is general and uses information that can be found in almost all schema mappings. Therefore, it will not be very difficult for existing semantic integration systems (e.g., PIAZZA) to perform acquaintance selections. In query answering, especially when pairwise mappings are the only type of mappings in the PDMS, a query needs to be translated multiple times (called “hops”) to a destination peer for answering. Each hop of translation uses mappings between two peers and all these hops form a query translation path. This kind of multi-hop query answering is described in [80] as the motivation for mapping composition. Therefore, the query translation path affects the query answering effectiveness between peers that are not acquaintances. We now define the “current aggregate mapping” and expand the concept of effectiveness measure to be it. The current aggregate mapping, as its name suggests, can be regarded as a virtual mapping that takes into consideration all the existing query translation paths from a host peer to a candidate peer in the PDMS. We use its effectiveness as the impact of the candidate peer to the host peer without a direct mapping being established. To define it properly, we first define a chordless path. Definition 3.4.6. [Chordless Path] In a graph G, a chordless path p from vertex i 35 a c b s d t f e Figure 3.2: An example of chordless paths: p1 (s, b, c,t), p2 (s, d, e, c,t), p3 (s, d, e, f ,t) are chordless paths from s to t. to j is a path that satisfies: 1. (simple): p is acyclic 2. (chordless): any vertex strict subset of p do not form a path from i to j in G Figure 3.2 shows an example of chordless paths in a graph; arrows indicates mappings between peers. The three solid paths are chordless paths from s to t while the two dashed paths are not (e.g., chord (s, b) disqualifies the upper path (s, a, b, c,t)). Consider the triangle (s − a − b) in Figure 3.2 and suppose queries are translated through paths (s, a, b) and (s, b). We say path (s, b) dominates path (s, a, b) if all queries that can be translated through (s, a, b) can also be translated through (s, b). If a path is dominated, then removing it from the set of query translation paths does not affect the answer of a query. Path domination has several implications. First, queries will not be translated along dominated paths to avoid duplicated processing of the same query; second, translating a query only on chordless paths speeds up query answering, while at the same time lowers the burden of the intermediate data sources. In determining mapping contributions, we typically consider chordless paths as the most effective query translating paths. To represent the contribution a query translation path offers, we use M(p) to denote an equivalent mapping which could translate just all the queries that can be translated along 36 path p for the two end peers of that path. Hence, the “Current Aggregate Mapping (CM)” from peer i to peer j is defined as the mapping created by the union of M(p)s on all chordless paths (p) from peer i to j. Formally: Definition 3.4.7. [Current Aggregate Mapping (CM)] The current aggregate mapping from peer i to peer j is defined as CMi j = (M(p)) p∈P(i, j) where P(i, j) is the set of chordless paths from i to j. Computing M(p) is an involved operation, e.g., deriving M(p) using mapping composition may result in infinite number of rules [44]. Because only the effectiveness of CM (i.e., S(CM)) is required, a selection scheme only estimates S(M(p)) instead of computing M(p) explicitly. Considering chordless paths is not a mandatory requirement for the acquaintance selection to work, however computing chordless paths instead of all paths usually yields faster and more accurate calculation of S(CM). This is because (1) the number of chordless paths is usually much smaller than all paths between two peers; and (2) the chordless paths are usually the real query translation paths in query answering; thus they reflect the actual effectiveness of query answering. We consider only chordless paths in later discussions on mapping effectiveness estimators. Our experimental study in Section 3.6.2 shows that the chordless path finding algorithm is quite fast. While in the worst case the number of chordless paths is exponential in the network size, the number is still much smaller than the total number of all paths. Now we are ready to describe the acquaintance selection criteria. In Section 3.4.3, we use the definition of the current aggregate mapping effectiveness but we defer describing how to estimate it to Section 3.5. 37 3.4.3 The acquaintance selection criteria As the example in Section 3.1 shows, a peer’s goal is to maximize its query answering gain for each acquaintance it chooses. One possible way to quantify the effectiveness gain is to compute the difference between the direct mapping effectiveness S(Mi j ) and the current aggregate mapping effectiveness S(CMi j ). Thus, a selection criteria can be defined as: Definition 3.4.8. [selection criteria] Peer i selects j as its acquaintance if and only if ∀ j = j, S(Mi j ) − S(CMi j ) ≥ S(Mi j ) − S(CMi j ), where S is the effectiveness measure defined in Definition 3.4.2 and CM is defined in Definition 3.4.7. In a selection process, candidates are ranked by the selection criteria. For the above selection criteria, a host peer i needs to estimate S(Mi j ) and S(CMi j ). For example, peer i wants to select an acquaintance from j and j and it estimated S(Mi j ) = 0.8, S(CMi j ) = 0.6, S(Mi j ) = 0.5, S(CMi j ) = 0.1. Although S(Mi j ) is higher, the acquaintance selected is j because the effectiveness gain computed by the selection criteria (0.4 for j , 0.2 for j) is higher. Note that the above selection criteria is not the only one that can be used. For example, a direct mapping usually shortens the number of translations needed for a query to reach destination peers. If this is important, then the selection criteria will favor candidates that yield shorter query translation paths. Although selection criteria may vary, we believe that S(M) and S(CM) are two very fundamental and important factors to consider and most selection criteria should include them. Therefore, efficient estimation of S(M) and S(CM) is required. The two acquaintance selection schemes — “one-shot” and “two-hop” — share a common estimator for S(CM) and use different ways to estimate S(M). 3.5 Acquaintance Selection Schemes This section describes the one-shot and the two-hop selection schemes in detail. In Section 3.5.1, we describe how to compute chordless paths and in Section 3.5.2 we describe the estimator for S(CM), which is used by both schemes. The oneshot scheme is then described in Section 3.5.3, and the two-hop scheme follows in Section 3.5.4. 38 3.5.1 Chordless paths for all candidates By the definition of CM (Definition 3.4.7), computing the effectiveness S(CMi· ) for all peer i’s candidates requires that i first discovers all the chordless paths (Definition 3.4.6) from itself to the candidate peers. We formalize this problem as follows: Definition 3.5.1. [Chordless path set discovery problem] Given a (directed) graph G(V, E) where V and E represent vertex and edge sets respectively, and a source vertex s ∈ V , the chordless path set discovery problem computes the chordless path set P(s, j) for all j ∈ V, j = s. We developed an algorithm for the chordless path set discovery problem. The main result is presented in Theorem 3.5.1. Along with the corollaries, Theorem 3.5.1 shows that chordless paths can be efficiently computed and incrementally maintained. We first describe the algorithm. The chordless path (CP) finding algorithm We now present the “CP finding algorithm” that solves the chordless path set discovery problem. The algorithm calculates the chordless path set by adding each vertex v into a vertex set S and processing the vertices. The algorithm terminates when all vertices are added into S. We will show how the chordless path sets can be updated during this process. We arbitrarily choose one vertex v from V and add it to S. If (s, v) ∈ E, then we store a path (s, v) on v. Suppose at some stage, a number of vertices have been added and the chordless paths from s to all v ∈ S, involving only vertices in S, are computed and stored on v. Now we show that we can add in a new vertex k ∈ V − S to S and correctly update P(s, v). We first compute the chordless paths from s to k that only involve vertices from S ∪ {k}. We call these paths “P(s, k) w.r.t. S ∪ {k}”. Let VkS− = {v|(v, k) ∈ E, v ∈ S}, the vertices in S that have an edge with k. It is apparent that P(s, k) w.r.t S ∪ {k} is a subset of v∈VkS− (P(s, v) (v, k)), where is an operator that appends edge (v, k) to every path in P(s, v) to form a path set from s to k. Now observe that if there exist two vertices u, v ∈ VkS− where u appears in some path p ∈ P(s, v), then {p} (v, k) is not a chordless path because edge (u, k) is a shortcut. Similarly, for v ∈ VkS− and p ∈ P(s, v) if u ∈ VkS− and u ∈ p, then the new path {p} (v, k) is a chordless path, i.e. it is in P(s, k) w.r.t. S ∪ {k}. 39 To quickly determine if a path can be extended to a new chordless path, we store with each p in P(s, v) a bit vector f p of length |V |. Let f p [i] = 0 indicate that vertex i cannot be extended to i on this path. We initialize each f p [i] to 1 to indicate that it may be able to be extended. Each time a path p ∈ P(s, v) is to be extended to some vertex k, v will first check f p [k]. If f p [k] = 1 then path p = p (v, k) is added to P(s, k) and f p is set as following. First p cannot extend to any vertex that p cannot extend to, so f p [i] = 0 if f p [i] = 0. Second, p cannot be extended to v or any of v’s neighbors, so f p [i] = 0∀i ∈ Vv+ = {t|(v,t) ∈ E,t ∈ V } and f p [v] = 0. Now we must update the existing chordless path sets P(s, v) w.r.t. S so that after updating, we get chordless path sets P(s, v) w.r.t S ∪ {k} for all v ∈ S. All existing paths in P(s, v) are still valid chordless paths because none involve k; therefore, k does not create any shortcuts. We only need to consider paths that are extended from P(s, k) w.r.t S ∪ {k}, which we have just obtained. This can be done easily by performing the same test on f p that we performed before for p ∈ P(s, k) for v ∈ VkS+ , where VkS+ = {v|(k, v) ∈ E, v ∈ S}. Also, when a new path is added, the corresponding f p vector is set. Figure 3.3 shows the procedure of adding k to S. t t u fp1[k]=1 S u s s w v v fp2[k]=0 p1=(s,t,u) stored on u p2={s,t,u,v) stored on v compute P(s,k) w.r.t. SU{k} k w S fp3[k]=0 k p3={s,t,u,k) stored on k update P(s,w) w.r.t. SU{k} Figure 3.3: The process of adding in a new vertex k to the partial set S (shown by the dashed box) and computing P(s, i) w.r.t S ∪{k} for all i in S ∪{k}. Theorem 3.5.1. The CP finding algorithm solves the chordless path set discovery problem in O(h) time for a (directed) graph G with |E| = O(|V |), where h is the total number of chordless paths, assuming that selecting a path with f p [k] = 1 for a chordless path set P(s, v) takes constant time. Proof. Computing P(s, k) for a newly added vertex k requires only finding extend40 S . The number of such P(s, v) is a constant amortized able paths in P(s, v) for v ∈ Vk− across all vertices, given that |E| = O(|V |). Adding a new path to P(s, k) requires two operations. First it requires extending a valid path, which takes constant time. Setting f p for this new path involves finding one vertex’s neighbors, which is also a constant under the |E| = O(|V |) assumption. Therefore, all paths can be computed in O(h) time, where h is the total number of chordless paths. The following corollaries (Corollary 3.5.2, Corollary 3.5.3 and Corollary 3.5.4) ensure that S(CM) can be computed efficiently and can be incrementally maintained. Corollary 3.5.2. The paths involved in CM for all candidates can be computed in O(h) time, where h is the total number of chordless paths. Corollary 3.5.3. When Graph G is extended by adding a new vertex v and a number of new edges connecting v with other vertices, then the P set can be updated in O(hnew ) time where hnew is the number of newly formed chordless paths. Corollary 3.5.4. When adding a new peer and its mappings, CM for all candidates can be updated in O(hnew ) time where hnew is the number of newly formed chordless paths. Chordless paths represent possible query translation paths. For acquaintance selection, we only need to estimate the effectiveness of the current aggregate mapping (Definition 3.4.7). Depending on the effectiveness measure used, it may not be necessary to enumerate all chordless paths (which, in worst case, is of order O(2|V | )). The algorithm can be modified to represent chordless paths between two peers in a compact way. The savings are quantified in the following theorem: Theorem 3.5.5. If the effectiveness of path p e, S(M(p e)), can be computed from S(M(p)) and S(M(e)) in constant time, then S(CMi j ) for peer i and another peer j can be computed in O(|E|) time. The proof to Theorem 3.5.5 is straight-forward from Theorem 3.5.1. Next we present an algorithm which modifies the original CP-finding algorithm to achieve the savings and proves Theorem 3.5.5. The basic idea is to group 41 chordless paths ending at each node to a constant number (when |E| = O(|V |)) of, or O(|V |) (when |E| = O(|V |2 )) groups so that either all the paths in a group can be extend to a new node forming a new group of chordless paths or none of them in the group can be extended to the node. Therefore, we avoid examining each chordless path individually and instead examine paths in groups. While the complexity in Theorem 3.5.1 is essentially the lower-bound when each path needs to be enumerated explicitly to compute effectiveness, the “group representation” in Theorem 3.5.5 can significantly reduce the complexity of computing S(CM). Let B(i, j) = {p|p ∈ P(vi ), vt ∈ p s.t. (vt , v j ) ∈ E} i.e., B(i, j) is the set of chordless paths at vi that can extend to v j . Following the notation above, the chordless path set on vk is P(vk ) = ∪vi ∈Vk− B(i, k). where P(v) is the union of all chordless paths extendable to v. The size of P(vk ) is exponential to |V | in the worst case so computing B(k, l) (for l ∈ Vk+ ) from P(vk ) can be expensive. Fortunately, B(k, l) can be computed by B(k, l) = ∪vi ∈Vk− ,vi ∈V / l − (B(i, k) ∩ B(i, l)) The union of B(i, k) and B( j, k) (i = j) is easy as they contain paths ending on different nodes thus B(i, k) ∩ B( j, k) = 0. / Additional information is maintained on each node. For a node vi , it maintains B(i, j) for j ∈ Vi+ and also B(i, k) for k ∈ ∪ j∈Vi+ V j+ . Assume on average each node has a constant number (C) of acquaintances, then each node maintains C + C2 such B(i, j) sets, and therefore the overall cost is O(|V |). The max-min approximation measure we introduce below in Section 3.5.2 takes advantage of the improved CP finding algorithm because SCMmax−min (p i) depends only on the value of SCMmax−min (p) and the quality of the mappings leading to node i, therefore explicit enumeration of chordless paths to node i is not necessary. Therefore, SCM values for all nodes can be computed in O(|V |) time. 42 3.5.2 Max-min approximation for S(CM) It is usually hard to accurately measure how well queries can be answered after being translated along a path of schema mappings. To do so requires performing the mapping composition operation which is, in general, computationally hard. To enable fast estimate of the “current condition” of a peer in the system, we use a “max-min” approximation to estimate the effectiveness of a current aggregate mapping (S(CM)). It assumes that the number of attributes mapped in a CMi j , is the maximal number of mapped attributes in all equivalent mappings over all chordless paths from peer i to j, and the number of attributes mapped in such an equivalent mapping, is the minimal number of mapped attributes in all the mappings (the edges) on this path. Following the MF measure (Definition 3.4.2), let xip j denote the number of attributes in Mi, j (p), then S(CMi j ), is max-min approximated as S(CMi j ) = max xip j /|sch(i)| p∈P(i, j) where xip j is estimated along a chordless path p using xip j = min xa b (a,b)∈p where xa b is the number of mapped attributes (Definition 3.4.2). For example, S(CMs t ) in Figure 3.2 is estimated using the three chordless paths p1 , p2 and p3 . While other (more) effective methods may also be applied, the max-min approximation captures the fact that an ineffective mapping in a path may greatly harm the query answering ability of the whole path. This max-min approximation is sufficient for the purpose of evaluating the performance of the proposed selection schemes without depending on further information availability assumptions; it requires a minimum amount of information be transmitted across the peer network. Also, the max-min approximation can be quickly computed inline with the chordless path finding algorithm. Next, we describe in detail the two selection schemes and their direct mapping effectiveness (S(Mi j )) estimators. 43 3.5.3 The one-shot selection scheme We first present the “one-shot scheme”. As with all selection schemes, the one-shot scheme is constructed by the framework operations (Section 3.4). In the one-shot scheme, the host peer first probes the PDMS to collect mapping effectiveness information for existing pairwise mappings. The one-shot selection scheme uses a direct mapping effectiveness estimator called the one-shot estimator in the estimate step. It works by classifying peers in the PDMS into a number of clusters using the information from established mappings. Direct mapping effectiveness estimations are then made using the properties of the discovered clusters. This algorithm identifies valuable candidates in one pass over the collected information and thus is named “one-shot”. Assume that the PDMS has N peers to classify into C clusters where C is a pre-determined constant. The estimator clusters such that the pairwise mapping effectiveness between two peers in the same cluster is high, while that between peers from different clusters is low. The clustering then is used to predict unknown mapping effectiveness. The challenge of the one-shot estimator is to discover the best cluster assignments using the limited established mapping effectiveness information from the PDMS. Model of the one-shot estimator Peers: Z as their cluster assignments Z1 Z2 Z3 11 00 11 00 111 00 000 00 11 11 000 111 X12 X 13 observed X 24 Z4 effectiveness ... X14 X23 unobserved Figure 3.4: Variables and dependencies in one-shot estimator. The variables and their dependencies in the one-shot estimator are shown in the graphical model in Figure 3.4. The upper row of nodes consists of the peers 44 with independent random variables Zi representing peers’ cluster assignments. The second row shows the pairwise mapping between every two peers; Xi j represents the mapping effectiveness determined by the pair of peers in the mapping. Some mappings are already established, so their effectiveness is observed by the host peer during operation probe (e.g., X12 , X13 , X24 ). Other mappings are not established, and thus their effectiveness must be estimated (e.g., X14 and X23 ). The model assumes that the mapping effectiveness between two peers, given their cluster assignments Zi = a and Z j = b, follows Gaussian distribution with hidden pa2 ). Depending on the number of clusters, C, a set of |C|×|C| rameters θab = (µab , δab Gaussian distributions is used to characterize the clusters. The estimator infers both the cluster assignments Z = (Z1 , . . . , ZN ) and the Gaussian parameters θ for all C clusters. It searches for the MAP estimates (Section 3.2.2) of Z and θ . Formally: (Z, θ )MAP = arg max Pr(X|Z, θ )Pr(Z) Z,θ Since θ takes continuous values and Z is discrete, it is hard to optimize them together. To resolve this, we conduct a two-phase optimization using a modified EM (Section 3.2.3): we first pick a θ and find the MAP estimates of Z under this fixed θ ; then we optimize θ with the just discovered Z. The two steps iterate until Z and θ converge. After Z and θ have converged, the one-shot estimator estimates the direct mapping effectiveness between the host peer i and a candidate peer j. It returns the mean µZi Z j as the mapping effectiveness estimation. Next, we describe how the MAP estimates of Z and θ can be efficiently computed. We assume Gaussian distribution of the mapping effectiveness between peers within the same cluster and between peers in different clusters. The justification for choosing Gaussian distribution is as follows. First, Gaussian distribution is commonly used in inferences and it brings convenience in parameterizing and analysis. Second, the peers studied here are assumed to be independent. Considering each effectiveness measure between two peers as a random variable (of any distribution, with finite mean and variance), by the central limit theorem, the group behavior of peers in a cluster will approximate a Gaussian distribution and the approximation 45 is closer when the number of peers in the clusters grows large. The Gaussian distribution is not a hard constraint to the one-shot estimator. As long as there is a distinguishable difference between the mapping effectiveness for peers in the same cluster and the mapping effectiveness for peers belonging to different clusters, the one-shot estimator is able to detect it and discover the true cluster properties. In our experiments, we present a set of results where the distribution is exponential rather than Gaussian; the results show that we are still able to correctly cluster peers. Local search on Z for MAP estimates In the first phase (E step), we search for Zˆ MAP with a fixed θ . The function to optimize, as described above, is f (Z) = P(X|Z, θ )P(Z) and the MAP estimates for Z, with θ fixed, is Zˆ MAP = arg maxZ f (Z) where X is the set of observed mapping effectiveness. By the independence of Zi , f (Z) can be factored; we study log f (Z) (they lead to the same MAP estimates): log f (Z) = ∑(log N (xi, j |θZi ,Z j ) + log P(Zi )P(Z j )). i, j for all (i, j) with xi, j ∈ X observed. Note that θ is fixed in this phase; the only variable is Z. Let Z [n] ,P[n] (Z) and θ [n] denote the Z, P(Z) and θ in iteration n. The above formula can be transformed to an update function w.r.t. the iterations. Z [n+1] = arg max log f (Z) Z [n] = arg max ∑(log N (xi, j |θZi ,Z j ) + log P[n] (Zi )P[n] (Z j )) Z with P[n+1] (t) = ized as P[1] (t) = Nt N 1 C i, j [n] where Nt is the number of Zi ’s that are valued t and is initialfor all t ∈ 1..C. 46 Finding Zˆ MAP (i.e., the cluster assignment) requires exhaustively searching all possible assignments; this is inefficient for a search space as big as CN . We conduct a segmented local search (SEG) to speed up this procedure. Figure 3.5 shows an example of segmented local search. Z(s1)_1 111111 000000 0000000 1111111 ... 111111 000000 ... 111111 000000 s1 Z(s1)_1 0 0 s2 0 1 2 1 Z(s1)_2 0 0 2 1 2 1 K K ... 0000000 1111111 ... 0000000 1111111 ... 1111111 0000000 ... 1111111 0000000 ... 111111 000000 ... 000000 111111 ... 000000 111111 ... 000000 111111 0 0 0 0 0 0 0 0 0 0 0 0 T(1)=2 0 0 2 0 0 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 0 0 2 0 0 2 1 2 1 1 2 1 Z(s1)_2 ... ... Figure 3.5: An example for two steps in a SEG: Active segments are shadowed, white ones are to be searched in the next step, and carried over segments from previous step are marked black. The length of a segment is K. SEG first searches all CK assignments of Z(s1 ). Only the T (1) = 2 “best” assignments are carried over to the next step where each branch searches s2 for Z(s1 ∪ s2 ) with carried over Z(s1 ). The segmented local search algorithm is initialized with parameter K as the segment length. It breaks Z into t = N/K segments s1 ..st and then performs local search segment by segment. Let Z(si ) denote the assignment of si . Starting with s1 , SEG searches each segment using exhaustive search, i.e., it tries all CK assignments for the segment. For si , SEG computes f (Z(∪ j≤i (s j ))) with assignments on ∪ j<i (s j ) fixed on one of the carried over assignments from the search on si−1 . For each si , which has search space S(i) = CK ∗ T (i − 1), it keeps the best T (i) = log S(i) assignments to carry over to the search on segment si+1 . Additionally, we define T (0) = 1. The segmented local search finishes when all segments are processed. The assignment that yields the maximal f (Z) is used in the second phase (M step) to optimize θ , as described in the following section. 47 Computing MAP estimates for θ We use the Z assignment found in the E step to compute θˆMAP . The goal remains to maximize P(X|Z, θ )P(Z). The updating function for θ can be written as θ [n+1] = [n] [n] arg maxθ ∑i, j log(P(xi j |θZ [n] ,Z [n] )) + log P[n] (Zi )P[n] (Z j ). i j Note that in this phase, Z is fixed and the only variable is θ . Solving (taking the derivatives and setting it equal to 0) under the assumption of Gaussian distribution, P(xi j |θZ [n] ,Z [n] ) ∼ N (µZ [n] ,Z [n] , δZ [n] ,Z [n] ) i j i j i j results in [n+1] µ p,q = 1 xi, j ∑ [n] Np,q Z [n] =p,Z [n] =q i 2[n+1] δ p,q [n] = 1 j ∑ [n+1] (xi, j − µ p,q )2 [n] Np,q Z [n] =p,Z [n] =q i j [n] [n] where Np,q is the number of xi, j ’s with Zi = p, Z j = q. In other words, µ and δ 2 are updated using the empirical mean and standard 2 ) is set deviation. If one cluster receives too few peers, the corresponding (µ p,q , δ p,q to a prior value, e.g., (0.7, 0.03) if p = q and (0.3, 0.03) otherwise. Convergence of iterations The one-shot estimator iteratively optimizes Z and θ until they converge. In optimizing both Z and θ , f (Z) is guaranteed to monotonically increase. Because f (Z) is bounded, the iterations always converge by requiring that the increment of f (Z) is smaller than some threshold for a number of consecutive iterations. Reordering peers for segmented local search It is easy to observe that the assignments for the first several segments are very important. As the quality of the inference is largely determined by the amount of available information (i.e., the number of existing mappings) for the segments being processed, the peers can be reordered so that as much information as possible can be used in the early stages of segmented local search. Figure 3.6 shows an ex48 ample of this reordering for 20 peers, numbered 1 to 20. The established mappings are shown as X’s. The X’s in corresponding boxes are what the segmented local search can use for segments s1 to s4 . s1 1 2 X 3 4 5 6 X 7 8 9 10 11 X 12 13 14 15 16 X 17 18 19 20 s2 s3 X X s4 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 20 5 12 8 19 2 13 18 7 3 14 4 10 17 9 11 1 6 15 16 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X (a) before reordering X X X X X X X X X X X X X X X X (b) after reordering Figure 3.6: An example of reordering 20 peers We use a fast greedy algorithm to re-order peers as follows. Starting with an ordered set, S, having one peer with the maximal number of established mappings in the PDMS, we add in each step a peer that maximizes the total number of observed mappings among peers in S. This process is repeated until all peers are added into S. This re-ordering can be done in O(N 2 ) time. Figure 3.6(b) shows that more information can be used by the local search in early segments after reordering. Initialization strategy The one-shot estimator starts with an initial θ parameter for the Gaussian distribution (Section 3.5.3). A carefully initialized θ matrix can bring extra benefit to the estimator based on the following two aspects. We observe that if we use neutral clusters, we may further reduce the search space for the first segment without any quality loss. Here neutral means the |C| clusters are initialized with same parameters which means the “name” of the cluster does not affect the cluster assignment in our problem. A θ matrix corresponding 49 to such a setting can be written as θ|C|×|C| = c1 I|C| + c2 (1 − I|C| ) where I is the identity matrix and c1 , c2 are two (µ, δ 2 ) pairs with c1 .µ > c2 .µ. For example, (0.7, 0.02) (0.3, 0.04) (0.3, 0.04) (0.3, 0.04) (0.7, 0.02) (0.3, 0.04) (0.3, 0.04) (0.3, 0.04) (0.7, 0.02) is a “neutral” cluster setting where c1 (µ, δ 2 ) = (0.7, 0.02) and c2 = (0.3, 0.04). Given that θ is initialized this way, changing the label of clusters will not affect the likelihood value. For example, given 6 peers in 3 clusters, an assignment (0, 0, 1, 1, 2, 2) is equivalent to the assignment (1, 1, 0, 0, 2, 2) and (1, 1, 2, 2, 0, 0) — all representing the same clustering. Thus, we should avoid testing this kind of equivalent assignments. Definition 1 (Equivalent Assignments). Two assignments Z1 and Z2 are considered as equivalent under parameter θ if the following conditions are satisfied. 1. There exists a mapping P(x) : C → C that ∀i, P(Z1 [i]) = Z2 [i] 2. For all P(x) = y, θ [x][k] = θ [y][k]∀k = 1..|C| Avoiding testing equivalent assignments can result in substantial savings: for |C| = 3 and K = 5, in the 35 = 243 assignments the number of non-equivalent assignments is only 41, an approximately 1/C! savings. 1 When testing assignments, we can avoid traveling through those equivalent assignments by using the automata shown in Figure 3.7. Analysis of the one-shot estimator The complexity of each iteration of the one-shot estimator comes from the two optimization procedures. While it is easy to see that optimizing θ is O(N), the 1 The [iK |C| exact number of non-equivalent assignments is calculated by T = ∑i=1 P(i), where P(i) = and P(1) = 1. i − ∑i−1 j=1 ( j )P( j)]/i! 50 1 1 1 2 1 3 2 4 3 2 1 2 |C| 2 |C| 3 |C| Figure 3.7: The automata that generates non-equivalent assignment series. following analysis shows that searching for Zˆ MAP is also O(N). The time complexity on searching for Zˆ MAP can be broken into two parts. One is the cost of searching through the assignments and the other is the cost of computing f (Z) for each (partial) assignment. We have the following lemma for segmented local search (SEG). Lemma 3.5.1. Given N peers to classify into C clusters, if the segment length is set to K, The total number of assignments that SEG tests is O(CK N logC). Proof. The number of assignments of each segment of length K is P = CK . Let S(i) denote the search space when the searching is on segment i and T (i) be the number of assignments that get carried to the next segment. Then we have S(1) = P, T (1) = log P and S(t) = P · T (t − 1), T (t − 1) = log S(t − 1) for t > 1. Hence, S(t) = P log S(t − 1) = P log P + P log log(S(t − 2)) expanding S(t) one more step we have S(t) = P log P + P log[log P + log(log S(t − 3))] because segments have uniform length, the term log(log S(t − 3)) can be approximated by log log(µCK ) < (1 + ε) log K, for small constants µ,ε. Note that we 51 always choose K > C, so, S(t) = P log P + P log[K logC + (1 + ε) log K] < CK K logC +CK log[K log K + (1 + ε) log K] = CK K logC +CK [log(K + 1 + ε) + log log K] < CK (K logC + (1 + ε ) log K) for a small constant ε . Sum this up for all N/K segments, we get an upper bound of the total complexity N/K ∑ S(i) = CK [logC + (1 + ε ) i=1 log K ]N K with logC > (1 + ε ) logKK , the total complexity goes to O(CK logC · N). The cost of computing the likelihood value for each assignment is affected by the number of mappings observed. In a typical PDMS setting where one peer does not map to a large number of other peers, we have the following theorem: Theorem 3.5.6. The complexity of the one-shot estimator for each EM iteration is O(BKCK N logC), where B is the maximal number of acquaintances a peer in the PDMS has; K is the number of segments in SEG, C is the number of clusters and N is the number of peers. Proof. When searching in a segment of length K, the observed mapping effectiveness for each segment of length K is bounded by O(BK), therefore, the cost of computing the likelihood for an assignment is O(BK). With Lemma 3.5.1, the total complexity of the one-shot estimator is O(BKCK logC · N). While there is no theoretical guarantee on the number of iterations after which the EM process will converge, in our empirical study (Section 3.6), we observed that most runs converged in a small number of iterations. There are three ways to control the number of iterations in practice. First is to set a convergence threshold. If the improvement between iterations is smaller than the threshold for a predetermined (e.g., 10) iterations, then we stop and regard the process as having converged. The second way is to set a hard “cut-off” value and return the currently 52 best result after a pre-determined number of iterations or time. A third way is to apply a random restart strategy for EM. After the improvement speed has slowed down below a pre-determined threshold, we restart the EM process with another initial setting. Random restarting is a commonly used strategy in local search and in many cases could help to improve a (previously converged) local minima [43]. The number of clusters (C) also impacts the estimation both on the speed and the amount of observed information required for accurate estimation. Generally, the larger C we have, the longer time the estimator needs because of the enlarged search space CN . Our empirical study shows that for a larger C, the EM process also takes more iterations to converge. Meanwhile, when C increases, the required number of observed mappings increases significantly in order to keep the clustering accurate (see Table 3.3 in Section 3.6). We find that 4 clusters is a good balance point: peers generally come from 4 clusters with in-cluster mapping effectiveness higher and cross-cluster mapping effectiveness relatively lower; for this case, only 15 (2.5% of 600 peers) observed acquaintances are needed for an accurate estimation (misclassification ratio < 5% or < 30 peers out of 600 peers are not located in the best clusters). An immediate question is how many different clusters are there in a PDMS in practice? Our current experience is “not many” when we look at several growing schema sets. The one-shot selection scheme relies on the one-shot estimator to estimate the direct mapping effectiveness between the host peer and a candidate acquaintance. It uses the max-min estimator (Section 3.5.2) to estimate the current aggregate mapping effectiveness for a candidate and computes criteria values as described in Section 3.4.3. All candidates are then ranked by the criteria values, which represent the benefit of choosing the candidate as an acquaintance. Again we point out that the host peer will not always choose its new acquaintances from within its own cluster; both estimates affect the final decision. To summarize, the one-shot acquaintance selection scheme manipulates existing pairwise mapping effectiveness information to cluster peers and further infer the unobserved mapping effectiveness. Combined with the current mapping effectiveness estimation, this guides acquaintance selection. In Section 3.6 we empirically validate how much information needs to be gathered for the one-shot estimator to perform well. The empirical study also validates our theoretical analysis that 53 the estimation algorithm scales well in the PDMS size. Since the one-shot estimator requires prior knowledge of the number of clusters, this scheme is inapplicable for scenarios in which this information is unavailable. For such scenarios, we have developed another selection scheme that does not rely on this prior knowledge: the two-hop selection scheme. 3.5.4 The two-hop selection scheme This section describes a novel two-hop acquaintance selection scheme. The twohop selection scheme uses a new two-hop direct mapping estimator, which differs from the previous one-shot scheme in the following aspects: 1. The two-hop estimator does not need to know the number of clusters that the peers potentially form. Thus, the scheme is applicable when that information is unavailable. 2. Under the two-hop scheme, a peer explores the network in multiple rounds, requiring fewer messages to be transmitted during each probe step. 3. New peers joining the network and existing peers leaving the network do not affect other peers’ selection procedures, which makes the two-hop selection scheme suitable for PDMSs that experience frequent peer updates. 4. In addition to using established mapping effectiveness in the PDMS, the twohop scheme uses other peers’ estimations with the heuristics from this section. 5. The two hop scheme makes use of mapping effectiveness information updated between probes. It can adjust estimation dynamically and supports more flexible acquaintance selection demands (e.g., wait until a candidate with good enough benefit appears). Before detailing the selection operations, we first describe the new direct mapping effectiveness estimator — the “two-hop estimator”. Instead of collecting only pairwise mapping effectiveness as in the one-shot estimator, the two-hop estimator focuses on mapping paths of length 2, thus receiving the name “two-hop”. The two-hop estimator in detail We start with the definition of a two-hop path. 54 Definition 3.5.2. [two-hop path] A path (i, k, j) is a two-hop path from i to j if and only if the mapping effectiveness or mapping effectiveness estimation of both mappings (i, k) and (k, j), for some other peer k, are known to peer i. To estimate the direct mapping effectiveness between peer i and peer j, the two-hop estimator computes all two-hop paths from i to j. It then estimates the direct mapping effectiveness S(Mi j ), where S is the effectiveness measure in Definition 3.4.2. Let H denote the set of two-hop paths. For each two-hop path p(i, k, j) ∈ H, the two-hop estimator first estimates the direct mapping effectiveness on path p, denoted as S p (Mi j ) using its expectation. Formally: Sest p (Mi j ) = E(S p (Mi j )) (3.5.1) 1 = 0 tP(t|S(Mi k ), S(Mk j ))dt where P(t|S(Mi k ), S(Mk j )) is computed using a pre-trained model (PTM). The PTM is an array in which each element PT M[a][b][c] represents the number of instances observed for a two-hop path (i, k, j) with S(Mi k ) = a, S(Mk j ) = b and S(Mi j ) = c. Using the PTM, this conditional probability P can be approximated using the sample probability P∗ : P(t|S(Mi k ), S(Mk j )) = P∗ (t|S(Mi k ), S(Mk j )) = PT M[S(Mi k )][S(Mk j )][t] 1 0 PT M[S(Mi k )][S(Mk j )][t]dt (3.5.2) In the implementation of the two-hop estimator, the range of the mapping effectiveness measure is partitioned into T equal width intervals. Elements in the PTM are indexed by integers i ∈ 0..T − 1. An entry PT M[i] on one of its dimensions, covers mapping effectiveness valued in interval [i/T, (i + 1)/T ). Therefore, Equation 3.5.2 is re-written into P(t|S(Mi k ), S(Mk j )) = 55 PT M[a ][b ][c ] ][b ][u] −1 PT M[a ∑Tu=0 (3.5.3) where c = T · t and Equation 3.5.1 into Sest p (Mi j ) = 1 2T T −1 ∑ s=0 (2s + 1) · PT M[a ][b ][s] T −1 PT M[a ][b ][u] ∑u=0 (3.5.4) where a = T · S(Mi k ) , b = T · S(Mk j ) for Equation 3.5.3 and Equation 3.5.4 and p = p(i, k, j). The two-hop estimator aggregates the estimates from the mapping effectiveness on all two-hop paths in H and returns the final estimation of direct mapping effectiveness: Sest (Mi j ). A simple aggregation2 takes the mean over all two-hop paths: Sest (Mi j ) = 1 ∑ Sest p (Mi j ) |H| p∈H (3.5.5) where Sest p (Mi j ) is as computed in Equation 3.5.4. The estimator focuses on paths of length two rather than on longer length paths for several reasons: first, the accuracy is likely better with shorter paths. Second, it is more efficient than trying longer length paths — e.g., when the dimensionality of PTM increases, many more peers must be trained. As we show in Section 3.6.4, it performs well in practice. We leave trying an increased number of hops as future work. Unlike in the one-shot estimator, in this two-hop scheme, instead of clustering peers, we estimate the mapping effectiveness directly. This removes the requirement of knowing the number of “clusters” the peers form and furthermore does not rely on iterations like what we have for the EM algorithm in the one-shot scheme. Freedom from knowing the prior information and avoiding iterative operations do not come with no cost. The two-hop scheme requires examining more 2-hop paths and applying some heuristics in order to obtain good estimation accuracy. Fortunately, for medium to large sized PDMSs, we are likely to observe enough 2-hop mapping paths; we describe heuristics in the later part of this section. We now look at the two-hop selection scheme in detail. 2 An improved aggregation is discussed in Section 3.5.4. 56 Selection operations in the two-hop scheme Peers using the two-hop selection scheme need multiple rounds to discover all peers in the PDMS. The host peer discovers new peers in the probe step for each round. Each probed peer returns three types of information to the host peer: I. the existence of new peers, II. the mapping effectiveness of newly established mappings, III. the (updated) direct mapping effectiveness estimation. The host peer uses information of type II to update the PTM. The newly discovered peers are added to the set of acquaintance candidates. After the host peer has finished the probe phase, it goes into the estimate phase. First the host peer uses the two-hop estimator to re-evaluate direct mapping effectiveness between itself and all peers in its candidate list except those which were just added in the probe phase. This is to ensure that new estimates are always based on the most up-to-date information. After all mapping effectiveness estimations for old candidates is updated, the host peer estimates the directly mapping effectiveness for newly added peers. In the estimate step, S(CM), the current aggregate mapping effectiveness (Definition 3.4.7) is computed using the technique in Section 3.5.2. The difference from the one-shot scheme is that S(CM) is only computed for the candidates discovered by the host peer instead of all peers in the network. After both S(M) and S(CM) are computed, the host peer picks its acquaintances using a selection criteria, for example the one described in Section 3.4.3. Because the two-hop selection scheme disseminates and updates information in multiple rounds, the bookkeeping operation needs to remember if a peer has already received updated information, so that the same messages will not be sent to a peer twice. This can be done by time-stamping the messages and recording for each peer the latest time it was updated. Unlike in the one-shot estimator, a host peer using the two-hop scheme needs multiple steps to discover the whole network. A natural concern is how long it takes for a peer to discover a potential acquaintance. The following theorem guarantees that it does not take long for a peer to discover the whole P2P network. Theorem 3.5.7. For a network which each peer has on average K(K ≥ 3) random outbound mappings, it takes on average O(log log N) steps for a peer to discover all peers in the network using the two-hop scheme, where N is the number of peers 57 in the network. Proof. This result follows from results reported in [41] which states that for the assumed network topology, the expected distance between two vertices in a connected network of size N is O(ln N). Additionally, the mean square deviation is small, so the actual distance will not deviate much from the expectation. Now suppose path p is the shortest path from i to j and its length is p = O(log N). Because all peers conduct their “probe” operation simultaneously, this length is halved after each round of probing. So it takes O(log log N) rounds for one peer to discover the existence of another peer — if there exists a path between the two peers. Before the two-hop scheme can be used on a PDMS, the PTM must be trained. It is trained on a distribution of cord effectiveness given the two mappings (hops) on which the cord is defined. The training process can be performed using twohop selection in a training peer set with mapping effectiveness fully observed or the PTM initialized to follow a prior distribution. If the training data’s distribution deviates from the distribution of the running PDMS, the estimation from two-hop estimator will be inaccurate. However, the two-hop scheme keeps updating the pre-trained model so that the PTM is updated using the observations reflecting the true distribution of the running PDMS. In other words, it actively learns to improve the model stored in the PTM. In our empirical study in Section 3.6, we purposely initialize the two-hop scheme with a pre-trained model that is different from the test data. In practice, if the host peer has a strong prior knowledge of the PDMS, it will initialize the PTM with a big training data set so that updates from the running PDMS do not change much of the model. On the other hand, training the PTM using a small training data set allows it to adopt to a new (observed) distribution more quickly. Next, we describe some useful heuristics that improve the accuracy of the twohop estimator. Heuristics to improve the two-hop estimator There are several opportunities to improve the basic two-hop estimator. First, we observe that simply taking the mean of all estimations from two-hop paths may not 58 be the best strategy to obtain a good estimation of S(Mi j ). Consider the example in Figure 3.8, where the host peer i wishes to estimate the direct mapping quality 0.3 0.4 i a 0.3 0.9 0.2 j b 0.9 0.9 c Figure 3.8: A motivating example for heuristics in two-hop estimation: labeled circles denote peers, solid links denote established mappings, and dotted links denote the estimation of mapping quality obtained. The number beside each link denotes the (estimated) mapping quality. to candidate j using the two-hop estimator. There are three two-hop paths to j in this example. Mappings (i, a), (i, c) and (c, j) are established mappings; therefore, the effectiveness can be directly computed using an effectiveness measure (e.g., the MF measure in Definition 3.4.2). The effectiveness of the other mappings {(a, j), (i, b) and (b, j)} comes from previous estimates: e.g., the 0.4 effectiveness of (i, b) is obtained by examining all two hop paths connecting node i and node b. Our first heuristic is to assign more weight on the established mappings than the estimated mappings: Heuristic 1: We assign different weights to paths that are formed by established mappings, estimated mapping quality, and the combination of the two. A two-hop path formed by two established mappings is given the highest weight, a path with both edges mapping quality estimation is given the lowest weight. The weight for a path that has one established mapping and one estimated mapping lies between. Let w p denote the weight for a path, then the aggregate Equation 3.5.5 is re-written using the weighted mean as Sest (Mi j ) = ∑ p∈H w p Sest p (Mi j ) ∑ p∈H w p 59 We tried different ways of assigning weights; the results suggest the weights be assigned as follows. An established mapping gets weight 2 and an estimated mapping gets weight 1. Then the two hop path weigh is the product of the mapping weights. That is, a 2-hop path with both established mappings gets weight 4; the one with both estimated mappings gets weight 1 and a “mixed” path gets weight 2. Next, we observe that the current aggregate mapping quality gives valuable information on the quality of the direct mapping. Under the assumption that a direct mapping will likely enable better query translations (Section 3.4.2), the quality of the current aggregate mapping can be used as a lower bound of the direct mapping quality estimation. This results in our second heuristic for the two-hop estimator: Heuristic 2: During acquaintance selection, the two-hop selection scheme first estimates the quality of the current aggregated mapping S(CMi j ) for host peer i and candidate peer j. Then the two-hop estimator uses S(CMi j ) as a lower bound in estimating the direct mapping quality S(Mi j ). I.e., instead of computing the probability as in Equation 3.5.2, we compute the probability conditioned on S(CMi j ) as follows: P(t|S(Mi k ), S(Mk j ), S(Mi j ) ≥ S(CMi j )) 1KB[S(Mi k )][S(Mk j )][t] , t ≥ S(CMi j ); u KB[S(Mi k )][S(Mk j )][t]dt = 0, otherwise. where u = S(CMi j ). Replacing P in Equation 3.5.2 with this conditional probability yields a more accurate mean with higher confidence. The above heuristics help the two-hop estimator to achieve better results. We observed from experiments that when the pre-trained model approximates the testing data well, the two-hop estimator works equally well with or without the above heuristics. However, when the distribution of the testing data deviates from the pretrained model, the two heuristics help to more accurately update the PTM to fit the testing data. Specifically, heuristic 1 lowers the negative effect from inaccurately estimated mapping effectiveness and heuristic 2 helps to lower the variance of the estimated distribution. 60 3.5.5 Choosing between one-shot and two-hop The use of different direct mapping effectiveness estimators makes the two selection schemes suitable for different scenarios — depending on whether the host peer knows the number of potential clusters that peers form in the PDMS. When we have no such information, we can only use the two-hop scheme. Although the two-hop selection scheme can be applied virtually to any PDMS, deciding which of one-shot or two-hop to use is non-trivial. As we will show in Section 3.6, the one-shot estimator — which uses all established mappings in the PDMS to infer direct mapping effectiveness — does not need a large number of mapping qualities to be observed before it can perform well. Additionally a highly mismatched pre-trained model (PTM) can mislead the two-hop estimator until updates from the current PDMS change the distribution in PTM to one that positively helps estimation. Therefore, for such scenarios and scenarios where no pre-trained model is available, the one-shot selection scheme is preferred. Scalability is another issue to consider when choosing between one-shot and two-hop. In one-shot, the whole topology of the PDMS needs to be collected in the probe stage of acquaintance selection. This can be burdensome when the PDMS is large and can make the one-shot scheme impractical for very large PDMSs (e.g., those having thousands of peers). In this case, the two-hop scheme is preferred. State-of-art PDMSs do not contain more than several hundred peers. While a modern file sharing P2P network easily grows up to have several thousand peers, a PDMS does not scale this big simply because it is more difficult to create and maintain semantic mapping for query translation than network connections for just transmitting data. Therefore, the one-shot scheme still suits most PDMSs. In Section 3.5.3, we discussed the impact of the number of clusters. While two-hop does not rely on this prior knowledge, the heterogeneity of the peers still affects the two-hop estimation. Using Figure 3.8 as an example, the two-hop path i − c − j suggests that i and j could be mapped effectively. However, in the selection of acquaintances, if such path as i − c − j already exists, then (for i) picking j as acquaintance will not increase the gain of query answering — peer j is unlikely to be chosen by the selection criteria. Therefore, we favor high effectiveness mappings estimated from two-hop paths where both mapping effectiveness of the hops 61 are low. When the number of peer clusters increases, the chance that two peers in the same cluster are only reachable via peers in other clusters becomes low, making it more difficult for a two-hop process to discover valuable hidden mappings. While the estimation from the one-shot estimator is based on all established mapping effectiveness in the PDMS and is potentially more accurate, the twohop selection scheme has distinct advantages on aspects we have mentioned at the beginning of Section 3.5.4, and as we show in the next section, empirically performs very well. 3.6 Empirical Study This section evaluates the two acquaintance selection schemes in the previous sections. In particular, the following components are implemented: 1. The max-min estimator (CP-CM) used by both the one-shot and two-hop acquaintance selection schemes. (Section 3.5.1 and Section 3.5.2) 2. The one-shot selection scheme and one-shot estimator (OSME) (Section 3.5.3). 3. The two-hop selection scheme (Section 3.5.4) and two-hop estimator (THME) with heuristics. All algorithms are implemented in C++. The simulation platform is a PC running Windows XP with an Intel core2 duo 2.4G CPU and 4 Gigabyte of RAM. A single instance uses only one core. Table 3.1 lists the notations used in our tests. Notation N D C Meaning The size of the PDMS Average connectivity of the PDMS. Number of potential clusters peers form Table 3.1: Notations in empirical study for acquaintance selection 3.6.1 Experimental settings We used synthetic peers in our experiments as we do not have access to a PDMS large enough for the desired experiments. The topologies that we generated are (1) 62 random D-regular connected graphs and (2) scale free (a.k.a. small world) graphs. These two topologies represent the initial established mappings among peers in the PDMS. The D-regular topology simulates those PDMSs in which peers have a similar number of acquaintances. This is commonly seen in networks with no particular “super servers” [70]. Moreover, for a PDMS that does not initialize its topology as D-regular, if the peers independently select a similar number of acquaintances, then the network topology will migrate towards a D-regular one, where D equals the number of acquaintances each peer has. Therefore the Dregular topology also represents what will happen in a long running PDMS. The scale-free topology has been observed in semantic networks [115] and social networks [38, 84] in which peers’ degree distribution follows a power-law distribution (e.g., the Yule-Simon distribution [112]). For the scale-free topology, the average connectivity D is calculated as the average of all the peers’ degrees. The mappings and mapping effectiveness used in the tests were drawn from Gaussian distributions. However, the potential number of clusters (C) is exposed to the one-shot scheme. Because we simulated the selection process using one computer, networking factors are ignored. Our empirical study focuses on the effectiveness of the selection schemes and the extra computation overhead that they impose. The impact of the overlay layer to the physical network is future work. Both schemes uses prior knowledge on the mapping effectiveness (MF is used throughout the experiments). To get meaningful prior values and also to make the generated synthetic mapping effectiveness close to real-world schema mappings, we examined 20 schemas from [10] about retail stores, customers and products. We manually mapped the 5 schemas that model customers and applied the effectiveness measure MF on them; the average MF value between customer schemas is 0.65; the average MF values for mappings from customer schemas to store schemas is 0.29; mappings between the 4 schemas involving products had an average MF value of 0.67. This helped us to ensure that our synthetic schemas — which allowed us to effectively vary the parameters that we wished to explore — were realistic. We can see that schemas do have a quite clear separation on the mapping effectiveness. Interestingly, one schema named “Product orders and deliveries” actually has an MF value of 0.5 on mappings to the customer group and only 0.22 63 on mappings to products — the one-shot scheme would cluster it with the customer schemas. The fact that schemas naturally form clusters based on their contents is also discovered in existing works. In [40], the authors examined the 877 online book stores and find they have high schema similarity; in [113], a tool is developed to explore a similar clustering behavior among schemas and in [81], schema clustering is used to identify domains that schemas belong to and further guide the creation of mappings. While it is tempting to attempt to verify experimentally that one of two-hop or one-shot is better than the other, this would not lead to any meaningful results, and attempting to include any such results would only mislead the reader. The reason is as follows: one-shot performs a simple, global search. In contrast, twohop performs a local search, and makes changes at each round. Thus, the “best” acquaintance to add for one-shot may not be the “best” acquaintance to add by the time that two-hop explores that acquaintance — in the intervening rounds, two-hop will have added more acquaintances and performed more work to update mapping estimates, thus possibly leading to a new “best” choice. Section 3.6.3 and Section 3.6.4 present our empirical study for the two selection schemes respectively. While the two schemes are not directly compared to each other, they are both compared to the best possible (optimal) acquaintance selection that for their own scenarios and strategies respectively. The experiments in Section 3.6.3 and Section 3.6.4 are quite different from another in order to explore the algorithms fully. The fact that exploring the algorithms fully requires a completely different suite of experiments further illustrates why we cannot compare the two — just as it would not make sense to see if two-hop would choose an acquaintance that one-shot did when that acquaintance may not be the best choice when two-hop explores it, it does not make sense to say how many rounds it takes one-shot to explore the whole topology, since it explores all peers at once. Since the simulation platform is a single PC, it does not show the networking behavior of the acquaintance selection. For OSME, the networking overhead occurs at the probe stage where the communication complexity is O(|E|) (or O(DN) for a D-regular topology and O(N 2 ) for arbitrary topology) and no further networking overhead is required. For THME, the networking overhead is bounded by the number of two-hop paths from the host peer to other peers in the PDMS, which is 64 also O(N 2 ). Therefore, we do not specifically study networking overhead. 3.6.2 Evaluation of chordless path finding algorithm The key factor affecting the performance of the max-min estimator (Section 3.5.2) is the performance of the chordless path finding algorithm in Section 3.5.1. Figure 3.9(a) shows the number of chordless paths (CP) between a randomly chosen peer and all other peers in a PDMS with a 20-regular graph as the initial topology. We varied the network size from 100 to 300 peers. Figure 3.9(b) shows the time for the chordless path finding algorithm to compute all the paths, along with the max-min estimates. 5 15 x 10 8 Time(second) #Chordless Paths 10 6 4 10 5 2 0 0 100 140 180 220 260 Size of the PDMS 300 (a) # CP vs. N, D=10 0 2 4 6 8 10 Number of chordless paths x 105 (b) time vs. # CP, D=10 Figure 3.9: Testing the CP finding algorithm. Figure 3.9(a) shows that when the PDMS size, N, gets large, the number of chordless paths increases very quickly. This validates that we need a fast estimation of the current aggregate mapping effectiveness. Figure 3.9(b) empirically validates Theorem 3.5.1: the chordless path finding algorithm is linear in the number of chordless paths. The algorithm also performs very quickly: it computes around 400K paths in 4 seconds. The main reason is that the CP finding algorithm discovers new paths by extending already discovered paths so that even long paths can also be discovered very quickly. Since the chordless path algorithm performs quickly, the overhead of the max-min estimator is small. 65 Name K C Dist Topo ε Meaning segment length true cluster size effectiveness distribution SON topology EM Convergence threshold Value 15 4 Gaussian D-regular 0.01 Table 3.2: Default values for one-shot selection 3.6.3 Evaluation of the one-shot selection scheme We conducted a set of experiments on the one-shot selection scheme. We first tested the accuracy of the one-shot estimator (OSME) on D-regular graphs with the average number of peers (D) varied from 10 to 15. The number of peers in the PDMS (N) scales from 100 to 600. The number of clusters (C) is set to 4 with true in-cluster mapping effectiveness distribution set to N (µ, δ 2 ) = N (0.8, 0.02) and the inter-cluster mapping effectiveness set to follow N (0.3, 0.05). The distribution parameters were hidden to the one-shot estimator which was initialized with N (0.7, 0.05) for in-cluster and N (0.4, 0.05) for inter-cluster effectiveness distribution. Hence, the initial θ to start OSME was (0.7, 0.05) (0.4, 0.05) (0.4, 0.05) (0.4, 0.05) (0.4, 0.05) (0.7, 0.05) (0.4, 0.05) (0.4, 0.05) (0.4, 0.05) (0.4, 0.05) (0.7, 0.05) (0.4, 0.05) (0.4, 0.05) (0.4, 0.05) (0.4, 0.05) (0.7, 0.05) Default values for other parameters are shown in Table 3.2 Figure 3.10(a) shows the root mean square error (RMSE)3 for a set of estimations of the direct mapping effectiveness from the one-shot estimator. The results show that the one-shot estimator in general is more accurate when more initial mappings are observed (i.e., with larger D). Figure 3.10(b) also shows this trend. In general, a lower misclassification ratio (calculated as #misclassi f ication/N) leads to a smaller estimation error. Note that we have some runs with no misclassification as shown in Figure 3.10(b), but the corresponding RMSE in Figure 3.10(a) 3 RMSE is defined as 1 N 2 ∑N i=0 (yˆ − y) , where yˆ is the estimation and y is the true value. 66 D=10 D=12 D=11 D=13 D=15 0.6 Misclassification ratio RMSE 0.3 D=14 0.2 0.1 100 200 300 400 500 0.5 0.4 0.3 0.2 0.1 0 100 600 200 300 N 400 500 600 N (a) Estimation error (b) Misclassification ratio Figure 3.10: OSME accuracy on the D-regular topology. is above zero. This error comes from both the variance of the true effectiveness distribution and OSME’s error in its inference on parameters (θ ). We can see that OSME estimates θ with good precision. This set of experiments also shows how optimally the one-shot scheme can perform, with different numbers of observed mappings. For C = 4, random assignment generally gives a misclassification of 0.75 and an oracle produces no error. Figure 3.10(b) shows that for a PDMS with as many as 600 peers, if (on average) each peer has 14 acquaintances, the one-shot consistently achieves an accuracy above 90%. time (sec) 1000 N=200 D=11 D=13 D=15 time (sec) 1200 800 600 400 200 0 100 200 300 400 500 600 N=400 N=600 1200 1000 800 600 400 200 0 10 11 12 13 14 15 D N (a) OSME time vs. N (b) OSME time vs. D Figure 3.11: OSME time on the D-regular topology. We recorded the running time of the one-shot estimator for the above experiments. The average time used per EM iteration is shown in Figure 3.11. We chose to use average time because different instances require different number of itera67 tions to converge. We observed that in our experiments most runs converge within 5 iterations. The results presented in Figure 3.11(a) show that the running time for each iteration increases linearly with the size of the PDMS. This matches our theoretical prediction in Theorem 3.5.6 (Section 3.5.3). Figure 3.11(b) shows that the time consumption on each iteration also increases linearly with the connectivity parameter (D), which validates the assessment in Theorem 3.5.6. Similar behavior is also observed for experiments on the scale-free topology; those results are shown in Figure 3.12. Figure 3.12(a) shows the misclassification D=10 D=12 D=14 0.06 D=10 1200 D=12 1000 D=14 800 600 400 200 0 100 200 time (sec) Misclassification ratio ratio and Figure 3.12(b) shows the time needed for the computation. 0.04 0.02 0 100 200 300 400 500 600 N 300 400 500 600 N (a) Misclassification ratio (b) OSME time vs. N Figure 3.12: OSME on the scale-free topology. We can see that the reordering process in Section 3.5.3 in OSME takes advantage of the high-degree nodes in the scale-free topology so that OSME performs more accurately compared to using the D-regular topology with the same D value. As a result, for scale-free networks, fewer established mappings are required for OSME to deliver good estimates. Finally, we empirically show how many established mappings are required for the one-shot estimator to make good estimation for both types of topology. The results are shown in Figure 3.13. We can see from Figure 3.13(a) that on the Dregular topology, for a PDMS with no more than 100 peers, D = 10 is sufficient. This number grows with the size of the PDMS and for N = 600, the required connectivity increases to 15. On the scale-free topology side, Figure 3.13(b) shows that for N as large as 600, an average degree of 13 is already sufficient. Note here for both cases, the required connectivity does grow slowly compared to the corre- 68 min(D) for err < 10% C=5 C=6 C=8 DR N=300 18 24 33 DR N=600 17 24 35 SF N=300 13 14 20 SF N=600 13 14 19 Table 3.3: Required D to keep misclassification below 10% using the oneshot estimator. N=100 N=200 N=300 N=400 N=500 N=600 0.5 0.4 0.3 Misclassification ratio Misclassification ratio sponding growth on N. This indicates a good scalability of the one-shot estimator. 0.2 0.1 0 0.08 N=100 N=200 N=300 N=400 N=500 N=600 0.06 0.04 0.02 0 10 11 12 13 14 15 10 11 12 13 14 15 D D (a) for the D-regular topology (b) for the Scale-free topology Figure 3.13: Requirements on initial information for the one-shot estimator. We also generated extra data sets in which peers form a different number of clusters and re-ran the one-shot estimator on the D-regular (DR) and Scale-free (SF) topologies. We tested for C = 5, and C = 6 and the results show that keeping the misclassification ratio low (< 0.1), for the D-regular topology, requires more observed mapping effectiveness to make the right inference. For the scale-free topology, the required average connectivity does not increase much. The results are shown in Table 3.3. The different behavior of the one-shot estimator on the D-regular and Scalefree topologies is interesting. We further investigated the detailed logs of the experiments and the logs show that the difference is caused by the peer-reordering strategy (Section 3.5.3). During reordering, due to the fact that in the Scale-free topology, there are some peers with very high connectivity, they are re-arranged to the first several blocks. And because of the adequate number of observed effec69 (µ, δ 2 )same , (µ, δ 2 )di f f λ = 5 (.2, .04), (.9, .04) λ = 10 (.3, .01), (.8, .01) λ = 20 (.35, .0025), (.75, .0025) N=300 (.241, .05), (.885, .05) (.343, .03), (.817, .03) (.353, .01), (.75, .01) N=600 (.203, .05),(.892, .05) (.33, .03), (.807, .03) (.348, .0063), (.75, .0047) Table 3.4: Estimation of mean and variance using the one-shot estimator for cluster effectiveness distribution following an exponential distribution. tiveness, those peers are correctly clustered. Once these “hub”s are correctly classified, their connecting peers are also correctly classified, which eventually makes the Scale-free topology require less connectivity in order to achieve high accuracy. Although the one-shot estimator assumes a Gaussian distribution of the mapping effectiveness between the peers, this is not a hard constraint. From the analysis of the one-shot estimator we already know that as long as the in-cluster mapping effectiveness and inter-cluster mapping effectiveness is different, the one-shot scheme will work. To test this, we generated another set of similarities between peers and set them to follow exponential distribution with λ = 5, 10 and 20. For in-cluster mapping effectiveness, the upper bound is set to 0.4 and for inter-cluster mapping effectiveness, the lower bound is set to 0.7. Our experiments showed that for λ = 10 and 600 peers on the D-regular topology, and as long as D >= 10, there is no misclassification. This is reasonable because the two distributions do not overlap and thus it is even easier for the one-shot to classify this “easy case”. The one-shot estimator also inferred the mean (0.3 and 0.8) for the effectiveness measure with high accuracy but the estimated variance is higher than the true variance (0.01). Table 3.4 shows the estimation for different λ values and N = 300 and 600 respectively. From the above tests, we can conclude that the one-shot estimator is able to estimate well using a small amount of mapping effectiveness information for PDMSs with D-regular or scale-free topologies. The estimator also scales well both on estimation accuracy and operation efficiency with the size of the PDMS for the two types of topologies studied. 70 45 D=3 D=5 D=7 maximal shortest distance 40 35 30 25 20 15 10 5 0 0 200 400 600 800 1000 1200 1400 network size Figure 3.14: Maximal pairwise shortest distance in random graphs with different connectivity. 3.6.4 Evaluation of the two-hop selection scheme In this section, we evaluate the two-hop acquaintance selection scheme. First we conducted a set of experiments on the speed for a peer to explore the PDMS to empirically validate Theorem 3.5.7. We computed the maximal shortest distance(MSD) between any two peers in a PDMS with a different random D-regular mapping topology. Letting this distance be d, a peer will take O(log d) rounds to discover the existence of all other peers in the PDMS if they also carry out twohop selection. If the host peer is the only one in the PDMS to perform the two-hop selection, then it needs O(d) rounds to discover the entire population. The results in Figure 3.14 show the d value under different settings of PDMSs. The results show that when each peer has more than 5 acquaintances, a peer can discover other peers very quickly. This validates the claim in Theorem 3.5.7. We first evaluated the accuracy of the two-hop acquaintance selection. We show results for PDMSs with D-regular and scale-free topologies with D = 11. The PDMS scaled from 100 to 600 peers. Each peer performed two-hop selection in a round-robin fashion for 9 rounds and in each round chose the top 3 candidates as its acquaintances. We trained the initial model with a small data set with 20 peers whose mapping effectiveness distribution was purposely set to deviate from 71 the one used for testing. This model discrepancy was set to test the estimator’s ability to refine the model during selection. We examined how well the peers selected their 3 acquaintances in each round. For each of the 3 acquaintances chosen by a peer in each round, we checked with an “oracle” (which knew the true direct mapping effectiveness) to see if the peer had chosen the best t% of the available candidates. A selection was considered as a “hit” if its rank was in the best t%. We computed the hit ratio as the quality of the selection. We started with t = 20 and gradually decreased t until a significant number of selections fell out of the top t%. The results in Figure 3.15 and Figure 3.16 show the hit ratio for all peers in the PDMS at each round, for different sizes of PDMSs with the two types of topology. Figure 3.15 shows an interesting trend across the rounds in which peers performed selection. In the early rounds when the pre-trained model still deviated from the true distribution, a fair portion of selections failed to hit the best possible choices. However, the situation quickly improved in succeeding rounds, and this was observed for all PDMS sizes. Take N = 400 for example; the hit ratio climbed to 80% after 3 rounds and kept increasing to finally exceed 95% at round 9. This suggests that the two-hop scheme does improve its prediction along with the selection process. The same experiments conducted on the scale-free topology which the results are reported in Figure 3.16 demonstrated a similar behavior. N=200 N=400 N=600 Top 10% hit ratio 1.2 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8 9 round Figure 3.15: Hit ratio for two-hop on the D-regular topology where t = 10%. We further measured the significance of the two-hop acquaintance selection by comparing the actual mapping effectiveness benefit gained against the benefit that peers get if they randomly choose their acquaintances. The box plot [121] in 72 N=200 N=400 N=600 Top 10% hit ratio 1.2 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8 9 round Figure 3.16: Hit ratio for two-hop on the scale-free topology where t = 10%. Figure 3.17 shows the results for a PDMS of size 400 with a D-regular topology. 0.5 Selection benefit statistics 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 round Figure 3.17: Two-hop vs. random on the D-regular topology. In Figure 3.17, the notched boxes represent the benefit achieved when twohop selection is applied. Each box represents statistics for the 1, 200 selections in that round. The lower series of boxes are the corresponding statistics of the benefit gained from random selection. It shows that, with strong statistical support, the benefit achieved by the two-hop selection outperforms random selection in all rounds. Similar behavior was also observed for the statistics collected for the scalefree topology. Thus, these experiments show that that the two-hop selection scheme can accurately help peers choose acquaintances that maximize its query answering benefit. We also recorded the speed of the above experiments. Figure 3.18 shows the results for the D-regular topology. Figure 3.18(a) compares the average time (over 73 all selections) needed in different rounds. Figure 3.18(b) shows the time needed 12 Avg. selection time(sec) Avg. selection time (sec) for different sizes of PDMSs. Figure 3.18 suggests a linear increment in the time N=100 N=200 N=400 N=600 10 8 6 14 12 10 8 6 4 2 0 100 4 2 0 1 2 3 4 5 6 7 8 9 round round 1 round 3 round 6 round 9 200 300 400 500 600 N (a) avg. time vs. round (b) avg. time vs. N Figure 3.18: THME time on the D-regular topology. consumption along selection rounds for all N. This happens because in later rounds a peer has to evaluate more candidates, which in turn consumes more time. Figure 3.18 shows a faster than linear increasing trend for the time consumption to grow with N. The time increment especially for N > 500 suggests that, although it takes only about 10 seconds to finish a round of selection, we should still explore more efficient incremental update methods (e.g., to reduce the number of re-estimations in later rounds) for the two-hop scheme on large PDMSs. A similar trend on the scale-free topology was also observed. This set of experiments suggests that the two-hop scheme quickly selects acquaintances and scales with the PDMS size. As a summary for the empirical study of acquaintance selection, we tested both the accuracy and the time overhead of the two selection schemes. The results show that both schemes help peers choose good acquaintances. Both schemes scale to large PDMSs. We did not compare the two schemes as they require different assumptions and suit different PDMS scenarios. In this chapter, we investigated the acquaintance selection problem. Our solution helps domain heterogeneous (DH) data sources to better choose acquaintances with which to establish schema mappings so that query answering for data sources is optimized. Our acquaintance selection framework defines the acquaintance se- 74 lection workflow and the necessary computational procedures. The design goal is to provide a concise and yet flexible enough framework that supports different selection criteria. Our one-shot and two-hop schemes help to quickly estimate direct mapping effectiveness before efforts to fully establish them are required, making the acquaintance selection highly efficient. The acquaintance selection algorithms are empirically evaluated using synthetic data and with different peer topologies. Our theoretical claims are verified by the empirical results. 75 4 THE DECOMPOSITION AGGREGATION QUERY In Chapter 1, we motivate the domain heterogeneity integration with two examples. Example 1.1.1 describes the demands to find out cell damage from databases of building damage assessment; and Example 1.1.2 provides another case study where a room’s building cost needs to be estimated from the building components used in the design. The gap is thus clear that data (objects) made available by databases in separate domains are defined at different granularities; therefore, a query for one database cannot be answered by a SPJ query on another. Computing the damage and loss attributes in Example 1.1.1 requires translating queries on Cell into aggregate queries on BdnDmg data. This requires that the data integration system to be able to use the connection between an object (called “compounds” here after) and its constituent, finer granularity objects (called “components” here after). One difficulty is that if the users do not know the data sources (e.g., one that hosts BdnDmg) beforehand, it is impossible to pre-determine the aggregate queries. This problem motivates the focus of this chapter: how can we bridge the gap between domain heterogenous data sources by automating transforming queries over complex compounds into the components that form them. Answering domain heterogeneous aggregate queries is challenging for three reasons: (1) domain heterogeneity prevents users from knowing all domains (2) 76 query transformation must be transparent to users and (3) aggregations must be collected from multiple databases. These challenges require supporting fully automatic query translation at the system level. Domain heterogeneous queries occur in many scenarios, e.g., estimating the cost to build a room. The cost of a room is estimated by decomposing the room into its constituent parts (e.g., windows and beams) and then aggregating their costs from providers’ databases. Another example is computing the total cost of online shopping carts [11]. When items are in multiple stores, decomposing a cart into items and aggregates to find the total price is non-trivial. This chapter is organized as follows. Section 4.2 contains the preliminaries and Section 4.3 summarizes related work. Section 4.4 formalizes the DAQ and describes a “two-phase” query answering scheme. Section 4.5 and Section 4.6 are devoted to the aggregation rewriting which translates queries on compounds into aggregate queries on components. The empirical study is presented in Section 4.7. 4.1 Decomposition Aggregation Queries q-node K2 d-node I K1 J I: Cell(cellid, shape,damage) a-nodes J:BdnGIS(bid, pos) K1 ,K 2 :BdnDmg(bid, intensity, damage) Query at node I: Q('C1', d) :-Cell('C1', _, d) Figure 4.1: A 3-role structure for Example 1.1.1. The DAQ bridges the gaps between domains such that the problems illustrated in Example 1.1.1 can be solved systematically. Our results stem from the first major contribution of the solution: a novel 3-role structure discovered from the data sources as shown in Figure 4.1. The roles are as follows: (1) the Query Node (q-node) (e.g., node I) poses a query on compounds — objects which are defined as a set of components (e.g., “Cell”); (2) the Decomposition Node (d-node) (e.g., node J) breaks the compounds into components — the sub-elements needed for the aggregation (e.g., “BdnGIS”). Finally, (3) the Answer Nodes (a-node) (e.g., K1 , K2 ) processes aggregates. This 3-role structure allows queries on compounds in 77 one domain to be transformed to aggregates of components in another domain; it bridges the gap between different domains. Formally: Definition 4.1.1. [3-role structure] A 3-role structure is a directed graph T (V (q, d, A), E), where data sources are the vertices and mappings are the edges. q and d are called q-node and d-node respectively; and A is a set of data sources called a-nodes. Schema mappings in E are either from q to d or from data sources in A to d. q maintains compounds that are formed by components from data sources in A. This 3-role structure is commonly seen in cross-domain query answering problems. A d-node is usually a data repository that many data sources map their entities to. In JIIRP ( Example 1.1.1), a GIS database is such a repository. Similar repositories are found in many other domains; for example in manufacturing, standardized parts all refer to a global standard ISO code. A d-node also helps identify and remove duplication since components on a-nodes may overlap. While the q-nodes, d-nodes, and a-nodes are logically separated by their roles in the 3-role structure, they do not have to be physically distributed data sources. The physical database locations and the way they are mapped depend on the semantic integration environment. In a peer data management system (PDMS), they can be 4 peers (2 a-nodes) connected by pairwise schema mappings; in a data integration system using a mediated schema, queries are translated between data sources via the help of mediated schema. The 3-role structure is discovered1 by examining the mappings between data sources. No additional work is needed to “set up” the 3-role structures. The cell, building, and damage assessment databases in Figure 4.1 are automatically identified to form a 3-role structure, as we show later in the planning step (Section 4.5.2) of DAQ answering. In this chapter, we define decomposition aggregation queries (DAQs) over this 3-role structure and shows how to process DAQs using aggregation rewritings. Processing DAQs has some distinct features that are different from existing semantic integration query answering: 1. Instead of only querying objects of equivalent 1 We emphasize that it already exists in the semantic integration settings. 78 granularity, queries over compounds are transformed into queries over the constituent components. 2. Instead of processing open queries for “tuples satisfying some condition,” DAQs are closed: they solve for a pre-determined set of compounds. Hence, in addition to sending translated queries during query answering, data sets are also sent among data sources. 3. Decomposition of compounds on the 3-role structure relies on data sources that do not directly answer the query; previously only data sources that directly answered the query were useful. 4. When multiple data sources for the components are present, choosing the right sources to use becomes important. Moreover, the distribution of answers from different combinations of data sources is also of interest. We make the following specific contributions for supporting the decomposition aggregation query (DAQ). 1. We propose an architecture-independent solution to semantically integrate domain heterogeneous schemas using aggregations over 3-role structures. 2. We introduce the decomposition aggregation query (DAQ) and propose a generic, scalable query rewriting algorithm that can be used together with existing query rewriting techniques to translate DAQs into aggregate queries that can be answered by domain heterogeneous data sources. (Section 4.4, Section 4.5, Section 4.6) 3. We formalize the semantics of DAQ answers and propose a two-phase query answering scheme that respects both the efficiency and quality of query answering. 4.2 Preliminaries 4.2.1 Mapping tables Mapping tables are relational tables which explicitly state the association between elements in two schemas. The attributes of a mapping table refer to the schema elements in the mapped data sources. In most cases the attributes are the keys of the corresponding relations. For example, a mapping table mt(cellid, Bid) defines the correspondence between cell instances and building instances. 79 In query translation, mapping tables are used as normal relations. A mapping table mt(key1 , key2 ) is equivalent to the following mapping rule set: Q1 (key1 ) :− R1 (key1 , ...) Q2 (key2 ) :− R2 (key2 , ...) M(key1 , key2 ) :− Q1 (key1 ), Q2 (key2 ), mt(key1 , key2 ) As the first two rules have no predicate, evaluating M against any database of R1 and R2 yields the same table as mt. Mapping tables have been extended to allow carrying of variables [67]. In this case, each row in such an “extended mapping table” is a mapping that can be expanded to a 3-rule mapping straight-forwardly — the variables and constants are carried into the converted 3-rule mapping and mt in rule-3 can be removed. 4.2.2 Open world and closed world assumptions The OWA and CWA are two concepts asserting “truth” in query answering. Informally, the CWA states that if a data source with a relation R does not include a tuple t, then t is false. The OWA assumes the opposite — if a relation R does not include a tuple t, then t may still be true. To correctly decompose compounds, mappings must adhere to the CWA — otherwise it is not guaranteed that the aggregation rewriting will be complete. For example, if both of the cell and building schemas use the OWA, we can assert that the mapped buildings are in a cell but cannot justify that no other buildings are also in the cell. The decomposition between two OWA schemas may not be complete and thus leads to the difficulty of justifying the semantics of an aggregation on it. To distinguish the mappings that can be used to decompose compounds from incomplete mappings, CWA must be applied on the target schema of the mapping. For mapping rule sets, the CWA is applied on the schema of the second rule (the target schema) of a mapping written in the 3-rule format. For mapping tables, the target schema in the mapping is considered to hold a complete set of data instance satisfying the CWA. It is only when the CWA is applied that we can be sure a decomposed component set is complete and therefore an aggregation correctly 80 represents an attribute of the compound. For example, the monetary loss of a cell is defined as the sum of the loss on ALL the buildings in the cell. It is only when the “ALL” can be enforced that the aggregation makes sense. At the same time we limit the use of the CWA to the small set of decomposition mappings. The implication is twofold. First, the OWA is often preferred in semantic integration: e.g., the mapping “K1 -J” in Figure 4.1 uses the OWA on both K1 and J; second, the CWA is required only for the components in the corresponding decomposition mappings. Fortunately, many data repositories can have the CWA applied, justifying its use. For example, in JIIRP a GIS is believed to contain records of all the buildings on campus; in engineering, makers of bolts and nuts refer to global ISO standard codes; in scientific research, researchers associate their work on genes to big gene repositories like NCBI and Ensemble. These repositories usually maintain a complete set of entity records (e.g., buildings and products) for reference and thus are very reliable sources for compounds to be defined on. However, as opposed to a mediated schema approach, the repositories do not collect all the features for its entities so that we usually cannot “look up” schema elements as we do for a mediated schema. (E.g., the GIS does not manage damage assessment schema or data.) 4.3 Related Work Cohen et al. [28] show that the complexity of rewriting aggregation queries is NPcomplete for complete rewritings and polynomial algorithms for partial rewritings only exist for linear queries. They also provide a theoretical framework to unify previous research on rewriting aggregation queries. Our work differs from these approaches since we discover aggregations and the grouping functions necessary to transform a query to aggregations instead of using aggregation views for query rewriting. In [3], the authors study aggregate query processing in data exchange. Their aggregations are discussed w.r.t. possible worlds at the target schema; our aggregations are rewritten, sent and executed locally on the data sources that host the components. Therefore, we are not computing an aggregation over a target instance with many “possible and undecided states” as in [3]; instead we retrieve information directly from those data sources. Therefore, the assumptions and techniques 81 in [3] do not apply to DAQ processing. The OWA is used by many semantic integration systems (e.g., Piazza [53] and Hyperion [7]) because real-life applications cannot assume a data source is omniscient. However, aggregation requires using the CWA to assert that the aggregated set is complete [3, 48, 73, 74]. We incorporate the CWA for aggregations while still respecting the OWA nature of the data sources as much as possible. Similarly, [74] allows both to co-exist. Our work binds the CWA and the OWA to mappings instead of to data sources so that a data source can flexibly apply the CWA or the OWA as needed. Piazza [53], Hyperion [7] and Orchestra [57, 118] are semantic integration systems. Piazza allows mediated schemas which peers use to translate queries. JIIRP uses a more flexible, pairwise mapping system instead of a mediated schema. Multi-hop query translation is thus pairwise. Hyperion uses pairwise mappings and mapping tables and thus is closer to our setting. Orchestra adopts a data exchange setting and synchronizes updates on one data source to other sources in the system. This is different from our setting: we always translate queries to retrieve information directly from the data sources that own and maintain the information. None of the above architectures support translating aggregate queries; the DAQ query processing model can be considered as a useful extension to the existing systems. In-network aggregation [30] is widely used in distributed architectures such as sensor networks [79, 83]. While DAQ query processing also uses in-network aggregation, our scenario differs from existing works: (1) in-network aggregation is applied in the 3-role structure mainly to balance the load for the d-node while in sensor network is to lower transmission overhead; (2) the 3-role structure does not have a complex topology as in [83] and (3) we do not use all available a-nodes in a 3-role structure for a DAQ while in a sensor network setting, all usable sensor nodes are involved in aggregation. Because of these differences, we do not need to design special aggregate schemes [76] for different aggregate functions or to apply approximations [9]. 82 4.4 DAQ Formalization This section formalizes a DAQ’s key components. We first define the semantics of aggregate answers since it differs substantially from traditional aggregations on a single database. Then, we describe the decomposition mapping and aggregation bindings required by the aggregation rewriting which translates queries over compounds into aggregate queries over components. Creating and validating mappings between data sources are active research topics and beyond the scope of the discussion here. The work covered in this chapter focuses on query processing and assumes that the meta-data information necessary to processing queries using aggregations is available. 4.4.1 The semantics of DAQ answers Because a component may have different values in different data sources, using different data sources for the same aggregate query may result in different answers. Therefore, the semantics of an aggregation in semantic integration differs from querying a single database which contains only a single value for a component. We now define the “viable answer” for an aggregation in semantic integration: Definition 4.4.1. [viable answer] Let s denote a set of sources answering an aggregation, and let v = Agg(s) be the aggregated value computed from s. Let V = {vi } be the set of aggregated values from all possible source combinations. A viable answer to an aggregation is a value in the interval [inf(V ), sup(V )] that adheres to type restrictions (e.g., integer) where inf and sup are infimum (greatest lower bound) and supremum (smallest upper bound) respectively. We allow any value in the defined interval, even if it does not correspond to the value produced by any source combination. There are several reasons for defining a viable answer in this way. First, when the SemIS computes different aggregate answers with different data source combinations, it is impossible to decide which answer to use without extra information; therefore, multiple answers should be accepted. Second, treating the answer distribution as a continuous range is convenient for analysis. For example, if we use the mean of an aggregation as a scalar answer, we do not need to worry if the mean value can be computed by a certain source combination. Finally, it is easy to see that with the exponential space of 83 possible source combinations, determining if a single value is computable from any source combination or determining if there exist source combinations that produce answers in a given range is hard. Extending the answer semantics to aggregates to allow a range of values does not mean that viable answers are approximate answers even though query answering for semantic integration often results in approximate answers. For example, when answering queries using views, it is often only possible to obtain contained rewritings instead of equivalent rewritings. When answering aggregations, missing data and duplicates also force us to return approximate answers. It is necessary to distinguish “viable answers” and “approximate answers”. The existence of different combinations and the inconsistency between different data sources lead the viable answers to be defined as a value interval in Definition 4.4.1. All viable answers are “accurate” answers while approximate answers are answers that may contain error, e.g., we cannot guarantee an approximate answer is a viable answer. There are several factors that could introduce error to answering aggregate queries: 1. Inequivalent query rewriting. Using answers from contained query rewritings introduces error. 2. Duplications in partial aggregates. Duplications affect “duplication sensitive aggregates” such as sum. 3. Missing data. Some aggregation components may not be covered by any data source. In this case, the returned answer contains error. Given the above semantics, we define DAQ answering as a two phase query answering process (Figure 4.2). Phase 1 returns scalar aggregation results, which is the same as in querying a relational database. The processing goal for phase 1 is to quickly return answers using minimal computational resources in the integration system. Phase 2 returns estimates to the DAQ answer distribution. 4.4.2 The decomposition mapping To correctly compute aggregates, a compound must be decomposed into a complete set of its components. Ordinary mappings do not tell if a mapped component set exactly forms the compound unless the Closed World Assumption (CWA) is 84 Decomposition on 3-role structure DAQ query rewriting two phase query answering Source selection DAQ query rewriting with top-1 selected sources sampling Phase 2: estimate DAQ answer distribution Phase 1: scalar aggregates Figure 4.2: Two major operations interleave in DAQ processing: the DAQ query rewriting performs query translation; the two phase query answering returns results. assumed on the target schema of the mapping, rather than the Open World Assumption (OWA). Details about CWA and OWA can be found in Section 4.2.2. We define a decomposition mapping as a mapping that ensures complete decomposition of compounds by satisfying the following two conditions: Definition 4.4.2. [decomposition mapping] A decomposition mapping is a mapping where (1) a compound in the source schema consists of a set of components in the target schema and (2) the CWA applies to the target schema. The first condition qualifies “consists of” and the second one qualifies “consists only of”. For a DAQ to be correctly answered, decomposition mappings are needed from the q-node to the d-node in a 3-role structure. Mapping 2.1.1 is a decomposition mapping; Cells in the source schema map to sets of Buildings in the target schema. Data mappings [67] can also serve as decomposition mappings. In fact, each data mapping (represented by mapping tables) has an equivalent mapping rule form (a single rule or a set of mapping rules, depending on whether the mapping table 85 contains variables). Refer to Section 4.2.1 for more details on using mapping tables in DAQ processing. Decomposition requires no other additional assumptions on the mappings; therefore existing mappings can be used without modifications. In fact, we only need a weak CWA assumption, so systems that usually adopt the OWA can easily be adjusted for DAQs. See Section 4.2.2 for details. 4.4.3 The aggregate binding binding I-K1 I-K2 L attr damage loss R attr Damage Loss Agg func avg() sum() Table 4.1: Aggregate bindings for relations in Table 1.1 Aggregate bindings encode the aggregate functions to be used for answering DAQs. Table 4.1 shows the aggregation binding for Example 1.1.1. The aggregation function for cell damage and loss are “avg” and “sum” respectively. Formally: 86 Definition 4.4.3. [aggregate binding] An aggregate binding is a triple B(Rσ .a, F, Tτ .b) where Rσ .a denotes attribute a of relation R from schema σ , (Tτ .b) is attribute b of relation T from schema τ, and F is an aggregate function. The semantics is Rσ .a = F(Tτ .b) for appropriate component groupings. Aggregate bindings can be regarded as a special kind of mapping. Therefore, we can formalize a binding using a Datalog rule (omitting other attributes in Rσ and Tτ ): H(a, F(b)) :− Rσ (a), Tτ (b), G; where a dummy predicate G (for grouping) indicates that the aggregation holds only for proper compound decomposition. An aggregate binding specifies the aggregate function F and the corresponding attribute pair, but not an explicit specification of G which will be computed during decomposition in aggregate rewriting. While we leave automatically determining which aggregate binding to use as future work, specifying aggregate bindings is easier than creating mappings between data sources because it does not require tuple generating logic. We believe that finding aggregate bindings has two key parts. First, an initial binding between two sources must be provided from outside the integration system. The data source that holds the components could specify the aggregate function suitable for attributes; or the person who defines a compound could tell the system one binding; or an initial aggregate binding can be learned, e.g., using technologies described in [72]. Then, using this initial binding as a “seed”, more aggregate bindings can be inferred using mappings between data sources. For example, I −K1 in Table 4.1 (from Figure 4.1) is a seed. If a mapping exists between K1 and K2 that involves “K1 .Damage”, then a binding from I to attributes in K2 can be inferred. Because a binding only needs to identify related attributes, it is relatively easy (w.r.t. the mapping composition operation) to infer new bindings using seeds and existing mappings. For general cases, additional transformations may be needed and this can be captured by the aggregation function. 4.4.4 The aggregation rewriting Queries on compounds are translated into aggregate queries on components through a process called “aggregation rewriting”. In this section we formalize this new type of query translation; the detailed algorithm appears in Section 4.5. 87 An aggregation rewriting is a new subclass of general query rewriting that takes a non-aggregation query as input and produces an aggregation. Formally: Definition 4.4.4. [aggregation rewriting] A query Qτ or a set of queries {Qτ } is an aggregation rewriting if Qτ (q ∈ {Qτ }) is an aggregate query translated from a non-aggregate query. I.e., Qτ ( f (x), g(y)) :− A, where f is an aggregate function, g is a grouping, A is a condition, x is a variable and y is formed by variables and constants. In Figure 4.1, if G is the set of buildings in cell “C1”, Q is rewritten into an aggregate query for K1 : Q (Avg(d), ’C1’) :− K1 .BdnDmg(x, ’VIII’, d), G(x). When a set of aggregate queries are the output, each of them represents a partial aggregate to be later merged into a semantically equivalent aggregation by the dnode. Ideally we would prefer equivalent aggregation rewritings. However, as in answering queries using views, we may only be able to find a contained rewriting. Section 4.6 shows how we use existing query rewriting techniques in aggregation rewriting. Aggregation rewriting extends the query rewriting family by introducing aggregations to the translated query. Traditionally, a user needs to fully specify the aggregation (i.e., both the aggregate function and the group by clause). Instead, we created a fully automatic query rewriting process. Because aggregation rewriting decomposes compounds to discover the grouping of components for the queried compounds; we call this type of query a decomposition aggregation query. Formally: Definition 4.4.5. [decomposition aggregation query (DAQ)] A Decomposition Aggregate Query (DAQ) is a semantic integration query over compounds that requires aggregation rewriting over the 3-role structure to process aggregations over components. The transformation from compounds into components is called decomposition. Example 4.4.1. • Query Q(x, d) :− Cell(x, s, d), area(s) < 20, 000 88 • A 3-role structure T where – qT = I: Cell(cellid, shape, damage); – dT = J: BdnGIS (bid, pos); – aT = {K1: BdnDmg (Bid, Intensity, Damage)} • mapping from q-node to d-node Mapping 4.4.1. [decomposition mapping I → J] Q1 (x, s) :− Cell(x, s, d) (4.4.1) Q2 (a, p) :− BdnGIS(a, p) (4.4.2) m1 (x, a) :− Q1 (x, s), Q2 (a, p), in area(p, s) (4.4.3) • mapping from a-node to d-node Mapping 4.4.2. [K1 → J] Q3 (r) :− BdnDmg(r, V III , ) (4.4.4) Q4 (a) :− BdnGIS(a, ) (4.4.5) m2 (r) :− Q3 (r), Q4 (r) (4.4.6) • Aggregate binding (Cell.damage, avg(), BdnDmg.Damage) or the equivalent rule: H(y, avg(s)) :− Cell(x, , y), BdnDmg(r, s), G We now show how a DAQ, specifically a conjunctive non-aggregate query, is translated and answered by a set of aggregate queries. The aggregation rewriting algorithm is the fundamental operation for the DAQ to be answered. In Section 4.5 a basic version is presented and analyzed; and in Section 4.6 the algorithm is extended to support a wide class of general queries. Specifically, conjunctive queries having no long self-joins or negations can be efficiently translated. 89 on q-node offline Identify 3-role structure on a-node on d-node Perform Aggregation rewriting Execute partial aggregation queries Compute query answers Figure 4.3: General steps to process a DAQ 4.5 4.5.1 DAQ Processing with 3-role Structure Overview of DAQ answering Example 4.5.1. [DAQ answering] As depicted in Figure 4.1, data sources I, J, K1 and K2 form the 3-role structure to process the DAQ Q in Example 1.1.1. First, the cells in Q are decomposed to buildings on the d-node J. Then, aggregation rewriting translates Q into aggregate queries over buildings for a-nodes K1 and K2 ; each covers a fraction of the buildings decomposed from the cells. The results of the aggregates are then merged at J and the final answer is returned to the q-node I, thus completing DAQ answering. As shown in Figure 4.3, processing a DAQ requires 4 steps. Step 1 (on the q-node) finds the 3-role structure to process the DAQ. Step 2 (on the d-node) performs the aggregation rewriting and obtains aggregate queries for a-nodes. In Step 3 (on the a-node), a-nodes receive aggregate queries, process them locally and return aggregate answers to the d-node. In step 4 (on the d-node) the d-node merges partial aggregates and returns DAQ answers to the q-node. In the process described above, the d-node distributes the aggregation load to a-nodes by requesting and receiving partial aggregates from the a-nodes. When an aggregation requires data from multiple sources, the translated aggregate query is considered as a new query. Specifically, the d-node takes the following strategy. First, it sends the translated query to the a-node which runs the aggregation. Then, the a-node sends queries to other a-nodes to retrieve the data required to finish processing the query. Compared with other alternatives, e.g., the d-node performing the aggregation or the d-node forwarding data from other sources to the a-node processing aggregates, this approach poses less overhead to the d-node and the 90 network. Note that DAQ are “targeted” queries that the objects being queried are fully defined at the q-node. Thus the succeeding decomposition and aggregation rewriting steps on the d-node and a-nodes work on a closed set of decomposed components. This is different from standard “open” queries, where any tuples satisfying the conditions of the query are considered as query answers. Such “targeted” query processing requires transmitting data between the data sources. Compared to standard query rewriting of select project join queries in previous semantic integration frameworks, DAQ answering has the following distinct features: 1. The 3-role structure uses an additional data source to decompose compounds and perform aggregation rewritings. 2. Query rewriting uses new types of meta-information: decomposition mappings and aggregate bindings. 3. Targeted query rewriting transmits component references across data sources, in addition to translated queries. 4. Answers to partial aggregates from a-nodes are collected by the d-node to further merge into the final answer. Aside from the aggregation, the biggest change in query answering is the use of the 3-role structure. With the help of the 3-role structure, data sources which do not answer a query, but decompose compounds and merge partial aggregates into DAQ answers can participate in query processing and therefore allows domain heterogeneous data sources holding objects at different granularity to be integrated. A hierarchical aggregation structure similar to what is used in sensor networks [29] is used for DAQ processing. As we compare in detail in Section 4.3, our focus is different because the purpose in sensor networks is to save sensor energy whereas we focus on query rewriting and source collaboration. The technology described here can be applied to sensor networks when sensors have schema heterogeneity problems. 91 4.5.2 Planning the aggregation rewriting The planning step finds appropriate 3-role structures for DAQ answering. While the 3-role structures are usually discovered off-line, the planning process pays a small cost of exchanging a small amount of meta data to determine the 3-role structures for the given query, thus avoids the high cost of wasteful query rewriting and data transmission. The planning step resembles routing in computer networks except that the destination is not known prior to the planning. The following example shows that usually only some of the data sources contribute to the query answering; discovering them early can save substantial effort. S3 Search for "damage" match at S3 di Ca n d at e S1 r 1 ou t e S2 Ca n d id at e r ou t e 2 ag am " d S5 r f o at ch d ar d en Se e a D S4 Search for "damage" match at S4 S5 e" Figure 4.4: A 5-source example 3-role structure Consider answering a cell-damage query with the 5 data sources in Figure 4.4. In this setting, S1 is the q-node where the relation cell is defined; S2 and S5 are two d-nodes where the cells are defined as sets of buildings; and S3 and S4 are two a-nodes hosting building damage assessment data. Query planning starts from S1 and searches for possible answering data sources. S3 and S4 are chosen since they contain damage assessment data. Although S2 has no such data itself, it is still chosen because it decomposes cells on S1 to buildings that S3 and S4 both map to. S5 is considered at the beginning as cells in S1 also map to it, but it is finally 92 dropped as no a-node is found to answer the query. The planning step is performed by flooding a probe request to the data sources. All returned routes containing an edge that is a decomposition mapping and a node that matches at least one of the aggregate bindings is recorded and others are discarded. Additionally, the routes are grouped together by the discovered dnode. The planning process does not require a node with knowledge of all the data sources and is carried out in a fully distributed fashion. The planning step outputs the discovered 3-role structures in the form of (qnode, d-node, {a-node}) triple, where each triple represents a set of query answering routes sharing the same d-node. It only relies on the mappings (mapping rules or the schema of mapping tables) and the aggregate bindings to discover usable 3-role structure. The planning step can be performed with little computational (no query rewriting/evaluation is required) and networking (no data sets are sent) overheads. From the description we can see that the routes that answers the DAQ must be included in the returned 3-role structure. While depending on the query translation, the compounds being queried and the components available to the a-nodes, not all such structures finally answer the DAQ, the planning step still makes the DAQ answering more targeted and efficient. 4.5.3 The basic aggregation rewriting algorithm This section describes how to translate a query on compounds at a q-node into aggregation queries on components on a-nodes. Example 4.4.1 illustrates the algorithm. Basically, the algorithm consists of 4 steps to carry out on the q-node and the d-node in a 3-role structure. The queried compounds are first identified at the q-node then the compounds are decomposed into components at the d-node where the grouping functions are generated with the decomposition. Then, an aggregate query is generated for each of the a-nodes in the 3-role structure. Although the example is simple and we assume a single a-node covers the components in the aggregation, it suffices to show the rewriting steps and the input/output of each node. The DAQ query class supported by aggregation rewriting is broad to include conjunctive queries without recursions, long self-joins or negations. Section 4.6 presents extensions for multiple a-nodes each contributing partial aggregates and 93 pre-processing general, complex queries. Step 1: find compounds to decompose (q-node) Unifying [122] the DAQ with the decomposition mapping’s r1 yields the compounds to decompose (Algorithm 1). In Example 4.4.1, this yields Qq (x, s, d) :− Q(x, d) ∧ Q1 (x, s). Expanding yields Qq (x, s, d) :− Cell(x, s, d), area(s) < 20, 000. Qq gives the desired set of compounds since the unification computes the “intersection” of the compounds that satisfy both Q (compounds queried) and Q1 (compounds decomposable w.r.t. decomposition mapping Mapping 4.4.1). Because the required database exists only on the q-node, Qq is evaluated and the result, Q1 , needs to be sent to the d-node. When data sources are guaranteed availability during query answering, “fetch-on demand (FoD)” is used for data transmission. That is, instead of sending Q1 directly to the target d-node, a temporary view V is created at the q-node. The d-node is sent a selection query to V . When the d-node needs to process the query, it fetches the data from the q-node on demand. This strategy applies to all operations that involve sending blocks of relational data and we denote it by “send(FoD)” in the algorithm description. Algorithm 1: Find Queried Compounds (q-node) input : DAQ Q input : Decomposition mapping M1 (q1 , q2 , m1 ) (q-node to d-node) input : Aggregate Binding (R1 .a, Agg, R2 .b) output: Query Qq to run on the q-node 1 begin 2 if Q contains predicate on R1 .a then 3 drop the predicate with R1 .a in Q; 4 end 5 substitution B ← 0; / 6 Qq .body ← unification(Q, q1 , B); 7 Qq .head ← q1 .head; 8 Apply B to Qq .head; 9 return Qq 10 end In Algorithm 1 and all following algorithms, the unification operation returns the body of a query that unifies the two input rules, as we described in Section 4.2. 94 Step 2: decompose to components (d-node) Qq , the result sent (FoD) from the q-node, is used by the d-node as Q1 in Mapping 4.4.1. In the example, we replace Q1 in m1 (rule Equation 4.4.3) with Qq , and create a new rule m1 : m1 (x, a) :− Qq (x, s, d), Q2 (a, p), in area(p, s) :− Qq (x, s, d), BdnGIS(a, p) Evaluating m1 on the d-node returns the decomposed component set for the compounds in the d-node. However, we need to find the components on the a-node. Thus, rather than evaluating m1 on the d-node, we unify m1 with rule Equation 4.4.5 in Mapping 4.4.2 (Algorithm 2). In our example this yields: Qd (x, a) :− m1 (x, a) ∧ Q4 (a) :− Qq (x, s), BdnGIS(a, p), in area(p, s) Qd uses Mapping 4.4.2 to compute which components in the decomposed set may be mapped on the a-node. Qd is evaluated at the d-node; result Q4 is sent to the a-node. Algorithm 2: Decomposing (d-node) input : Q1 , the result of evaluating Qq on the q-node input : Decomposition mapping M1 (q1 , q2 , m1 ) (q-node → d-node) input : Mapping M2 (q3 , q4 , m2 ) (a-node → d-node) output: Query Qd to run on the d-node 1 begin 2 substitution B ← 0; / 3 Qt .body ← unification(q2 , q4 , B); 4 Qt .head ← q4 .head; 5 Apply B to Qt .head; 6 Qd .head ← Qt .head ∪ Q1 .head; 7 Qd .body ← m1 .body; 8 Replace the subgoal q1 in Qd .body with Q1 ; 9 Replace the subgoal q2 in Qd .body with Qt ; 10 Expand Qd with Qt ; 11 return Qd 12 end 95 Step 3: get the grouping function (a-node) Continuing our example, the result (sent(FoD)) of evaluating Qd on the d-node is used by the a-node to replace Q4 in rule Equation 4.4.6. This transforms m2 to m2 (x, r) :− Q3 (r), Qd (x, r). m2 associates the compounds (x) with the components (r) on the a-node. Expanding m2 results in the required grouping function, G: G(x, r) :− BdnDmg(r, V III , ), Qd (x, r) Algorithm 3: Finding the grouping function (d-node) input : Q4 , the result of evaluating Qd on the d-node input : Mapping M2 (q3 , q4 , m2 ) output: Grouping function G for the a-node output: Query Qu for unmapped components 1 begin 2 G.body ← m2 .body; 3 G.head ← m2 .head ∪ Q4 .head; 4 Replace the subgoal q4 in G.body with Q4 ; 5 Qu .head ← Q4 .head; 6 Qu .body ← G.body; 7 Move Rd in Qu .body to the right of q3 ; 8 Replace the join operator in Qu .body with the right anti-join operator; 9 Expand G with q3 , Qu with q3 ; 10 return G, Qu ; 11 end Algorithm 3 formalizes Step 3. In addition to the grouping function G, Algorithm 3 generates an anti-join 2 query Qa that computes the components sent from the d-node that are not mapped by the components on the a-node. The a-node evaluates Qu and sends the unmapped components back to the d-node. Section 4.6 discusses a generalization where the d-node partitions the decomposed component set using the result, Qu , for multiple a-nodes. Step 4: generate the aggregate query Using the aggregate binding and group2 antijoin( ): R S = R−R S 96 ing function G from Step 3, the aggregate queries are generated for the a-nodes: Qagg (x, avg(s)) :− BdnDmg(r, V III , ), G(x, r) :− BdnDmg(r, V III , ), Qd (x, r) where avg(s) replaces the variable y in the original query, relation R3 replaces R1 in the binding and the grouping function is G from Step 3. Algorithm 4 formalizes Step 4. Algorithm 4: Aggregation Rewriting (d-node, a-node) input : DAQ Q input : Aggregate binding (R1 .a, f , R2 .b) input : Grouping function G from Algorithm 3 output: Aggregate query Qagg for the a-node 1 begin 2 substitution B ← 0; / 3 Qagg .head ← Q.head; 4 u ← the variable at the position of R1 .a in Q; 5 Qagg .body ← unification(R2 (x), G.body, B); 6 Apply B to Qagg .head; 7 if Q contains predicates on R1.a then 8 add the predicates with R1.a in Q to Qagg .body; 9 end 10 v ← the variable at the position of R2 .b in Q.body; 11 Replace all u in Qagg with aggregation f (v); 12 return Qagg ; 13 end At this point, the aggregation rewriting is complete. Because the aggregate query Qagg will be evaluated on the a-node, we expand G in the final aggregation rewriting. In Example 4.4.1, we purposely chose disjoint sets of variables for the mapping rules so that the rewriting is fairly straightforward. In the real world, unification 97 requires computing and applying substitutions. For example, given Q1 (x, z) :− R1 (x, y), R2 (y, z) Q2 (a, c) :− R1 (a, b), R2 (a, c) unifying Q(x, z, a, c) :− Q1 (x, y) ∧ Q2 (a, c) requires discovering the mappings a → x, b → y, a → y, c → z, reducing it to a substitution [122] (a, b) → x, c → z and applying the substitution to get Q(x, z) :− R1 (x, x), R2 (x, z). Although the query in Example 4.4.1 does not contain predicates on the aggregation attribute y, the algorithm considers this case. In Step 1, predicates containing variable y are removed to ensure the decomposition. In Step 4, those predicates are put back into to the aggregate query with y replaced by the aggregation f (v). 4.5.4 Analysis The aggregation rewriting algorithm, presented in four steps above, focuses on translating a DAQ into (partial) aggregations to be processed on the a-nodes in the 3-role structure. Existing algorithms for view selection and performing maximally contained rewritings, e,g., the MiniCon algorithm [100] is still required and the complexity follows. Specifically, in Step 2 when the d-node tests a-nodes for the subset of components in partial aggregates, the MiniCon algorithm is applied and the resulting maximally contained queries are used as Qd ; and then in Step 3 and 4, partial aggregates are generated accordingly. Therefore, the general complexity of query translation [1, 51, 100] still applies here3 . Here we analyze the complexity in addition to what caused by the MiniCon algorithm. Unification dominates the complexity of the query translation algorithms for the q-node, d-node and a-node. We first analyze the time complexity of the unification operation: if neither of the two Datalog rules to be unified contains self-joins, the complexity of the unification operation is linear in the number of subgoals in the input Datalog rules. The pseudo-code for unifying two atomic formulas is as follows [33]: B = unify(F1,F2) : returns a unifying substitution; { 3 This is one of the reasons we restrict the query and mappings to be conjunctive and negation-free in its predicates. 98 if (P != Q or k != m) then return(NIL); C = collection containing each of the symbols in different sets; for i = 1 to k do { let E(Ti) and E(Vi) be the sets in C containing Ti and Vi; if E(Ti) and E(Vi) contain different constants then return(NIL); if (Ti != Vi) then merge Ti with Vi in C;} B := {}; for (S in C) { if (S has more than one element) { if (S contains constant symbol X) for (each variable Q in S) add Q -> X to B; else { X = some variable in S; for (Q != X in S) add Q -> X to B;} } } } Unifying the rules requires an additional step to pair the subgoals of the two rules and apply the unify function to each pair. Theorem 4.5.1. The time and space complexity of the unification algorithm are both linear in the number of subgoals in the input Datalog rules, given that the Datalog rules do not contain self-joins. Proof. Suppose the two Datalog formulas P(T1 , .., Tk ), Q(W1 , ...,Wk ) are to be unified. Upon initialization, set C contains 2k elements. The first loop of length k will group the elements into at most k groups (of size 2). So the complexity of unifying two formulas is linear in the length of the formulas. Suppose the body of rule P has m subgoals {R1 , ..., Rm }, and that Q has n subgoals{S1 , ..., Sn }. Assume m ≤ n, there are at most m subgoals in P that match the subgoals in Q. The time it takes to group variables is linear in the number of variables (consider multiple occurrences of the same variable/constant at different 99 positions) in the two Datalog rules, because each time two groups are merged, the group size increases at least by one. A variable is added to one and only one group so it takes linear time to group the variables/constants. Generating the substitution is also linear time. For each group containing a constant, all variables map to this constant. For groups containing no constant, an arbitrary variable in the group is chosen to which all other variables map. Therefore, unification takes linear time. For space complexity, the size of the set grouping the variables equals the distinct number of variables/constants in the two Datalog rules. The substitution size is no larger than the distinct number of variables/constants of the two Datalog rules. Corollary 4.5.2. The time complexity of the rewriting algorithms (Algorithms 1, 2 and 4) for the q-node, d-node and a-node are all linear in the size of the input. Theorem 4.5.3. If self-joins are allowed in the Datalog rules, the time and space complexity of the unification algorithm is exponential in the number of atomic formulas in the two Datalog rules, in the worst case. Proof. Assume without loss of generality that rule P has a length k self-join R()1 , R()2 ..R()k and rule Q has a length l self-join for the same relation R. Then there are max(k, l)!/(|k − l|!) ways to map variables/constants in the two rules, and in the worst case, no two mappings contain each other. If one of k, l is of order O(n) where n is the length of the input Datalog rule, the complexity is Ω(2n ) (more pre√ cisely, it is O( ne−n nn )). If there are O(n) such self-join groups with size each of (or bounded by) a constant B, then the number of ways to map variables/constants for P and Q is O(B!O(n) ) = O(2n ). Therefore, the worst-case time complexity of unification is exponential in the size of the input Datalog rules. It follows that it requires exponential number of rules to represent the result of the unification, so the worst case space complexity is exponential. Note that DAQ processing over the 3-role structure requires query translation over two mappings. The worst case complexity is thus dominated by the complexity of mapping composition, which is NP-hard. The above theorem tells that the hardness comes from the unification operation. Another layer of query translation complexity comes from searching for maximally contained rewritings (e.g., 100 the complexity class in the MiniCon algorithm), which is independent to what is discussed above. In general cases where multiple views (mappings) are available, the MiniCon algorithm is applied first to determine the views (mappings) to use and then the aggregation rewriting algorithm is applied. 4.6 Generalization Above we presented a basic aggregation rewriting algorithm in which the input to the algorithm are conjunctive queries stripped down to only elements required to be transformed into aggregations. In practice, queries are usually more complicated. There can be more joins; not all queries fit directly with mappings (e.g., keys are not explicit in the query body). Also, an a-node does not necessarily contain all the components in the compounds in an aggregation. We describe the following generalizations to the technique described in Section 4.5.3: (1) the 3-role structure has multiple a-nodes, where each a-node covers only part of the compounds; and (2) the original query contains more body relations and variables than the aggregate bindings and decomposition mappings. 4.6.1 Collaboration of multiple a-nodes Frequently, in the discovered 3-role structures, there are multiple a-nodes mapped to the d-node. Figure 4.4 illustrates such a case where S3 and S4 share the same d-node, S2. Consider Example 4.4.1 where we let Q ⊆ Q1 , Q2 ⊆ Q4 — if the answering node contains adequate information, it covers the decomposed components. In the real world, several a-nodes may have to collaborate to compute the aggregation as none of them fully covers the components decomposed from the compounds in the DAQ. Figure 4.4 shows a 3-role structure with 2 a-nodes: S3 and S4 each covers part of the components decomposed from the queried compounds. To show how a DAQ is processed under such a scenario, we modify Example 4.4.1 to form the structure described in Figure 4.4. We change the settings in Example 4.4.1 to include 2 a-nodes S3 and S4, and replace Mapping 4.4.2 with the following two mappings. 101 Mapping 4.6.1. [S3 →d-node] Q3 (r) :− R3 (r, s), r < 5 Q4 (a) :− R2 (a) m2 (r, a) :− Q3 (r), Q4 (r) Mapping 4.6.2. [S4 →d-node] Q3 (r) :− R3 (r, s), r > 3 Q4 (a) :− R2 (a) m2 (r, a) :− Q3 (r), Q4 (r) Under the new setting, processing the same DAQ Q(x, y) :− R1 (x, y), x < 10 in Example 4.4.1 raises two new issues. Assume a database of R1 (x, y) contains 3 tuples, t1 (2, ?),t2 (4, ?), and t3 (8, ?), where ? denotes the y attribute to be computed via aggregation. We can see from Mapping 4.6.1 and Mapping 4.6.2 that neither of the two a-nodes cover the decomposed components on the d-node from t1 to t3 . For duplication insensitive aggregates like min, max and topn, the d-node just performs an aggregation rewriting for each a-node. Otherwise, the d-node partitions the decomposed components and sends each partition to an a-node that may cover it. Continuing with the assumed database for t1 to t3 , the following two partitions of the components (conditioned on a) in Table 4.2 are valid. In Table 4.2, the partition on the left favors S3 and lets it cover as much as possible while the one on the right favors S4. Using a greedy strategy to find the partition, the d-node performs the aggregation rewriting with one a-node. The anode follows Algorithm 3 and will return the unmapped components. Then the d-node sends those not covered by previous a-nodes to the next un-tested a-node until all the components are allocated or all a-nodes are tested. 102 R1 (x) t1 (2, ?) t2 (4, ?) t3 (8, ?) t3 (8, ?) R2 (a) a<2 a<4 a<5 5≤a<8 To S3 S3 S3 S4 R1 (x) t1 (2, ∗) t2 (4, ∗) t2 (4, ∗) t3 (8, ∗) t3 (8, ∗) R2 (a) a<2 a<3 3≤a<4 a<3 3≤a<8 To S3 S3 S4 S3 S4 Table 4.2: Two valid partitions favoring S3 and S4 When multiple mappings exist from one a-node to the d-node the situation is similar to the multiple a-nodes case. In many cases the mappings map disjoint sets of components on the a-node but sometimes they overlap. For the latter case, a partition across the mappings is usually needed. The difference from what we do for the multiple a-nodes case is that the d-node just sends the data to the a-node where the partitioning is performed. With multiple a-nodes cooperating in an aggregation, the query processed at each a-node is a partial aggregation and they need to be merged when received by the d-node. For example, the partial aggregations for t3 in Table 4.2 need to be merged. Many aggregation functions can be computed incrementally from partial aggregations: sum, min, max and topn are straight-forward; avg, stddev can be maintained with count computed together with the partial aggregation. Other aggregations like median and quantile can be approximated. 4.6.2 Preprocessing for aggregation rewriting A query on the q-node is not usually in the form necessary for the algorithm in Section 4.5.3. Such a query needs to be preprocessed before performing aggregation rewriting. Without loss of generality, we consider preprocessing for cases where the head of the query contains a mixture of attributes where some matches aggregate bindings and others do not. Also we consider the original query to have subgoals that are not related to aggregations. The general preprocessing strategy is to break the original query into smaller DAQ queries, process them separately and then use lossless joins to compose the answer into an answer to the original query. Normally in a relational database for a compound, one row represents one compound instance and there is a key identifying each row. Suppose the cell schema 103 is Cell(cid, cname, type, damage) where the “type” of the cell can refer to one of “utility” (u), “residential” (r), and “office” (o). The semantics of query Q( r , y) :− Cell(cid, n, r , y) is “compute the damage for each residential cell”. Therefore, the correct grouping is on “cid” instead of “type”. The aggregation rewriting checks whether at least one key is included in the query and modifies the head of the query if necessary; e.g., the head is changed from Q( r , y) to Q(cid, r , y). The head of a general query contains attributes that are not the grouping variables. In the Datalog syntax for aggregation (Section 2.1.2), we require that all non-aggregation variables in the head of an aggregate query are groupings. Because the original query is not an aggregate query, it may contain non-key variables in the head. Also, the body of the query may contain some relations not mapped to the d-node. We can pre-process a general DAQ so that after preprocessing the algorithm in Section 4.5.3 can be applied. The solution is to use the key of the relations that appear in the query to decompose the original query, then perform the aggregation rewriting on subgoals that are aggregation answerable and finally join the results back. Consider a general DAQ query in the following form: Q(x, y) :− R1 (x1 , y1 ), R2 (x2 , y2 )... , where x are variables that can be computed by aggregation and yi s are not. Let ki be the key for Ri ; we decompose each subgoal Ri using ki so that the body becomes Ra1 (k1 , x1 ), Rs1 (k1 , y1 ), Ra2 (k2 , x2 ), Rs2 (k2 , y2 )... , (4.6.1) where the superscript a (for aggregate) indicates that this subgoal is for aggregation rewriting, while those s (for SPJ) subgoals will stay on the q-node. Each Rai subgoal, it is further decomposed to Rai (ki , xi ) :− Ri (ki , xi,1 ), Ri (ki , xi,2 )... , i.e., a join of subgoals each containing one x variable. It is easy to verify that the aggregation rewriting correctly processes each Qi (ki , xi, j ) :− Ri (ki , xi, j ), pred(ki , xi, j ). 104 Also, both the decomposition of Q.body and Rai are lossless, which guarantees the correctness of joining the answers to find the answer to the original query. 4.7 Evaluate the Query Rewriting Algorithms The data set we use to evaluate the aggregation rewriting algorithm is from JIIRP [64]. We generated a synthetic data set is generated by scaling up the JIIRP data set. JIIRP’s infrastructure damage assessment data set contains 31 cells defined on 364 buildings; 95 pipelines defined on 665 pipeline segments and 10 roads defined on 445 road segments. The cell, pipeline and road databases are the q-nodes and the building, pipeline segment and road segment databases are the a-nodes. The d-node is a GIS database with 3 complete map layers for the buildings, pipelines and road segments respectively. Our synthetic data has the following properties: 1. The 3 GIS map layers were exported to 3 relational databases; each represents a d-node for one type of component; 2. We generated scenarios with 50 − 500 data sources; 3. We generated compounds consisting of many components, so that multiple a-nodes were needed to process a DAQ; The a-nodes were assigned data and mapping rule sets as follows: we ran a selection with a randomly chosen predicate (e.g., id > 3) on the complete JIIRP database; the selected data set formed the database for the a-node. We added the predicate to the mapping rules so that an a-node maps to a d-node. This is consistent with JIIRP, where many groups maintain and map small databases to the GIS. We distributed both real and synthetic data in WestGrid [125]. The cluster that we used has 840 dual core CPUs (3.0GHz Xeon, linux 2.6 32bit). We only simulated up to 500 data sources because typically only around 200 can be allocated at one time. We used WestGrid’s network file system to pass messages between cluster nodes. Our file system test showed that WestGrid performs quite stably under simultaneous access. 105 4.7.1 Evaluation of the aggregation rewriting We tested the performance of the aggregation rewriting algorithm using different mapping sizes. We scaled up the original JIIRP schema by creating long rules with chain queries (e.g., a join R1 (x, y), R2 (y, z) is a chain query of length 2) with no self-joins. We varied the number of subgoals in the generated rules from 5 to 45. We also scaled the length of a relation (i.e., the number of attributes) to 10, which is large enough for practical use. The queries we used are SPJ queries with 3 joins and an additional arithmetic comparison of the form var const, where is one of < or >, e.g., “x > 5”. Figure 4.5 shows the total time needed for an aggregation rewriting on a 3-role structure with a d-node to a-node fan-out of 30. For a fixed relation length, query translation time increases linearly with the size of the first two rules in a mapping. This matches our analysis in Section 4.5.4. Overall, aggregation rewriting was very fast; all cases finished within a second. 1000 relation length=2 relation length=4 relation length=6 relation length=8 relation length=10 total translation time(ms) 900 800 700 600 500 400 300 200 100 0 5 10 15 20 25 30 35 size of the first two rules in a mapping 40 45 Figure 4.5: Query translation time v.s. mapping size 4.7.2 Impact of mapping rule sizes Next, we studied how the size of mapping rules on the q-nodes, d-nodes and anodes (q-rule size, d-rule size and a-rule size respectively) affects the aggregation rewriting time. We independently increased the mapping rule sizes from 5 to 45. 106 We timed each of the 93 configurations, and repeated this for relation lengths from 2 to 10. The results for relations with 10 attributes are reported in Figure 4.6; in each sub-figure one of the three rule sizes is fixed at 45 (e.g., the q-rule size is fixed in Figure 4.6(a)). Figure 4.6 shows how the three parameters dominate each other. 1100 1000 200 1000 time(ms) time(ms) time(ms) 400 900 800 500 0 40 20 10 10 a−rule size 20 30 40 d−rule size (a) fixed q-rule size 40 30 20 10 10 a−rule size 20 30 30 40 20 10 d−rule size q−rule size (b) fixed d-rule size 10 20 30 40 q−rule size (c) fixed a-rule size Figure 4.6: Testing how the size of the q-rule, d-rule, and a-rule affect the rewriting time Figure 4.6(a) and (c) show that the d-rule size is a dominating factor in the aggregation rewriting time. We can see that for a fixed d-rule size, changing the q-rule size or the a-rule size does not greatly affect the total time needed by the aggregation rewriting. Figure 4.6(b) suggests that the a-node size and q-node size do not dominate each other. The overall trend of Figure 4.6 follows the expectation that larger rules lead to longer aggregation rewriting times. As in shown Figure 4.5, aggregation rewriting always finished in under one second. Our query rewriting experiments show that query translation for the aggregation rewriting can be performed very efficiently; aggregation rewriting potentially scales to many queries and complicated mappings. 4.7.3 The benefit of “fetch on demand” As described in Section 4.5, data transmission is “fetch on demand”. This allows the integration network to avoid sudden “peaks” in network load when data chunks are sent. We compare “fetch on demand” with “push” query processing to show how “fetch on demand” smooths out peak network loads for nodes (especially the d-node) with large fan-in. Figure 4.7 compares the two data dissemination methods’ networking overhead for a d-node. 400 concurrent queries are fed to 40 q-nodes in batches of 10. “Push” 107 push data load fetch on demand load 50 #queries 40 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 110 time(sec) Figure 4.7: Fetch on demand’s smoothing effect results in peaks when translated queries are sent from the q-nodes. Under “fetch on demand” the network load is smoothed to a more stable, lower overhead. As the d-node is central in query processing, especially for the general case where multiple a-nodes process one DAQ, fetching on demand efficiently avoids possible network bottlenecks. This is especially useful in disaster management where the network backbone is often a bandwidth-limited, sharing-based wireless network. Next we investigate the query optimization problems for DAQs. DAQ query answering is designed to be carried out in two phases. In phase 1, we investigate ways to optimize selection of a-nodes to contribute in returning answers compatible with the relational model. Optimizations for phase 1 focus on lowering processing overhead and speeding up query answering; while for phase 2 it focus on efficient computing of DAQ answer distribution. 108 5 DAQ QUERY OPTIMIZATION In Chapter 4 we proposed the decomposition aggregation query (DAQ) query answering framework and described the query translation algorithm. We proposed a two-phase query answering scheme (Section 4.4.1) to fit the new semantics of DAQ answers but deferred the details of the two-phase query answering and optimization to this chapter. Although the aggregation query rewriting algorithm ensures that a DAQ can be correctly answered by aggregate queries, the SemIS needs an systematic and efficient way to carry out the query answering process and the ability to return desired query answering results beyond just scalar aggregate values because the semantics of a DAQ answer is now a distribution. The efficiency requirement demands not only fast processing of a DAQ and finding correct answers but also optimal use of computational resources — in the PDMS case, to use only necessary data sources and to minimize the communication overhead. As we will see in Section 5.3, selecting an appropriate set of data sources to compute the scalar answer to a DAQ helps to optimize the phase 1 answering. In Section 5.4, we propose a sampling based scheme to estimate the DAQ answer distribution. The optimization for phase 2 focuses on using as few samples as possible to output high quality distribution estimation. The two-phase query answering interleaves with the aggregation rewriting operations as illustrated in Figure 4.2. The phase 1 optimization is actually carried out before the aggregation rewriting takes place. The optimization operations and 109 the aggregation rewriting process together make the full DAQ processing not only semantically correct but also efficient and scalable. Additionally, we find that the method we developed for phase 2 optimization can be generalized to benefit query optimization beyond the scope of DAQ processing. After we summarize related work for the DAQ query optimization in Section 5.2, the rest of this chapter is divided into two parts for phase 1 and phase 2 query answering and optimizations respectively. In Section 5.3, we describe the source selection optimization problem for phase 1 answering and our solutions. Section 5.4 describes the phase 2 query answering and optimizations. Empirical studies are presented inline with Section 5.3 and Section 5.4. 5.1 5.1.1 Preliminaries Hierarchical aggregate processing Whether they handle aggregations or not, most SemISs (e.g., PDMSs) use a hierarchical method for query processing: queries are processed at different sources, and the answers are propagated back to the original querying node. To process aggregations, the natural choice is to extend this by using the hierarchical aggregation scheme in parallel and distributed data processing platforms (e.g., the map-reduce [34] and sensor network architectures [79]). In such a scheme, the data sources are treated as if they were in a hierarchy rooted at the data source that initialized the query. Partial aggregate queries rewritten from the original aggregate query are sent to data sources in the SemIS along the schema mappings between data sources. Figure 5.1 shows such a hierarchy formed by 5 data sources and rooted at data source A. Treating these sources as a hierarchy distributes the aggregation load between data sources and makes best use of local computation so that the communication overhead is reduced. 5.1.2 Bootstrap sampling Bootstrap sampling, or bootstrapping, is a resampling method to improve an estimate’s quality, especially when the original sample size is small. Bootstrapping 110 A 150 id 4 value 50 C B 70 30 D id 2 3 E value 10 20 id 1 5 value 30 40 Figure 5.1: a hierarchical aggregation scheme with 5 nodes: nodes B, D, E sends partial sum aggregates up the hierarchy; the full aggregate (for ids 1-5) is computed at root node A. begins with a (usually small) sample set S; the system draws samples from S to obtain n, (n > 1) resamples (B1 , . . . , Bn ). The estimator is then applied to the resamples. The output — a set of estimates E1 ..En — in many cases helps to derive a tighter confidence interval for the estimate. For example, let Ei be the mean of Bi for a set of resamples of S. Bootstrapping uses the median of the Ei s as the estimate to the mean of the distribution. Bootstrap aggregating (bagging) [18] is a popular ensemble method that uses bootstrap resampling to improve the performance of regression. We use bagging with kernel density estimation (KDE) to obtain a smooth and stable probability density function for the viable answer distribution. 111 Extract distribution statistics Unbaised Sample I The UniC and UniS Sampling II minimize sample BootStrapping size for desired & Bagging confidence interval III Density Estimation N VI Stability Analysis 3. CoverageVII Interval Optimizer Y Multi-mode ? IV Confidence Interval for point estimates Output : aggregate answer statistics Assess how dist. changes on source removal 2. 1. V Mode seeking Figure 5.2: The workflow of estimating viable answer distribution 5.1.3 Kernel density estimation A probability density function (PDF) describes how likely it is that a variable occurs at a given point. Kernel density estimation (KDE) estimates the PDF of a distribution using a sample set drawn from the distribution. Let the sample set be {xi }. The basic idea of the kernel method is to measure a sample point (xi )’s contribution to the distribution density function using a function K((x − xi )/h), where K is called the kernel and h is called the “bandwidth”. The bandwidth is a parameter that controls the “localness” of a point’s impact on the distribution. A large h will give a smooth density function but is likely to under-fit, whereas a small bandwidth fits better on the sample points but is likely to over-fit. When kernels are applied ∑n1 K( xi −x h ). Usually we use the same kernel function K on all the points. In our solution, we use to all xi ’s, then the density function is estimated by f (x) = 1 nh 2 Gaussian kernels (K(x) = x √1 e− 2 2π ) centered at each point in the sample set. Other popular kernels include the Epanechnikov kernel (K(x) = 34 (1 − x2 )1|x|≤1 ) and the 35 Triweight kernel ( 32 (1 − x2 )3 1|x|≤1 ) [54]. Selecting an appropriate h value is a challenging problems widely studied in KDE literature. In [16] the authors described an adaptive method to automatically choose a good h value depending on the sample set fed into the KDE process. KDE can be used together with bagging. In our case, we perform density estimation for each bootstrap sample set and use the normalized point-wise mean 112 of all the estimates as the density function for the viable answer distribution. 5.1.4 Distance measures for probability distributions To compare two probability distributions, a measure is used to quantify their “distance”, i.e., to measure how different two distributions p and q are. In our case, we want to quantify the changes on viable answer distribution under different data source settings. Thus a distance function would provide a direct and compact measure. Several such distance functions have been proposed. For example, the L2 measure DL2 = (p(x) − q(x))2 dx uses the pairwise Euclidean distance and the Bhat- tacharyya distance [14]. DB (p, q) = p(x)q(x)dx uses the integral of point-wise product of the two distributions. Other widely used distances are Kullback-Leibler divergence (KL-divergence) [15] and the earth-mover-distance (EMD) [71]. As we will see in Section 5.4.5, not all distance functions can be efficiently computed for our stability analysis and the computability largely depends on the mathematical property of the distance function. Later we show that efficient computation exists for stability scores measured by L2 and Bhattacharyya distance functions. 5.2 Related Work Resource selection is studied in information retrieval [91]. In an IR context resource selection is usually used to route queries to resources that are likely to contain required documents. Solutions to the resource selection problem often involve estimating the probability of a hit [110]. [124] also applies a greedy algorithm for source selection. However, our semantic integration setting is very different from IR and the purpose of selecting sources differs fundamentally; therefore the IR approaches do not apply to our problem. The work in [3] also allows a range of viable answers. In a data exchange/semantic integration setting, a component to be aggregated can take different values — possible databases in data exchange and multiple contributing sources in semantic integration — resulting different yet viable answers. The set covering problem (SCP) is a classic NP-hard problem [96]. Many 113 solvers have been developed, including greedy heuristics [24], local search based methods [59], branch and bounding [8] and probabilistic methods [13]. Since SCP is hard to approximate, exponential algorithms also exist [32]. Usually exponential algorithms achieve better results at the cost of higher time/space complexity. However, for query processing we usually cannot afford the time complexity and have to make trade-offs with fast and sub-optimal algorithms, e.g., using the IGJB [59] heuristics. The maximum-clique problem is a well studied NP-hard problem. In [96], the authors survey complexity results, popular approximate methods, and exact maximum-clique algorithms. Various maximum-clique solvers include local search [66], constraint programming [105], branch-and-bounding [94], and greedy heuristics [61]. We used a sequential greedy algorithm [96] in our implementation to exploit the speed of the greedy heuristic. While our MCI implementation uses equi-depth histograms [56], more complicated histogram structures could be applied. Implementing a good MCI representation closely connects to join-size estimation [49], which includes sampling methods such as bifocal sampling [47]. Research in wireless sensor networks [35, 62, 78, 79] presents a similar setting for computing aggregates using distributed data sources. In a sensor network, sensor motes form an ad-hoc network and collaborate to transmit their sensor probe readings to a centralized repository that is usually beyond the range of a single mote. This is performed by transmitting data in a hierarchical aggregate network rooted at the central repository. Although the hierarchical aggregate network has a lot in common with sensor networks, the operations are essentially different. We focus on processing aggregate queries on the hierarchy where a query usually requests a (small) part of data in the databases maintained in the distributed data sources; in the sensor network case, sensor motes have to upload all their data to the central repository. This results in different optimizations on the data transmission. We face the source combinatorial explosion and use sampling to make estimations, while a sensor network optimizes routing, transmission cost and seeks load balancing among battery powered sensor motes. In [92], the authors described a protocol to adaptively request data from remote sensors over time so as to control the error of approximate answers, while our work focuses on the snapshots of 114 answer distributions when an aggregation is processed. The stability score helps maintain continuous queries but is not a direct approximation measure for a query. Research in databases with uncertainty is also related to the work we reported here. There exist a number of models including uncertain databases [89], possibilistic [107] and probabilistic [63] databases, and inconsistent databases [6]. In data exchange, aggregate semantics is discussed with possible worlds [3]. Among the various models, the uncertain database [89] is closest to our setting: it also uses the relational model and the value of an attribute is a discrete distribution with positive probability on a finite set of values. In our setting, different values on different data sources can be transformed to values in a similar distribution, thus the semantics of processing aggregate queries under the two settings becomes very close. In [89] the authors discussed processing aggregate queries on uncertain databases where aggregates uses expectations of values in computation. Their work avoids the combinatorial explosion simply by disallowing exhaustive aggregates and does not give information about the viable answer distribution. Moreover, in our semantic integration setting, computing the expectation of all component values is simply impractical as it requires collecting all single values from all data sources. Also, in our stability model, removal of one source will result in invalidating all components’ values on that source. Therefore, the techniques cannot be applied to processing aggregates and providing distribution information as we need here. 5.3 5.3.1 Query Optimization for Phase 1 Answering Motivating source selection When multiple data sources are available to process a DAQ, it becomes very important to select the right set of data sources in query processing. Unlike in answering “open” queries where the more sources to use the better (more complete) the answers are, for targeted queries like DAQs, a good answer comes only from a carefully selected set of data sources. We consider the source selection step as an important query optimization task in answering DAQ. The following is an example: Example 5.3.1. Figure 5.3 shows a scenario where a DAQ, qa , is processed with 6 data sources where A is the q-node, B the d-node and C– F are 4 a-nodes. 115 With the help of B, a query qa on an object in A is answered by an aggregation “sum(value)” over the 5 components and where the required values can be found on sources C – F. The question is how the a-nodes should be used. Figure 5.3 shows one solution; B chooses C and D, which then process two partial aggregates, qc and qd . The answers to the partial aggregates are merged at B, thus answering the DAQ. One topic this work considers is how to appropriately select data sources to process an aggregation. Mappings: Rb (x, y):-R Rb (x, y):-R Rb (x, y):-R Rb (x, y):-R (_, x, y), y<2002 (_, x, y), y>2002 (_, x, y), 2000<y<2008 e (_, x, y), y>2004 f c d C Rc q c(sum(x)):- Rc(x,y,z), z <2002 D B A Rd q d (sum(x)):- R (x,y,z) , z >2002 d DAQ q a Rb (x,y) Schema: R c,d,e,f (value, id, year) Rb ( id, year) E Re F Rf Figure 5.3: A DAQ processing scenario Data instances at C-F E C B Value ID year Value ID year ID year 100 B001 2000 165 B002 2001 B001 2000 150 B002 2001 220 B003 2004 B002 2001 330 B004 2006 B003 2004 B005 2007 B004 2006 Value ID year B005 2007 200 B003 2004 440 F Value ID year 5 components to be aggregated 300 B004 2006 310 B004 2006 400 B005 2007 420 B005 2007 D Figure 5.4: Data instances for Example 5.3.1 116 Processing a non-aggregate query in a semantic integration architecture generally requires translating the query to all available data sources and returning the union of all the answers. This widely accepted method for SPJ queries must be reconsidered for aggregate queries since using all data sources may introduce duplicated components to the aggregation, which introduces error. E.g., in Example 5.3.1, the preferred solution is to use either data sources {C, D} or {C, E} to answer the aggregation; adding additional sources only adds undesired duplications. Using the mappings shown and choosing data sources C, D to answer the query, the query sent to C is qc (sum(x)) :− Rc (x, y, z), z < 2002; the query for D is qd (sum(x)) :− Rd (x, y, z), z > 2002. As shown in this example, effective source selection both lowers overhead and improves answer quality. We generalize these points to the following observations: 1. Only a subset of data sources is needed and desired to to answer an aggregation. 2. Multiple combinations of data sources and assignments of components are often available. These combinations may result in different answers. 3. Duplicates can exist even with carefully selected source combinations. Observation 1 demands the development of query answering techniques other than the exhaustive search used for traditional select-project-join semantic integration queries. Since adding more sources may negatively affect query answering, source selection is required to process aggregate queries. Observation 2 shows another issue that increases the complexity of aggregate query processing. Theoretically, any combination of data sources covering the requested components can be used for the aggregation. E.g., if the aggregation uses data sources C and E, the sources have different values for “B002” (150 and 165 respectively); either can be used. In general, for an aggregation of n components, where each component has K data sources covering it, the space of “possible” aggregation values is as large as K n . In practice, we only return one value for an aggregation because the semantics of an aggregate query requires returning one value, and it is computationally hard to calculate all possible aggregation values. Observation 3 suggests that error may still exist in answers. As we explain in Section 4.4.1, there are several factors that introduce error to query answering. Making good selections can reduce duplicates and enable better approximate an117 swers. As shown in Example 5.3.1, a new challenge in answering aggregate queries is the combinatorial complexity — the number of possible combinations of data sources to answer an aggregation is often too large to enumerate in practice. We answer the following questions. 1. What is the right semantics for aggregate queries, given that choosing different source combinations can yield different answers? 2. Among the large number of possible combinations, how to choose data sources so that high quality answers can be obtained while query processing costs can be kept low? 3. What information should we provide on the distribution of answers from different source combinations? 5.3.2 The source selection optimizations Figure 5.5 extends the DAQ querying processing diagram in Section 4.4.1 to show the workflow of the two phase query answering. In this section, we mainly discuss the optimizations for phase 1 query answering and the major components of the query optimization are highlighted in the diagram. The task of phase 1 answering is to select an appropriate set of sources to cover the components decomposed from the compounds in the DAQ, and then obtain an aggregate answer to return. The source selection process is defined as following: Definition 5.3.1. [source selection] Given an aggregation Q, a set of data sources A, and a constant K, the source selection outputs up to K sets of data sources S = {s1 , . . . sK }, si ⊆ A, and each si , called a cover, can be used to answer Q. As depicted in Figure 5.5, first, a set of data sources is selected to form a cover of the target aggregation (step I). Then, duplication in the selected cover is estimated (step II) and eliminated by partitioning (step III). We optimize the source selection steps so that the phase 1 answer is computed quickly. Step I (source selection) has three inputs: (1) a set of candidate data sources, (2) the mapping coverage information (MCI), and (3) the components to aggregate. Step I outputs up to K sets of covers, where K is a constant for phase 2 sampling. 118 Decomposition Input: Data sources MCI Components I. Find Source Covers Output: Selected sources (up to K groups) N Y II. Decide for partition? Output: selected sources III. Partition Output: selected sources w/ partitioned sets IV. Rewrite query with mappings V. Rewrite query with partitioned sets phase 1 answering Process aggregation with selected anodes Output: modify K fast return, scalar value K full aggregates Sampling DAQ answers phase 2 answering estimate the DAQ answer distribution Figure 5.5: Overview of the source selection flow. The source selection is positioned between decomposition and query rewriting in DAQ processing. Step III partitions the covers obtained in step I to reduce duplication among the selected data sources by adjusting the ranges of the components to be included in the partial aggregates for the selected data sources and ensures that they do not overlap. For example, partitioning the data sets {C, E} in Example 5.3.1 determines that only “B001” in C should be used, thus removing duplication. Steps IV and V both rewrite aggregate queries to process on the selected data sources. The inputs to the two steps are slightly different and they contribute differently to query processing cost. The output of source selection is fed into the query rewriting algorithm where (partial) aggregate queries for data sources selected in the covers are prepared. For a data source ai with a set of components E as the partitioned results, the following additional operations are taken, compared with 119 rewriting directly with mappings: 1. E is sent to ai together with the rewritten query; 2. an additional join is performed on ai to compute aggregation for components only in E. Cover-finding outputs K sets of sources each covering the components to be aggregated. Aggregate answers computed from these K sets are needed for the phase 2 query answering. The overall optimization goal is to select sources and process the (rewritten) aggregate queries quickly and use minimal computational resources of the distributed system. This includes minimizing the overhead for query rewriting, communication and query processing on the a-nodes. We first introduce MCI, a statistics maintained to let the d-node quickly determine an a-node’s coverage to the aggregation. 5.3.3 Mapping direction and MCI In practice, mappings are often directional. Figure 5.6 compares two mappings, M1 and M2 , in different directions. ID year ID year B001 2000 B001 2000 B002 2001 B002 2001 Value ID year B003 2004 B003 2004 200 B003 2004 B004 2006 Value ID year B004 2006 300 B004 2006 B005 2007 200 B003 2004 B005 2007 400 B005 2007 Rd -A D B Rb ( id, year) D B Rb ( id, year) R d (value, id, year) Mapping M 1 : Q1 (x):-R d (v,x,y), y>2002 Q2 (x):-R b (x,y) m(t):-Q 1 (t), Q 2 (t) Rd -B R d (value, id, year) Mapping M 2 : Q1 (x):-R b (x,y), y>2002 Q2 (x):-R d (v,x,y) m(t):-Q 1 (t), Q 2 (t) Figure 5.6: Two different-direction mappings The direction of a mapping is reflected by the order of the mappings rules in the 3-rule system. In Figure 5.6, the first rule (Q1 ) indicates the source schema and the second rule (Q2 ) is written in the target schema. The semantics of a directional 120 mapping is “for all tuples satisfying Q1 in the source schema, there exist mapped tuples satisfying Q2 in the target schema.” We can verify that both data instances Rd -A and Rd -B are consistent with M1 , but only Rd -B is consistent with M2 : in the latter case, the tuples (“B003”, “B004”, “B005”) satisfy Q1 of M2 , and thus must have their mapped tuples in the target instance D, but Rd -A lacks them. It is easy for B to use M2 , for which the direction is from B to D, to rewrite aggregation queries. B can determine the set of components mapped in D by evaluating the mapping M2 locally. In the reverse mapping, M1 , B cannot determine the data instance on D locally — both Rd -A and Rd -B are possible databases. Unfortunately, in DAQ processing there are more mappings like M1 than M2 . This is because B works as a decomposition node — a “standard repository” for data sources to map to while an answering node (D) adopts the open world assumption and provides only partial coverage to the aggregation. Selecting a source requires knowing both what instances can exist on a data source and also what instances actually exist. To let the node of the target schema know the database on the source schema, we define mapping coverage information (MCI): Definition 5.3.2. [MCI] Mapping coverage information (MCI) is parameterized by triple (m, σ , τ) where m is a mapping from source σ to target τ. The mapping coverage of a database D on τ, denoted by MCI(D), is a structure that can be queried by τ to return how tuples on τ are mapped by tuples in σ . This definition specifies the functionality but not the implementation of the MCI. A trivial MCI implementation is to query σ every time with D and return the set of mapped tuples in τ. Another implementation is to apply D on the inverse of m, but it can be difficult to obtain the inverse mapping. We propose to use a “coverage ratio” that tells the percentage of tuples in τ being queried that are mapped by tuples in σ ; and use histograms to support efficient querying of the MCI. See 5.3.4 for the details of the implementation. The MCI of data sources are queried for the cover-finding step. Depending on the components to be aggregated, the MCI of a data source changes from aggregation to aggregation. As shown later, our source selection algorithm prefers sources with high coverage ratio. Next we give an implementation of the MCI. 121 5.3.4 An implementation to MCI The definition of MCI in Definition 5.3.2 does not specify how the mapping coverage information is represented. Our MCI implementation uses a “coverage ratio” to measure the number of mapped tuples. Definition 5.3.3. [coverage ratio] Let the database on source schema (σ ) be Dσ . Let |Q[D]| denote the number of results when evaluating rule Q over a database D. The coverage ratio is defined as cr(M, Dσ ) = is the first rule in M and D∗σ |Q1 [Dσ ]| |Q1 [D∗σ ]| where M is the mapping, Q1 is the union of all possible source databases consistent with M. For the example on the left hand side of Figure 5.6 where D is the source schema and B is the target schema, |Q1 [D∗σ ]| = |Q2 [Dτ ]| = 3 because there are 3 tuples (B003, B004, B005) in target instance Rb that can be mapped by M. If we take Rd -A as the data instance on D, then by Definition 5.3.3, cr(M, Rd -A) = 1/3; and if Rd -B is taken, cr(M, Rd -B) = 1. The above coverage ratio (Definition 5.3.3) can directly be used as a stand alone MCI implementation where the value is determined solely by the mapping M and the data source. However, this implementation only gives a coverage ratio for all the data on a data source; thus is not very useful because it seldom requires to aggregate all the components on a source. Therefore, we need an MCI representation that both enables querying with a desired component set and avoids excessive processing overhead. Querying the coverage ratio for a given set of components is analogous to join size estimation. A histogram is a powerful summarization tool often used in join size estimation [49, 60, 99]. In a basic version of join size estimation where two relations R and S are to be joined on attribute a, histograms are built on the domain of a and partition the range of a into a N buckets. The counts of tuples in R and S that fall in the corresponding buckets are recorded by the histograms. Given a range of tuples to be joined in R, the histogram can be queried to estimate the number of tuples in S that join with R. Normally, the more buckets a histogram has, the more accurate an estimation can be made. We find that a similar technique can be used to implement the MCI. We can translate an MCI query, w.r.t. a mapping M into a join size estimation query as follows: 122 1. the components from the source and target schemas satisfying the first 2 rules Q1 and Q2 of M map to R and S respectively for a join size estimation. 2. the third rule m of M maps to the join condition. Given a mapping M(σ , τ) where σ denotes both the source schema and the data source, and τ denotes both the target schema and the data source, we build equidepth histograms [56] Hσ and Hτ on σ and τ respectively for components mapped by M. Then we copy Hσ to τ to enable local querying of the MCI at τ. Given a set of components in an aggregation, the number of components mapped (joined) in the source schema is estimated using the histograms. This result of join-size estimation functions as |Q1 [Dσ ]| in Definition 5.3.3; the number of components in the aggregation and also satisfying Q2 of M (computed by querying the local database or using Hτ ) functions as |Q1 [D∗σ ]|. Thus we can compute the coverage ratio. Since the join size estimation is performed locally, the cost of computing coverage ratio for a set of components to aggregate is very low. The maintenance overhead occurs when the database on the source schema is updated, in which case the corresponding histograms are updated and re-synchronized to the target schema. Estimating with histograms is quite accurate when the number of components in the aggregation is large and especially when these components form a continuous region in the histogram. When either the number of components to be aggregated is small or the components are extremely “scattered”, estimating with histograms produces large errors. In these cases we query the candidate data source at runtime to obtain more accurate coverage. 5.3.5 Optimize the cover-finding step The first source selection step is finding a set of sources to cover the aggregation’s components. An optimization problem immediately follows: what is the best cover? An immediate and intuitive answer is to use “the cover that produces an accurate query answer.” By Definition 4.4.1, any answer in the interval [inf(V ), sup(V )] is valid and thus accurate. Assuming there is no extra information guiding the system to prefer one data source to another, any two viable answers are equally 123 accurate. In phase 1 we choose to answer an aggregation using the fewest number of data sources as possible, which saves the cost of querying extra data sources. Also, if one set of sources has fewer duplicated components than another, it is preferred because duplications generally introduce error. A natural question is why minimize duplication in the cover-finding step when a separate partitioning step later reduces duplication. The consideration is twofold. First, if the selected sources contain substantial duplication, partitioning is likely to adjust more data sources to reduce duplication, which causes an increase in the communication cost to send translated (partial) aggregates to the data sources. Second, when views with aggregations are in the mapping, e.g., mappings in [27], we may not be able to freely adjust the range of components in a partial aggregation. In this case, the query must be processed using the path Step I→ Step II→Step IV. Therefore, minimizing duplication in cover-finding step is still desired. We can use a cost function defined on 2A → R to capture the above mentioned optimization goals, where A is the set of available data sources. Equation 5.3.4 and Equation 5.3.5 give two such cost functions: Definition 5.3.4. [min-source cost function] Given a set of data sources, A = {a1 , a2 , . . . , an }, and a set of components, S = {s1 , s2 , . . . , sn }, for aggregation, a min-source cost function, cost1 : 2A → R, is defined as cost1 (V ) = |V |, ∪ai ∈V ai = S; ∞, ∪ai ∈V ai ⊂ S. (5.3.1) where V is a set of selected data sources, and ai denotes both data source i and the components it covers. Definition 5.3.5. [min-duplication cost function] Given a set of data sources A = {a1 , a2 , ..an } and a set of components S = {s1 , s2 , ..sn } for aggregation, a minduplication cost function, cost2 : 2A → R, is defined as cost2 (V ) = ∑ai ,a j ∈V,i< j |ai ∩ a j |,∪ai ∈V ai = S; ∞, ∪ai ∈V ai ⊂ S. (5.3.2) where V is a set of selected data sources, and ai denotes both data source i and the 124 components it covers. In Example 5.3.1, cost1 ({C, D}) = 2, cost2 ({C, D}) = 0, and cost1 ({C, E}) = 2, cost2 ({C, E}) = 1. Thus, both sets are equally good according to the min-source measure, but {C,D} is better according to the min-duplication measure. Unified by the above cost functions, source selection optimization searches for a subset of the candidate sources that covers the required components in the aggregation and minimizes a given cost function. We define it as follows: Definition 5.3.6. [cover-finding problem] Given a set S of components to be aggregated, a set of data sources A in which each data source covers a set of components in S and a cost function c : 2A → R, the cover-finding problem selects data sources that cover S and minimize c. The following theorem shows the complexity: Theorem 5.3.1. The cover-finding problem is NP-hard. The NP-hardness can be proved using a reduction from the weighted set covering problem. A formal definition of the weighted set covering problem, e.g., in [55] is: Definition 5.3.7. [weighted set covering problem (W-SCP)] Given a set S, a set A of subsets of S and a cost function cost() : C → R, the weighted min-cover problem finds a cover C ⊆ A for S such that cost(C) is minimized. Proof. The reduction from a W-SCP instance to a cover-finding instance is straightforward. Thus, the cover-finding problem is NP-hard. The connection also suggests the use of W-SCP solvers on the cover-finding problem. We transform an instance of cover-finding to an instance of W-SCP as follows. The W-SCP instance contains Sscp (the set to cover), Ascp (the subsets to use), and costscp (the cost function for the weights). Let ai denote both a data source and the set of components that can be covered by data source i: 1. Ascp = {ai |MCI(ai ) > η, ai ∈ A}. We use only data sources whose MCI coverage exceeds a minimum threshold η, where η is a constant (e.g., 0.9). 125 2. Sscp = ∪ai ∈Ascp ai . 3. costscp = c, where c is a cost function like cost1 (Equation 5.3.1) or cost2 (Equation 5.3.2). The above construction of Sscp ensures that at least one cover exists. The constructed Sscp in (2) above may contain fewer components than the set S of components to be aggregated. If this happens, it means that the data sources with good coverage (coverage above η) do not cover all the components to be aggregated. It can be the case that components in set Sl = S − Sscp are covered by some data sources in the remaining set Al = A − Ascp — some sources with low coverage, or the missing components are not covered by any data sources. Our strategy is to run the W-SCP solver over the constructed W-SCP instance (Ascp , Sscp and costscp ) to find the top K covers; then we query the data sources in Al for components in Sl and run the W-SCP solver again for one best cover of Sl using the data sources in Al . Let {vi }, i ∈ 1..K be the top K covers output by the solver after the first run, and vl be the cover found in the second run, the finally selected K covers to process the aggregation is V = {vi ∪ vl , i ∈ 1..K}. Because the two runs of the W-SCP solver execute with disjoint inputs, it is easy to verify that if vi is an optimal solution for Ascp and vl is an optimal solution for Al , then vi ∪ vl is an optimal solution for A. We use the IGJB solver [59] for cover-finding. Section 5.5.1 shows that IGJB is fast and finds high quality covers for the tested datasets. The original IGJB is modified to suit the need of being a W-SCP solver. 5.3.6 Modifications to the IGJB algorithm We modify the IGJB solver [59] for the cover-finding problem. The IGJB solver uses local search heuristics. It starts with a greedy cover solution; the local search process is controlled by simulated annealing (SA). The local search process iteratively“destroys” the current solution and greedily “re-constructs” a hopefully improved new cover. We modify the SA process of IGJB so that the parameter to use for SA is easier to determine when different cost functions (e.g., cost1 and cost2 in Section 5.3.5) are used in optimization. The acceptance function of the simulated annealing (SA) 126 as described in the IGJB solver [59] is Pr(V |V ∗ ) = e−A(cost(V )−cost(V ∗ ))/T , where A is a constant, T is the temperature of the current round, cost is the cost function and V , V ∗ are the current cover and best known cover respectively. There are two challenges to using this acceptance function in our cover-finding problem: 1. we do not know the value of cost(V ) beforehand, so it is hard to choose parameter A for the desired acceptance probability and 2. for different cost functions (e.g., the cost function cost1 Equation 5.3.1 and cost2 Equation 5.3.2), the value of cost(V ) can be very different. We solve this by changing the “absolute” difference in the acceptance function to a “relative” difference: Pr(V |V ∗ ) = e−A(cost(V )−cost(V ∗ ))/cost(V ∗ )T . By normalizing the “cost difference”, this new acceptance function can be used universally with any input instance and cost function, and the parameter A can be pre-determined to implement a desired acceptance rate. For example, if we want a 20% worse solution to be accepted with probability 0.8 for temperature T = 1000, then by solving e−0.2A/1000 = 0.8, we set A = 5000 ln 1.25 ≈ 1115.7. We choose IGJB as the solver for W-SCP instead of other, even more recent solvers e.g., [8, 12] for a number of reasons. First, local search algorithms are easy to implement and searching can be very fast. W-SCP is an NP-hard problem, so essentially in a given running time (e.g., 1 minute), the more search steps performed the probability of a better solution being discovered is higher. Second, the use of SA makes the searching time predictable. When SA is run, the number of searching steps are fully determined by the cooling schedule of SA so that the number of searching steps can be adjusted to meet the condition of available computational power and the allowed processing time. This is a very desirable feature for the cover finding step. 5.3.7 Optimize the partition step Duplications in the cover must be reduced before partial aggregate queries are sent to data sources. To do so, we partition the cover from step I. Partitioning also ensures that a component is included in only one partial aggregate. The duplication is measured by the duplication ratio: 127 Definition 5.3.8. [duplication ratio] Let V = {ai |i ∈ [1..n]} be a set of selected sources, and let ai denote both the data source and the components in the aggregation it covers. The duplication ratio is defined as r = ∑ai ,a j ∈V,i< j |ai ∩a j | |∪ai ∈V ai | For example, the duplication ratio of set {C, E} in Example 5.3.1 is 0.2. Step II checks if the duplication ratio of a cover exceeds a pre-determined threshold, θ . If so, a partitioning is performed; otherwise the cover is directly used for query rewriting. In practice, θ is small, so duplication below this threshold only negligibly affects the result. We also make a linear adjustment to the final result obtained to further offset the marginal inaccuracy. For example, if v is the answer for a sum over 1000 components but the total population with duplication is 1005, then we adjust the result to v∗ = v × 1000/1005. Many aggregations do not require adjustment, e.g., avg, median, variance and stddev. Consider the following example: three sources A = [1..4], B = [3..7], C = [6..8] cover the set [1..8] with duplication. There is more than one way to eliminate duplication, e.g., partitions S1 = (A1 = {1, 2, 3, 4}, B1 = {5}, C1 = {6, 7, 8}) and S2 = (A2 = {1, 2}, B2 = {3, 4, 5, 6}, C2 = {7, 8}) both achieve the goal. All data sources are needed to cover the components in the aggregation since each covers at least one component that is not coverable by any other sources. Therefore partitioning does not change the number of data sources needed to process partial aggregates. Although all partitions contain the same number of data sources, they are not equally preferable. When components are deleted from a data source, we say the data source is adjusted. Here, S1 is preferred because it adjusts fewer sources than S2 . We give preference to partitions that adjust fewer data sources because this reduces the cost to process partial aggregates. If a data source is adjusted to exclude some components, the translated partial aggregates for it contains an additional join. The overhead of adjusting a data source lies in this additional join, which can be expensive when the component set is large. Unlike minimizing duplication in cover-finding (Section 5.3.5), communication overhead is not the biggest concern. This is because the total number of excluded components in all possible partitions is always the same and therefore the communication overhead is a constant for a given set of data sources. Therefore, the optimization goal for partitioning is to 128 minimize the number of adjusted data sources so that the least number of joins needs to be performed in succeeding partial aggregates. Hence we have the minadjustment partition problem: Definition 5.3.9. [min-adjustment partition problem] Given a set of sources A = {a1 , . . . , ak } which contains duplications and let a hamming function f (x, y) = 1 iff x = y and 0 otherwise. The min-adjustment partition problem finds a partition A = {ai } of A, so that Σi f (ai , ai ) is minimized. i.e., the number of data sources adjusted is minimized. The following theorem shows the complexity: Theorem 5.3.2. The min-adjustment partition problem is NP-hard. The complexity of min-adjustment partitioning problem is associated with the maximum-clique problem [95]. Definition 5.3.10. [maximum-clique problem] Given an undirected graph G = (V, E), where V is the vertex set and E is the edge set. A maximum clique is a subset of C ⊆ V such that for every two vertices i, j in C, edge (i, j) ∈ E and the size of the clique, |C|, is largest. Proof. Given a maximum-clique instance G(V, E) where V is the vertex set and E is the edge set, an instance A of the min-adjustment partition problem is constructed by creating a set ai for each vi ∈ V . For each ai , ai ∩ a j = 0/ iff e(vi , v j ) ∈ E. It is easy to verify that this reduction is polynomial. The vertices in the maximum clique of G then correspond to the set of ai ’s that is the optimal solution to the min-adjustment partition instance A. Thus we can use a maximum-clique solver to find the min-adjustment partition. An instance G(V, E) of a maximum-clique problem can be obtained from an instance of the min-adjustment partition problem by creating one vi ∈ V for each ai ∈ A, and edges e(vi , v j ) ∈ E iff ai ∩ a j = 0. / Greedy algorithms and local search algorithms are used by popular solvers and perform well in practice [95]. We use a sequential greedy solver which repeatedly adds a vertex with the largest degree to the clique until no new vertex can be added. 129 Our empirical study in Section 5.5.1 shows that the greedy solver runs quickly and performs well. The IGJB and the greedy algorithm selected for optimal cover finding and deduplication are approximate algorithms that have no guarantees on approximation ratio. For IGJB, the trade-off between time and expected approximation ratio depends on the parameters selected for the simulated annealing process; for the greedy partitioning, the performance solely depends on the input. The two algorithms are chosen because their efficiency and performance are shown to be very good in practice. Our empirical study in Section 5.5.1 shows that both algorithms give satisfiable approximation quality in practice. 5.4 5.4.1 Query Optimization for Phase 2 Answering Motivating phase 2 answering Aggregate queries are important and fundamental to relational databases. Business logic often heavily relies heavily on aggregations for analytical processing [20, 87]. The importance of aggregate queries remains when we shift from querying traditional RDBMSs to SEMantic Integration Systems (SemIS) [2, 3, 26], where multiple independently managed databases called data sources cooperate to answer queries in data integration systems (e.g., [23, 37, 88]), or a peer data management system (PDMS) (e.g. [7, 53, 93]). Data sources in a SemIS collaborate by mapping to each other’s schemas. Queries are answered by translating queries over these mappings across different data sources. For example, weather information databases maintained by individual cities form a SemIS to allow querying larger areas, where the data sources collect data from local meter stations or extract data from weather-forecast archives. One characteristic of query answering in a SemIS is that the answer is typically not unique. For example, when processing Select-Project-Join (SPJ) queries, SemISs generally return the union of answers from different data sources; the burden of deciding the answer to use is left to the users, much like the result of a search engine. The presence of multiple answers makes processing aggregate queries (e.g., avg) over a set of components in a SemIS challenging. The components have 130 to be retrieved from multiple data sources, which may hold different values. Hence, depending on which sources are used, there may be a wide variance in the finally aggregated answer. We call the range of aggregate answers produced by different component-source assignments “viable answers”. Having multiple “conflicting” answers to an aggregate query is a natural consequence of allowing data sources to independently collect and manage their data. Often, no global mediator is available to decide which value from which data source should be used for an aggregation. When these values are aggregated, this problem is compounded. In such cases, choosing one “best” value to return is hard. Even if we define the semantics to allow returning any one of the viable answers, the SemIS faces the problem of providing consistent scalar answers to users when the same query is re-evaluated. For example, consider querying the temperature of British Columbia (a province in Canada). This query can be answered by taking an “average” over meter station readings managed in multiple data sources using the following query: SELECT avg(t) as temperature, month FROM SemIS GROUP BY month HAVING avg(t) > 20 and using a SemIS to process the different readings from different data sources. The temperature readings on different data sources do not all agree with each other and one data source does not cover the whole province. Thus to process this query, the system needs to use a combination of data sources. Different such combinations yield different viable answers. Unlike the treatment for SPJ queries, we cannot simply return all viable answers because the number of viable answers rises dramatically with the number of possible combinations and the scalability of query processing is quickly doomed by the combinatorial explosion of possible data source combinations if all viable answers are to be returned. Suppose the aggregation contains N components and for each component there are B distinct values retrievable from data sources; the population of viable answers can be as large as BN where B can be several dozen and N as large as several hundreds. 131 Our answer to this problem is to treat the aggregate answers for semantic integration as a distribution formed by the viable answers and to answer an aggregate query by estimating the viable answer distribution and extracting distribution statistics. We use the process (which we describe in detail in future sections) in Figure 5.2 to come up with the following three outputs: 1. Key statistics for viable answer distribution: Point statistics such as mean, variance, and skewness provide general knowledge about the viable answer distribution. These help users to correctly choose and use scalar values for aggregate answers. 2. Near-optimal high coverage intervals: The “high coverage interval” information which we compute from the answer distribution tells where the majority of viable answers are concentrated. This “shape” sensitive information is particularly useful when the distribution is multimodal1 . 3. Stability measure for aggregate queries The “stability score” for an aggregate query tells how the viable answer distribution is likely to change when changes to data sources happen (e.g., peers become unavailable in a PDMS). One use for the stability score is that the SemIS can dynamically control the priorities of updating results for continuous queries. Our contributions in phase 2 query answering thus focus on designing efficient algorithms to extract accurate statistics for aggregate queries. Specifically: • We redefine aggregation answers in semantic integration as a distribution of viable answers computed from different data source combinations. • We define the three tasks above as specific requirements for answering aggregations: 1. returning point statistics with user defined confidence intervals; 2. providing “hot ranges” to reflect distribution shape information and 3. measuring the stability of an aggregation. • We provide algorithms to efficiently extract the desired information, including 1. estimating point statistics using sampling and ways to reduce sampling 1A distribution which the probability density function (pdf) has two or more significant peaks. 132 overhead while keeping the confident intervals tight; 2. a fast, greedy algorithm to extract hot ranges; and 3. quickly calculating stability scores under a probabilistic model without needing to simulate source removals. Overall, we can support online extraction of aggregation statistics. • We performed an empirical study using real-life and synthetic data sets, further verifying the theoretical and algorithmic claims on effectiveness and efficiency. 5.4.2 Overview of phase 2 answering Sample 400 points Bootstrap sampling 50 Boostrap samples each of size 400 use median stability score 6.6442 Identify high coverage intervals ({intervals}, length=33.65%, coverage=90.11%) Figure 5.7: An example distribution and outputs of answer estimates This section describes the set of statistics that we extract as well as how to find them. Figure 5.2 shows the workflow of the operations (marked with Roman numerals) and desired outcomes (in the output box). As mentioned in the introduction, the output of the answer distribution statistics consists of 3 parts: 1. the point estimate of mean and variance to the distribution with confidence intervals; 2. a high coverage intervals structure tells the range of values that viable answers have high probability to be in; and 3. the stability score for the aggregate query. We describe how to find these in detail in Sections 5.4.3 – 5.4.5 respectively. The remainder of this section provides a system overview and some examples. Our system begins by sampling the viable answers in steps I and II. Step III estimates the density function of the viable answers using the samples obtained in 133 step II. Depending on the multimodal-ness of the probability function, we either (for single-mode case) return point statistics (such as mean, variance) with confidence information or (for the multi-mode case) return high coverage intervals in addition to the point statistics (e.g., the shaded areas in Figure 5.7). Computing the high coverage interval also depends on two additional operations over the probability density estimates: mode seeking (step V) and coverage interval optimization (step VII). Finally in step VI, the stability analysis computes the stability score for the aggregate query. Figure 5.7 shows an example of the outputs that users may expect from processing an aggregation. In this example, 400 sample points (a.k.a. viable answers) are taken from the answer distribution, and 50 bootstrap sample sets, each of size 400, are taken to help obtain the 90% and 85% confidence intervals for mean and standard deviation (stddev). The right hand side of Figure 5.7 illustrates how the most significant value intervals (shown by shading) are identified with the greedy algorithm we describe in detail in Section 5.4.4. For the particular distribution in this example, eventually we output 3 intervals of length 33.65% of the value range that covers about 90% of the estimated probability distribution. Finally, a stability score 6.6442 is computed for the query. 5.4.3 Sampling and kernel density estimation To ensure efficiency and the quality of statistical inference, our sampling procedure should satisfy two requirements. First, a sampling scheme needs to ensure that every data source has an equal chance at inclusion and is selected independently to contribute to the aggregation. Second, the sampling scheme should work on the hierarchical aggregation structure so that the sampling workload is distributed to the data sources in the hierarchy. Note that the viable answers are not precomputed; therefore, the process of taking a viable answer sample is to (1) decide (for each sample point) the assignments of aggregate components to data sources (through a random walk) and (2) compute the aggregation answers using the chosen assignments. We designed a sampling scheme called “uniS” that satisfies both requirements. uniS sampling ensures the uniformity over data sources; it works as follows. As- 134 sume that each node has a set of components. UniS uniformly chooses one data source S and uses all the components in S for the aggregation. This is repeated until the chosen data sources (L) cover all the components in the aggregation. In our implementation, we use a bitmap (M) to record the assignment of aggregation components and test if a cover has been reached. The bitmap is initialized as a set, M0 , containing all the components in the aggregation and a list of chosen data sources L0 = 0. / Then a random data source is chosen and (M0 , L0 ) is sent to it. Suppose at the i-th step, data source S j is chosen and it receives (Mi−1 , Li−1 ). Let A j = Mi−1 ∩ S j If A j = 0, / then S j updates Mi = Mi−1 − A j and Li = Li−1 ∪ { j}. If Mi = 0, / S j then pick another data source to send the updated bitmap to, otherwise it sends a message to all the data sources in list L to start the computation of a viable answer using the assignments achieved in this sampling process. For data source S j in the list L, the partial aggregate is computed with components in A j . Figure 5.8 illustrates the process of uniS sampling for two different paths: (A, B, E), and (B, A, E). Assume that the goal is to find values for the set {s1, s2, s3, s4, s5}. For path (A, B, E), the algorithm begins at node A and uses all of the components in A; thus, M1 contains the remaining two components (s4 and s5). When UniS next samples B, it only uses B’s value for s4, since it has already sampled a value for s1. Finally, all values are sampled once uniS samples s5 at E. Following the path (B, A, E) yields a different viable answers since uniS takes both s1 and s4’s values from B since B is visited first. It is easy to verify that uniS satisfies our sampling requirements. Another advantage of uniS sampling is that because we always greedily assign to a chosen data source as many components as possible, uniS often very quickly assigns all the components in the aggregation to data sources; thus, a viable answer is sampled. Drawing a sample from the hierarchy involves processing the aggregate query with the selected source/component assignment. Although hierarchical aggregation helps to distribute the computational load of each aggregation, it is still considered as a costly operation; thus it is desirable to minimize the aggregations performed. To this end, we apply bootstrap sampling on the sampled viable answers to improve the confidence for statistical estimates such as mean and variance. Be135 {B, A, E} {A, B, E} A D B C UniS Sampling Node Components A {s1,s2,s3} B {s1, s4} E {s1, s5} F E Node Path (A,B,E) M1:{s4, s5} A L1:{A} AA:{s1,s2,s3} Node Path (B,A,E) M1:{s2,s3,s5} B L1:{B} AB:{s1, s4} B M2:{s5} L2:{A, B} AB:{s4} A M2:{s5} L2:{B, A} AA:{s2, s3} E M3:{} L3:{A, B, E} AE:{s5} E M3:{} L3:{B, A, E} AE:{s5} Figure 5.8: UniS sampling results for the selection paths of {A, B, E} and {B, A, E} respectively. Mi shows which components still need to be covered, Li shows which sources have been used so far, and Ai shows which components are chosen from the given node/data source. The arrows show the different selection paths for Mi and Li . cause bootstrapping usually outputs tighter confidence intervals, aiming for a fixed confidence (e.g., 95%) and desired confidence interval length, using bootstrapping requires a smaller sample set than not using bootstrapping. In our solution, we report confidence intervals for both mean and variance for whatever level of confidence the user requires. Although statistics such as mean and variance are generally useful in describing distributions, their usage is limited when the shape of the distribution deviates from well-known shapes (e.g., the bell shape of Gaussian distributions). Differently shaped distributions can share the same mean and variance but deliver very different information to the query answers. For example, the two distributions shown in Figure 5.9 have the same mean (5.0) and variance (5.0). It is easy to observe that the mean value is useful for the one on the left and is less informative for the one with two modes. 136 Density 0.1 0.05 Density 0 −5 0.2 0 5 10 15 20 0 5 10 15 20 0.15 0.1 0.05 0 −5 Figure 5.9: Two distributions with the same mean (5.0) and variance (5.0), but having very different shapes. To better understand how the query answer is distributed, we need to estimate its probability density function. We used kernel density estimation (KDE — Section 5.1.3) for this task. Two methods are used in addition to the standard KDE (Section 5.1.3). The adaptive method in [16] automatically chooses bandwidth and bagging [17, 18] method to combine KDE and bootstrapping. The above two methods help to obtain a density estimation that is both smooth and stable, which is required by later operations of extracting high coverage intervals and computing stability scores. 5.4.4 High coverage intervals and optimization The density function obtained from the bootstrap samples allows us to perform more intensive analysis on the shape of the distribution, which is impossible with only point estimates. The shape of the density function delivers a lot of information for aggregation answers. In particular, for a distribution with density function, f , defined on the range, R, of viable answers, we return a set of “high coverage intervals” defined as follows: Definition 5.4.1. [high coverage intervals] 137 A high coverage interval is a triple (I, L,C) where I = {Ii , ci } is a set of disjoint intervals within R and ci = Ii f (x)dx is the coverage of Ii ; L = ∑i |Ii | |R| is the fraction of the intervals’ total length to the viable answer range; and C = ∑i ci is the total coverage. The intervals, I, in Definition 5.4.1 can be used by databases that support numerical values with uncertainty (e.g., [89]). In such databases, an attribute is represented as a set of (value, probability) pairs V = {(v, pr)} where v is a possible value (range) of the attribute and pr is the probability for the attribute to take v. We can use V = (Ii , Cci ) if v is allowed to be a range and use the center of Ii if the uncertainty model allows only scalar values for v. In this way, the high coverage interval can be used directly as the aggregate answer for uncertain databases. A user defines a desired coverage θ (e.g., 90%) and our task is to find a high coverage intervals such that the total coverage C ≥ θ and L is minimized. We define it as the “coverage interval optimization problem (CIO).” Formally: Definition 5.4.2. [coverage interval optimization problem] Given a density function f for a distribution defined on a finite range, a coverage threshold 0 ≤ θ ≤ 1 and a constant k. The coverage interval optimization problem finds up to k intervals I1 ..Ik to k minimize ∑ |Ii | i=1 k sub ject to ∑ f (x)dx ≥ θ i=1 Ii For distributions with a single mode (e.g., a Gaussian), the optimal solution is the classical 100% ∗ θ confidence interval around the mode (it does not necessarily have to be symmetric w.r.t. the mode depending on the skewness). The coverage intervals are more useful for distributions that are multi-modal, i.e., where the locations of high coverage intervals are more informative. The following observation shows that to solve CIO, it is often a good idea to find intervals around the modes in the density function of the answer distribution. 138 Theorem 5.4.1 (CIO mode containment property). If the probability density function, f , of a distribution (1) has t modes and (2) is 2nd order differentiable everywhere on its range, and the optimal solution for CIO has k ≤ t intervals, then the largest k modes are contained by the (k) intervals in the optimal solution. Proof. If an optimal solution S for CIO has k intervals, but the i-th largest mode (xi , f (xi )), i ≤ k is not in S, then there must exist an interval s which f (x) < f (xi ) for all the points x ∈ s; therefore we can construct a new interval around xi and improve the previous result. Inspired by the mode containment property, we developed a greedy algorithm to search for CIO solution. The algorithm is illustrated in Algorithm 5. The algorithm works on the probability density function we obtained from kernel density estimation (step III in Figure 5.2) and greedily extracts intervals to solve CIO. The inputs to the greedy algorithm are the PDF f , the desired coverage θ and t modes2 of f . During the process, the algorithm greedily picks up new intervals around the modes (lines 5–7) or extends previously picked intervals (lines 9–11) for the highest t − 1 modes. For the last mode, the algorithm sets the interval to cover the average amount of the additional coverage needed (lines 17,18). The algorithm returns the obtained high coverage intervals if the desired coverage is met or it has finished searching all the t modes. Because (1) the returned intervals may not reach the desired coverage, and (2) optimally choosing the next interval to extend coverage requires picking the +/− I j that has the minimal | f (xt )| (i.e., the largest incremental on coverage), the greedy algorithm is an approximation to CIO. Obtaining an optimal solution requires knowing the first derivative of the density function at every point, which carries a substantial computational overhead. Therefore, we decided to use the above greedy algorithm. Another reason for not pursuing the full optimal solution is that the density function itself is an estimation thus has a built-in error. Our empirical study suggests that the greedy algorithm gives a pretty good approximation and it runs quickly. While CIO is useful in many situations, the dual of CIO is desired when we 2 Because f is one dimensional, the modes are easily computed numerically. We omit details on mode seeking. 139 Algorithm 5: Greedy algorithm for CIO input : Estimated density function f over range R input : Desired coverage θ input : A set M = {(xi , mi )} containing t modes of f output: a high coverage intervals structure (I, L,C) 1 begin 2 C ← 0.0; i ← 1; array s; Ω ← 0; / 3 Sort M by mi in descending order; 4 while C < θ , i ≤ (t − 1) do 5 xi− ← largest x s.t. x < xi , f (x) = mi+1 ; 6 xi+ ← smallest x s.t. x > xi , f (x) = mi+1 ; 7 s[i] ← (xi− , xi+ ); Ω ← Ω ∪ s[i]; C ← Ω f (x)dx; j ← 1; 8 while C ≤ θ do 9 x−j ← largest x s.t. x < x j , f (x) = mi+1 ; 10 x+j ← smallest x s.t. x > x j , f (x) = mi+1 ; 11 S[ j] ← (x−j , x+j ); Ω ← Ω ∪ s[ j]; C ← Ω f (x)dx; 12 j ← j + 1; 13 end 14 i ← i + 1; 15 end 16 if C ≤ θ then x+ 17 Find (xt− , xt+ ) s.t. xt− < xt < xt+ , x−t f (x)dx = 1t (θ −C); t 18 19 20 21 22 23 24 25 26 s[t] ← (xt− , xt+ ); Ω ← Ω ∪ s[t]; C ← Ω f (x)dx; end Let ω1 ..ωk be disjoint intervals s.t. k1 ωi = Ω; foreach ωi do ci = ωi f (x)dx; end I ← {(ωi , ci )}; L ← ∑k1 |ωi |/|R|; return (I, L,C); end are constrained to a pre-determined interval length and is asked to return the best possible coverage. Formally: Definition 5.4.3. [dual problem of CIO] The dual problem of CIO is to optimize the selection of intervals to maximize 140 the coverage. The optimization is to k maximize ∑ f (x)dx i=1 Ii k sub ject to ∑ |Ii | = L i=1 , where L is a parameter for the optimization constraint specified by the user. The greedy algorithm can be easily modified for the dual problem of CIO. We only need to modify the termination criteria to check if the total length of the current set of intervals exceeds length, L, and return the size of covered region. Figure 5.7 shows how the high coverage intervals are discovered for a distribution whose density function has 3 modes: the greedy algorithm starts from the highest mode and extends the coverage to lower modes until the coverage of the currently discovered intervals meets the coverage requirement. In the example, eventually the 3 intervals are reported together with the total coverage of 90.11%. The returned intervals carry important information about the aggregation answers. For a fixed coverage θ (e.g., 0.9) and a single mode distribution, if the returned interval length is small, it means that different combinations of sources result in similar aggregation answers. It often indicates that the user can be quite confident of the returned answer. We still need to be careful to correctly interpret the result. A small interval length does not necessarily mean that all sources are holding the same value for the same component. It can be the case that the values of some components dominate others. For example, some components are significantly larger than others in a sum aggregation. 5.4.5 The stability score for query answers The point statistics and coverage intervals (outputs 1 and 2 in Figure 5.2) provide static information on how viable answers are distributed. However, most SemISs (e.g., PDMSs) are dynamic: data sources may freely leave the system. One natural question is how to keep aggregate answer statistics up-to-date given that the departure of data sources will affect the aggregate answer distribution. To answer this question, we define a new property: the stability of the aggregate query. 141 Stability can be quantified by measuring the difference between the distribution of the answers with and without the removal of a small number of data sources. One method is to use the distance measures that we introduced in Section 5.1.4. Let M be such a measure and P, Q be the viable answer distributions3 without and with some data sources removed respectively; the scalar distance M(P, Q) tells how much difference is caused by the removal of those data sources. The remaining difficulty is that removing different sources results in different M(P, Q) value and we need a measure of “change on average”. We can model the source removal as a stochastic process and use the expectation of M(P, Q), E[M(P, Q)] as the stability measure for the answer distribution, which we define as the stability score of an aggregation. Definition 5.4.4. [stability score of an aggregation] Let P be a viable answer distribution w.r.t. a set of data sources D, M be a distribution difference measure, and Qs be the viable answer distribution after randomly removing |s| sources from D. The stability score of the aggregation is defined as . StabM (x) = − log(E[M(P, Q)]) = − log( M(P, Qs ) f (s)ds) (5.4.1) s∈S where E stands for expectation, S is the set of all possible choices of removing |s| data sources, and f (s) is the probability of choosing (a particular) s. We use the negative logarithmic over the expected distribution difference as the stability score so that a higher stability score indicates that the viable answer distribution is expected to change less against data source changes, i.e., it is more stable. When no additional knowledge says which source is more likely to leave, we apply an “equal chance” assumption on data source removals thus f (s) is a constant |D| f (s) = 1/(|s| ) given T the number of data sources to remove. Another assumption needed for stability analysis is that the number of data sources to be removed is small i.e., |s| |D|. This assumption is valid because if a large number of data sources are updated, then the system will re-evaluate all the queries regardless of 3 We use P, Q (capitalized) for distributions and p, q (lower-cased) for density functions. 142 the stability scores. The stability analysis is useful when the system monitors aggregation answers from multiple queries. When a small number of sources are leaving the system or the data on some sources is updated with new values, the stability scores tell the system which query being monitored will first need an update. Next we describe how to compute a stability score. The output of the kernel density estimation (Section 5.4.3) is used as the answer distribution with all the data sources. The density function p(x) (for distribution P in Equation 5.4.4) is expressed as p(x) = 1 n x − xi K( ). ∑ nh i=1 h where xi are a set of n sample points and K, h are the kernel function and bandwidth parameter respectively. To estimate the density function of the viable answer when a set s of data sources are removed, let Ts be the set of sample points(xi s) that computing an xi in Ts require at least one data source in the removed set s. To estimate the new distribution, we need to exclude the points in Ts , which gives the new distribution qs (x) = x − xi 1 K( ) (n − |Ts |)h x∑ h ∈T / i s and we can use the following form, qs (x) = n 1 x − xi p(x) − ). K( ∑ n − |Ts | (n − |Ts |)h xi ∈Ts h where the first term with p(x) is a constant given a query and the difference is only on the second term, which changes with different choices of Ts . One can simulate different choices of Ts and use the empirical expectation 1 ˆ E[M(P, Q)] = M(p(x), qs (x)) ˆ ∑ˆ |S| s∈S to compute the stability score, where Sˆ is the set of simulated source removals. For a simulation that removes up to |s| sources from a total of |D| data sources, |D| the space S of possible source removals can be as large as (|s| ). For simulations 143 ˆ that produce good E[M(P, Q)], the size of Sˆ need to be in the size of O(|S|) (e.g, ˆ = ε|S| for a constant ε ≤ 1). Therefore, the simulation is computationally hard |S| when |D| is large. We find that under the equal chance assumption for source removals, analytical stability score can be obtained for probability difference measure L2 therefore no simulation is needed. The result is stated in Theorem 5.4.2. Theorem 5.4.2 (L2 stability score). Given a set of n samples of viable answers x1 to xn , the stability score (Stab(x)) for L22 is 1 c 1 2 √ ∗ StabL2 (x) = − log( (1 − Ψ)) 2 n(n − 1) 2nh π 1 − c where h is the selected bandwidth in KDE (using Gaussian kernel) and the change ratio c is estimated by c = 1 − (1 − t |s| ) |D| in which t is the average number of sources needed for an answer and Ψ is called the “mutual impact factor” 2 /4h2 Ψ = ∑ e−(xi −x j ) i, j Proof. To proof the results in Theorem 5.4.2, first, observe the following properties for Gaussian distributions. Let f1 (x) ∼ N(µ1 , σ 2 ), f2 (x) ∼ N(µ2 , σ 2 ) (same variance as we needed), then 2 2) 1 2 − (x−µ1 )2 +(x−µ 2σ 2 e f1 (x) f2 (x) = √ 2πσ 2 1 µ1 + µ2 σ 2 − (µ1 −µ22 )2 = √ , )e 2σ N( 2 2 2 πσ 2 144 2 As a special case, let f (x) = − (x−µ) √ 1 e 2σ 2 2 2πσ ∼ N(µ, σ 2 ), we have 2 1 1 − (x−µ) [√ e 2σ 2 ] f 2 (x) = √ 2 πσ 2 2πσ 2 σ2 1 N(µ, ) = √ 2 2 πσ 2 With the above preparation, we can compute the integral of the square of the density estimation function. Recall in KDE using Gaussian kernel, the density x−xi n 1 nh ∑1 K( h ) f 2 (x)dx function is estimated as: f (x) = square of the density, let α = α = = We now expand the integral on the n 1 x − xi 2 [ K( )] dx ∑ 2 (nh) 1 h 1 (nh)2 2 + (nh)2 (5.4.2) n x − xi )dx h 1 x−xj x − xi ∑ K( h )K( h )dx i, j ∑ K2( (5.4.3) (5.4.4) The kernel function has a Gaussian form: K( (x−xi )2 x − xi 1 ) = √ e− 2h2 h 2π , so h1 K(x) ∼ N(xi , h2 ). Switching the integral and summation in Equation (5.4.3) and: (5.4.4), (5.4.3) is simplified to 1√ , 2nh π and (5.4.4) can be written as 1 √ 2 n h π ∑ e− (xi −x j )2 4h2 i, j . We can see that the integral (α) is a function over all the data points. Now recall the stability analysis using the square L2 distance. Let f (x), fS (x) be the original and augmented density. We want to compute E[ ( fS (x) − f (x))2 dx]. Since E( fS (x)) = f (x), this is the integral of the pointwise variance of the aug- 145 mented function. It thus equals: DL2 = E[ 2 fS2 (x)dx] − f 2 (x)dx . Using the above results on Gaussian square, and letting 2 /4h2 • Ψ = ∑n1 e−(xi −x j ) • fS = 1 h(n−Ts ) 2 /4h2 s −(xi −x j ) e ∑n−T 1 • E[Ts ] = c ∗ n for some constant c we have DL 2 = 2 1 c 2 √ ∗ (1 − Ψ) n(n − 1) 2nh π 1 − c and thus the stability score formula in Theorem 5.4.2 follows. We can see that the distance is related to the viable answer distribution P, and the average fraction affected answers when some sources are removed. Also it is easy to verify that when all the data points coincide, the distance is 0, which means most stable (the corresponding stability score is ∞). Now we try to find out when s sources are removed from a total of S data sources, how many answers are likely to be affected. This number relates to the number of data sources are required for an answer. Suppose on average we need t (also called the weight) out of S data sources for a viable answer, then the fraction c= (tS )−(tS−s ) (tS ) is an estimate for the fraction that get affected. We acknowledge that not all t combinations are answers to the query but this is still a good estimate when information of sources’ individual coverage is not available. Moreover, the weight itself includes some of this information. Let p be the average source coverage for components, then larger p is, the smaller t will be. Another way is to estimate the expected number of samples that becomes invalid when s sources are randomly removed (so all the sampled viable answers depending on the removed sources become invalid). This can be done by simulation with the sample set or using c = 1 − (1 − St )s which assumes that data sources uniformly contribute to aggregate answers. This completes the proof of Theorem 5.4.2. 146 The L2 measure is applied to the viable answer distribution P and by applying a measure over P2 , we can assess the 2nd moment stability for a viable answer distribution. I.e., the L2 measure gives the expected change of viable answers, and a 2nd moment stability gives the variance of such changes. We chose the Bhattacharyya distance as the distribution difference measure for the stability score because under this measure the 2nd moment stability can be analytically derived without the need of source removal simulations. Corollary 5.4.3. The stability score for BD (Bhattacharyya distance) over the square of the viable answer distributions is StabBD (x) = − log( 1 1 √ + 2 √ Ψ) 2nh π n h π where n, h and Ψ are as defined in Theorem 5.4.2 Proving the corollary requires the same technique for Theorem 5.4.2. Since the expectation of the density qs is just p, the order of expectation and integral must be changed accordingly; the result follows. The stability score measures the likelihood of changes to query answers along with data source availability and updates. The users and the system can use the stability score to decide if a query needs to be re-evaluated, especially in a scenario where multiple continuous queries are managed, the stability score helps to determine the answer updating priority. 5.5 5.5.1 Empirical Study Empirical study for phase 1 answering We now present results from an empirical evaluation of the proposed query optimization for phase 1. The IGJB algorithm for cover-finding and the sequential greedy algorithm for partitioning are implemented. Our implementation was done in C++. We implemented the IGJB local search algorithm for the cover-finding step. For partitioning we apply a sequential greedy algorithm. The MCI was implemented with equi-depth histograms. We ran our experiments on an Intel Core 147 2 Quad 9400 CPU (2.66GHz) with 8GB of memory running SUSE Linux. The program is single threaded, so only one core was used during evaluation. Source selection was simulated in a program which connected the source selection framework with query rewriting and a relational database to complete the query processing. A MySQL database on the same machine provided schemas, data instances, and process relational queries. While porting the framework to a grid-based, distributed evaluation platform is in progress, the source selection operation does not make use of parallel computing, thus the observations made here also applies to what will be in a Grid environment. The experiments use the real life JIIRP data set described in the introduction. The data set has 31 cell instances in S1.Cell. Cells are defined over 364 buildings in schema S2.Buildings. The data for S2.Assessment is in 16 assessment tables following the same schema. Because monetary loss assessment is performed with multiple criteria (e.g., the year, the material (concrete, wooden) and the functionality of buildings), buildings in the 16 assessment tables overlap. The data source S2.building is used to perform source selection and each of the 16 assessment tables acts as a data source, cooperating to answer queries. Because the current JIIRP data set is small (all query processing finished within 1 second), it is hard to test scalability. Thus we generated 3 synthetic data sets to further test source selection performance. The synthetic data sets scale up the JIIRP data set; they use the same schema as the JIIRP data set and process the same query for generated, scaled up cells. The synthetic data sets, D1, D2 and D3, have cells defined over 600, 800 and 1000 buildings respectively. We generated 600 data sources in schema S2.Assessment for each synthetic data set. For data sources in D1, the number of buildings assessed in S2.Assessment is uniformly distributed in range [20, 40], matching the mean number of buildings in the JIIRP data set; the distribution is uniform(10, 20) for D2 and uniform(30, 50) for D3. We also tested the source selection performance with different number of data sources. The data sets and parameters used are summarized in Table 5.1 and Table 5.2 respectively. Throughout these experiments we process sum aggregations over buildings distributed over data sources; this is the scenario described in the introduction. 148 Dataset JIIRP D1 D2 D3 # of buildings 364 1000 1000 1000 # of sources 16 600 600 600 source size distribution max:66, min:12, avg: 31 uniform(20,40) uniform(10,20) uniform(30,50) Table 5.1: The 4 data sets used in empirical study Parameter N S D c (A, T, dt, tsize) θ K Default value 800 400 D1 min-source (1k, 1k, 40, 50) 0.05 5 Meaning aggregation size # sources data set name cost function for IGJB IGJB parameters dup. ratio threshold source covers to output Table 5.2: Parameters in experiments Results of aggregate queries Figure 5.10(a) and Figure 5.10(b) show the aggregation results for querying 2 cells consisting of 800 and 1000 buildings respectively. For each cell, the figure shows 500 aggregation results from 100 query processing runs where all 600 data sources are allowed in source selection. We also show the range of valid aggregation answers [inf(V ), sup(V )] calculated by taking the max (min) value of one building’s assessment across all the data sources and summing them. 4 4 1.8 x 10 Aggregation Value Aggregation Value 1.6 1.4 1.2 1 0.8 0 200 400 600 800 x 10 1.6 1.4 1.2 0 1000 runs (a) N=800 200 400 600 runs 800 (b) N=1000 Figure 5.10: Aggregation over 800 and 1000 buildings 149 1000 All the aggregation results obtained are in the valid interval. This includes aggregations computed (with linear adjustment) from covers containing duplication (θ > 0.95). Most of the results are in the middle of the valid range as expected. The results also suggest that we can relax the threshold for partitioning as the computed results are quite far from the upper and lower limits. The graphs in Figure 5.10suggest correlation between answers. This is because some data sources are frequently selected by the IGJB optimization process. Cover quality : # of sources and duplications We measured the quality of the cover finding process and report the results for the min-source cost function and the min-duplication cost function in Figure 5.11 and Figure 5.12 respectively. 80 100 IGJB (min−source) Greedy Random 80 average number of selected sources average number of selected sources 100 60 60 40 40 20 0 IGJB (min−source) Greedy Random 20 N=600 N=800 0 N=1000 (a) 400 sources N=600 N=800 N=1000 (b) 600 sources Figure 5.11: Number of sources needed in cover-finding using the minsource cost function Figure 5.11 shows the average number of sources in the output of the coverfinding step. We compare the number of sources selected from the initial greedy heuristic to the final optimized result after performing the IGJB local search algorithm and to the number of sources needed if a “random” selection of sources is performed. The random selection picks sources randomly until all required components are covered. The greedy and IGJB optimizations greatly reduce the number of sources required in query processing. The greedy heuristic works well in most cases and the IGJB local search adds a further 15% gain on average. Figure 5.12 shows the results for the min-duplication cost function, which minimizes duplication among the selected data sources. We used the same settings as 150 3000 4000 IGJB(minoverlap) Greedy Random total duplications total duplications 4000 2000 1000 0 N=600 IGJB(min−overlap) Greedy Random 3000 2000 1000 0 N=800 N=1000 (a) 400 sources N=600 N=800 N=1000 (b) 600 sources Figure 5.12: Total duplication observed in cover-finding using minduplication cost function in Figure 5.11, but this time we measured duplication among the selected data sources. Both the greedy and IGJB optimizations help choose data sources with significantly smaller duplications than a random selection; IGJB optimization further improves the greedy results with local search. For the min-duplication cost function, the search improvement is between 20% − 25%, a bit higher than observed with the min-source cost function. Figure 5.13(a) and Figure 5.13(b) show the cover finding quality with the 3 data sets that differ in the number of buildings in each assessment data source using the min-source cost function and the min-duplication cost function respectively. 4000 80 IGJB(min−source) Greedy Random total duplications average number of selected sources 100 60 40 20 0 D1 D2 3000 IGJB(min−duplication) Greedy Random 2000 1000 0 D3 (a) c =min-source D1 D2 D3 (b) c =min-duplication Figure 5.13: Cover-finding performance Figure 5.13 shows that our observations on data set D1 in Figure 5.11 and Figure 5.12 apply to the other data sets. Since the data sources in D2 contain fewer records, the same query requires more sources. Similarly, fewer sources 151 are selected from D3 to cover components in a query. Figure 5.13(b) shows that because a cover needs more sources for D2, it is harder for the IGJB optimization to improve the greedy results. Partition quality We use the number of adjusted data sources and the overall message overhead to measure the performance of the partitioning step. Figure 5.14 shows the number of adjusted sources when min-source and min-duplication measures are used in the cover-finding step. Again, we compare with the case when random selection is applied in the cover-finding step. On each box in Figure 5.14, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers. The population of each statistics is 200. 0.65 #adjusted sources / # sources selected #adjusted sources / # sources selected 0.65 0.6 0.55 0.5 0.45 0.4 N=600 N=800 N=1000 0.6 0.55 0.5 0.45 0.4 (a) c=min-source N=600 N=800 N=1000 (b) c=min-duplication Figure 5.14: Partition: ratio of adjusted sources For both min-source and min-duplication, the number of data sources that need to be adjusted in order to fully eliminate duplicates is about half the number of the sources selected from the cover-finding step. Comparing Figure 5.14(a) and Figure 5.14(b) shows that although about half of the selected data sources need to be 152 adjusted, this ratio is smaller when the cover-finding step uses the min-duplication cost function to minimize the duplications among data sources. While the minsource cost function discovers covers using fewer sources, the absolute number of adjusted data sources are similar for the two cost functions. Query processing speed Our experiments show that the IGJB local search algorithm used to find source covers dominates the source selection time. The greedy partitioning algorithm takes less than 1/10 of a second to finish. Figure 5.15 shows the time used by the cover finding step. We can see that the time required by cover-finding is linear in the number of searching steps in the simulated annealing process of IGJB in all the compared cases, and the algorithm finishes 5000 searching steps within 1 minute. The two solid-red lines show that IGJB performs faster with the min-source cost function than with min-duplication cost. This is caused by the extra time needed to compute duplication every time a new cover is formed. The 3 blue-dotted lines shows that it takes longer to cover a bigger set and the 3 cyan-dashed lines shows that it takes more time for larger candidate sources. 80 time(second) 60 40 min−source min−duplication N=1000 N=800 N=600 S=600 S=500 S=300 20 0 0 1000 2000 3000 4000 searching steps 5000 6000 Figure 5.15: Time used in cover-finding Our results also show that increasing the search time has only marginal improvement on the results. This is observed both for the min-source measure and the min-duplication measure on experiments over all the three data sources. In 153 most runs, the best answer is discovered within 1 minute. Our experiments empirically verify the feasibility and scalability of the proposed source selection framework. Although we have not purposely selected and optimized the solvers (IGJB for cover-finding and greedy for partitioning), they already demonstrate strong support for the source selection process in answering aggregation queries. While we acknowledge that the simulation platform and the use of synthetic data sets may introduce some distortion, the results we obtained from them encourage us to perform further research. 5.5.2 Empirical study for phase 2 answering We empirically tested the extraction of aggregate statistics using synthetic and Canadian climate data [25]. The real-life climate data results show the applicability of our techniques in real applications, while the synthetic tests allowed us to scale various parameters to verify the observations and predictions made in the analysis. Table 5.3 summarizes the data sets that we used. Table 5.4 summarizes the parameters used for the experiments. Data D1 D2 Notes Single Gaussian A mixture of four Gaussians D3 A mixture of Gaussians, Cauchy and Gamma (Real-life data set) Monthly climate data 2006 C Parameters µ ∈ [10, 20], σ = 2.0 µ ∈ [10, 20], [25, 35], [40, 50], [55, 65], σ = 0.5, weight = 12 : 5 : 2:1 µ ∈ [10, 20], σ = 1, ∞, 1 1672 stations, 104 measuring districts Table 5.3: Data set details 154 Notation S N n 1−α V B Meaning #data sources aggregation size sample size confidence interval bootstrap size bootstrap sample size Default value 100 500 400 90% 50 =n Table 5.4: Parameters in empirical study Sampling and bootstrap improvements We first present the benefit of the bootstrap method in deriving tight confidence intervals for point statistics. Table 5.5 compares the confidence interval reported by the bootstrap process and the required sample size for direct inference. In Table 5.5, the length of confidence interval (CI) returned from direct inference is used as the baseline for each individual sampling and we report the maximal, average and minimal improvements using a improvement ratio defined as r= n 200 200 400 400 CI length f rom direct in f erence CI length by bootstrapping 1−α 0.8 0.9 0.8 0.9 max r 4.248 3.309 2.896 2.293 average r 2.556 2.119 2.001 1.655 min r 1.010 1.012 1.063 1.010 Table 5.5: Bootstrap improves confidence interval We can see that by using bootstrap sampling (with BCa ) with sample sizes of 200 and 400, the confidence intervals returned by bootstrapping is half (ratio=2) of that guaranteed by direct inference method; it is 3 to 4 times tighter when the sample size is small (200 in the case), we recorded up to 7 times tighter confidence intervals with samples of size 100. Table 5.6 shows the savings on the required sample size in order to reach the 155 confidence interval achieved by using the bootstrap method. Similar to the improvement ratio for Table 5.5, the tighter confidence intervals are translated to the savings on the required size of samples if the same CI length that bootstrapping reports needs to be achieved with direct inference. Hence the saving ratio is defined as s= sample size that direct in f erence needs sample size that bootstrapping uses n 200 200 400 400 1−α 0.8 0.9 0.8 0.9 max s 18.1 10.96 8.39 5.26 average s 7.36 4.84 4.28 2.82 min s 1.02 1.02 1.13 1.02 Table 5.6: Savings on required sample size We can see from Table 5.6 that the average savings on the sample size is about a factor of 4, which is pretty significant. We also observed that the numbers roughly agree with the numbers we have for confidence interval length in Table 5.5, where we showed that for k greater than one, achieving a k times tighter confidence interval requires observing k2 as many samples. From the result we can see that the confidence interval reported with bootstrap sampling is much tighter than using direct inferencing especially when the theoretical upper-bound of variance is large. High coverage intervals We present the results of high coverage intervals detected and the stability scores for the 4 “sum” aggregations S1 to S4 where S1 and S2 sum climate data in C and S3 and S4 sum up values generated in D3 (mixture of 3 distributions). First, we compare the 4 density estimates shown in Figure 5.16. We see that aggregations on climate data sets show multimodal distributions for viable answers and when components in the aggregation are differently distributed, there can be many modes in the resulting viable answer distribution (8 in the case shown in the rightmost sub-figure). We can see that by returning intervals in “dense” areas, we use a small set of 156 −3 2 −3 x 10 2 2 intv. cover 85.72% area, length= 1791.83 (22.723944%) 1.5 1 0.5 0.5 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 2 intv. cover 85.44% area, length= 1910.71 (24.749817%) 1.5 1 0 0.6 x 10 0 0.6 1.5 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 4 x 10 (a) S1 (Climate Data) (b) S2 (Climate Data) −3 −3 2 x 10 2 9 intv. cover 73.80% area, length= 3769.69 (37.637296%) 1.5 1 0.5 0.5 0.8 1 1.2 1.4 1.6 x 10 10 intv. cover 92.12% area, length= 5288.17 (55.528435%) 1.5 1 0 0.6 1.5 4 x 10 1.8 0 0.6 0.8 1 1.2 1.4 1.6 1.8 4 4 x 10 x 10 (c) S3 Mixture of 3 dist. (d) S4 Mixture of 3 dist. Figure 5.16: Multi-mode distributions and high coverage intervals intervals (under 25% for S1 , S2 with 2 modes, 37% for S3 with 7 modes) to cover the majority of the distribution. For the figures shown, the mean values of all 4 distributions are in the central flat area; expanding confidence intervals centering at the distribution’s mean will result in very wide confidence intervals. We can see that for the multi-modal case, the intervals we reported are much tighter than other conventional choices. Although the greedy algorithm does not guarantee the optimal solution that has the smallest possible interval, the output of the greedy algorithm gives a good approximation. Table 5.7 reports the approximation ratio of the greedy algorithm for the 4 distributions. The optimal answer is constructed by slicing the range of the density function uniformly into 4096 pieces and then taking the sum of the top t slices that cover the desired probability measure. Note that although this method is more likely to return intervals that are “tighter” than our greedy method, it does not guarantee the continuity of the returned intervals; thus we used the greedy method in our solution. The approximation is better if the “approx.” value is closer to 1.0 (it will always be larger or equal to 1.0). We can 157 fig a b c d Greedy 0.2272 0.2475 0.3764 0.5552 Optimal 0.2272 0.2475 0.2724 0.5150 Cover 85.72% 85.44% 73.82% 92.12% approx. 1.0 1.0 1.38 1.08 Table 5.7: Approximation ratio of the greedy CIO algorithm see that the coverage intervals in S3 have some room to improve; as shown in the table, S3 ’s approximate ratio is 1.38, which is larger than the other 3 distributions. Stability We designed a simulation process to test the effectiveness and sensitivity of the L2 stability score. The simulation worked as follows: we deleted one data source from the 100 sources and drew samples from viable answers that are computable from the remaining 99 sources (e.g., for climate data set, it is 1 out of 104 reporting districts). We recorded the means of the viable answers computed from 99 data sources (103 sources for climate data set). The deviation maps in Figure 5.17 show the changes on the sample means when different data sources are disabled. For each circular graph, the center is the mean (µ) of the valid distribution when no data source is removed; the points represent the means (µs ) of the viable answer distribution when different data sources are removed. The distance d between the data points and the center is defined as d = |µs −µ| µ . Comparing the four deviation maps in Figure 5.17, we can see that answer distributions that have a higher stability score are more “stable”, as reflected by the valid distribution means when a data source is randomly removed. Note that the L2 -stability does not directly assess the change of distribution means. For example, Figure 5.9 suggests that two distributions with large differences may have the same mean. We observed this as a consistent trend in our empirical study. While we can confirm that queries with higher L2 stability scores are more stable, we are not yet able to answer questions like “Will the mean of viable answers shift for more than 10% for a query with score 6.3?” Our suggested use of the stability score is that when multiple queries are to be re-evaluated, the system chooses to update those with lower stability scores. We plan to work on deeper evaluation of similarity 158 0.08 0.08 0.06 0.06 0.04 0.04 0.02 0.02 (a) S1 L2 score=6.5882 (b) S2 L2 score=6.4139 0.08 0.08 0.06 0.06 0.04 0.04 0.02 0.02 (c) S3 L2 score=6.4217 (d) S4 L2 score=6.3204 Figure 5.17: Deviations of empirical means (of sample size 400) in simulations taking out 1 data source: the four deviation figures correspond to the four distributions in Figure 5.16. Numbers indicate the relative changes (0.02 = 2%) from the distribution mean when no data source is removed. 159 scores as future work. Processing overhead of operations We recorded the time spent on the key operations in the workflow. We ran the experiments on a PC with 2.5GHz Intel Core 2 duo CPU, running Matlab version 7.6.0 (R2008a). The algorithms tested are implemented with Matlab. The time reported in Figure 5.18 does not include any networking overhead. The results presented in Figure 5.18 give an idea on the processing overhead of bootstrap re-sampling, performing kernel density estimation (KDE) and searching for high coverage intervals. The time needed for computing the stability scores for both the L2 and BD measures is very short (short enough that running it 200 times completed in less than a millisecond) and can be ignored. From Figure 5.18 Processing time breakdown (repeat 50 times) boostrap sampling Samplee size 800 greedy CIO algorithm 5.2 400 4.04 4.68 200 100 KDE 4.04 4.5 3.7 4.04 4.04 time (seconds) Figure 5.18: Time breakdown of operations, showing the total time needed for repeating each operation 50 times. we can see that KDE dominates the processing overhead for extracting statistics after a set of viable answers are sampled. Recall that in order to use the ensemble method, KDE must be performed on all the bootstrap sample sets; in our case KDE takes about 5 seconds on 50 sample sets. The greedy CIO algorithm is performed using a density function containing 4096 points; thus its running time remains constant when the sample size changes. The bootstrapping time increases with the sample size but takes less than 3/50 = 60ms per run. We estimate the time needed for computing one viable answer to be 200ms; this is pretty optimistic 160 since sampling over a distributed hierarchy usually takes up to several seconds when the networking overhead is considered. Therefore, the operation of taking samples from viable answers will dominate the overall time needed for sampling and extracting statistics (e.g., 80 seconds in sampling and 5 seconds to extract statistics.) This suggests that our technique runs very fast and further optimizations should focus on more efficient aggregate computation. 161 6 CONCLUSION In this thesis, I reported and summarized our work on supporting domain heterogeneous data sources in a SEMantic Integration System (SemIS). We chose the PDMS architecture and investigated two important problems related to integrating relational databases. We proposed efficient and scalable algorithms to optimize the topology of a PDMS semantic overlay network (SON) so that domain heterogeneity does not hurt the query answering effectiveness of a SemIS. Our theoretical analysis and empirical study showed that the acquaintance selection (AcqSel) operation is very necessary for SemISs. While the acquaintance selection (AcqSel) work helps domain heterogeneous data sources to co-exist and collaborate without negatively affecting query answering, the other work reported in this thesis on decomposition aggregation query (DAQ) processing explores new semantic integration opportunities for domain heterogeneous data sources. This new type of query bridges the gap between data objects of different granularity by automatically translating non-aggregate queries on one type of objects into aggregate queries over objects of finer granularity, and most often from other domains. We developed the aggregation rewriting algorithm, which uses a novel 3-role structure and makes use of new meta-information modeled in the decomposition mappings and aggregate bindings. Analysis of the aggregation rewriting algorithm shows that the algorithm supports a wide class of queries and is generally efficient. The semantics of the decomposition aggregation query (DAQ) are also identified as we continue to investigate the query optimiza162 tion process. We proposed a two-phase query answering scheme and optimize the query processing to satisfy different aspects of query answering. Our optimizations for the phase 1 answering guarantees the retrieval of correct answers and at the same time reduces the query processing cost; and the compact DAQ distribution estimating algorithms return high quality statistics that helps the users as well as the SemIS to better interpret the query answers. The aggregation rewriting techniques together with the two phase query answering and optimization techniques form a complete DAQ query processing framework that is practically feasible for implementation. The research on data integration has witnessed more than 20 years of development. While many fundamental problems have received intensive study and many important results are obtained, the world still has not seen a wide adoption of data integration systems and a majority of relational databases are still being manually integrated or not integrated at all. On the other hand, we see the rapid growth of the use of relational databases from large enterprise data warehousing to embedded systems for handholds and other mobile devices. This sharp contradiction suggests the hardness of semantic integration but also lead us to think about the short plank of current research: how can semantic integration models and theories be effectively applied to real applications? My research tries to explore a way that allows generic, feasible and scalable system implementation to evolve from a solution to a specific research problem (the integration of JIIRP data). Therefore, the work presented in this thesis, instead of continuing on the previous trend of investigating query rewriting techniques for more profound query semantics, provides query optimization solutions and techniques that ensures system scalability. Additionally, we worked hard to make our solutions compatible with existing frameworks so that our techniques of supporting DAQ and efficient aggregate query processing can be added to architectures that are designed to work well with standard SPJ queries. Although as part of the research outcome we can only implement a prototype that verifies the central techniques, we hope our work that handles the full query processing flow could accelerate the development of a usable semantic integration system. My study to integrate domain heterogeneous data sources is just a beginning in discovering new challenges in this direction. For short term future research, I 163 wish to investigate components that are required for a system to support DAQs but are not yet discussed in our solutions, which includes the (semi)automatic creation of decomposition mappings and aggregate bindings; dynamic updating DAQ query answers upon source/data updates and cross-aggregate query optimization for continuous query support. For a long term research plan, I believe there is a lot more new semantics to explore for domain heterogeneity. A SemIS should support not only the traditional query types in relational mode but also new types of queries with new semantics and query answering requirements defined exclusively for the semantic integration environment. The DAQ we studied in this thesis already demonstrated the need of special treatment (e.g., the two phase query answering). Also, similar to how we have used a 3-role structure for DAQs, the topology optimization (acquaintance selection) may need support on optimizing certain topology patterns. While each of the above future direction is attractive, I believe the current work has established a good foundation for future expansion. Specifically, the flexible mapping effectiveness measure and acquaintance selection criteria allows the reuse of the existing AcqSel framework for new measures and criteria, and the modulated DAQ processing framework allows individual upgrading of query optimization and query rewriting techniques. 164 Bibliography [1] S. Abiteboul and O. Duschka. Complexity of answering queries using materialized views. In Symposium on Principles of Database Systems (PODS), pages 254–263, 1998. → pages 98 [2] F. N. Afrati and R. Chirkova. Selecting and using views to compute aggregate queries (extended abstract). In ICDT, pages 383–397, 2005. → pages 130 [3] F. N. Afrati and P. G. Kolaitis. Answering aggregate queries in data exchange. In PODS, 2008. → pages 81, 82, 113, 115, 130 [4] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. J. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR(2), 2005. → pages 30 [5] Amazon Public Datasets. Amazon public datasets. http://aws.amazon.com/publicdatasets/. → pages 1 [6] M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad. Scalar aggregation in inconsistent databases. Theoretical Computer Science, 296:405–434, March 2003. ISSN 0304-3975. → pages 115 [7] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The hyperion project: From data integration to data coordination. SIGMOD Record, 32(3):53–58, 2003. → pages 1, 12, 18, 82, 130 [8] E. Balas and M. C. Carrera. A Dynamic Subgradient -Based Branch-and-Bound Procedure for Set Covering. Oper. Research, 44(6), 1996. doi:10.1287/opre.44.6.875. → pages 114, 127 [9] C. Baquero, P. S. Almeida, and R. Menezes. Fast estimation of aggregates in unstructured networks. In on Autonomic and Autonomous Systems, 2009. → pages 82 165 [10] F. Barry Williams. Libraries of data models. http://http://www.databaseanswers.org/data models/. → pages 63 [11] S. Basu Roy, S. Amer-Yahia, A. Chawla, G. Das, and C. Yu. Constructing and exploring composite items. In SIGMOD, pages 843–854, 2010. ISBN 978-1-4503-0032-2. doi:http://doi.acm.org/10.1145/1807167.1807258. → pages 77 [12] J. Beasley and P. Chu. A genetic algorithm for the set covering problem. EJOR, 94(2), 1996. ISSN 0377-2217. doi:DOI:10.1016/0377-2217(95)00159-X. → pages 127 [13] P. Beraldi and A. Ruszczynski. The Probabilistic Set-Covering Problem. Oper. Research, 50(6), 2002. doi:10.1287/opre.50.6.956.345. → pages 114 [14] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distribution. Bulletin of the Calcutta Mathematical Society, 35:99C110, 1943. → pages 113 [15] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. → pages 28, 29, 113 [16] Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. Annual of Statistics, 38(5), 2010. → pages 112, 137 [17] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. ISSN 0885-6125. → pages 137 [18] L. Breiman and L. Breiman. Bagging predictors. In Machine Learning, pages 123–140, 1996. → pages 111, 137 [19] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron. Proximity neighbor selection in tree based structured peer-to-peer overlays. Technical report, Microsoft, 2003. → pages 29 [20] S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. SIGMOD Record, 26:65–74, March 1997. ISSN 0163-5808. → pages 130 [21] V. Cholvi, P. Felber, and E. W. Biersack:. Efficient search in unstructured peer-to-peer networks. In SPAA, 2004. → pages 29 [22] B.-G. Chun, B. Y. Zhao, and J. Kubiatowicz:. Impact of neighbor selection on performance and resilience of structured p2p networks. In IPTPS, 2005. → pages 29 166 [23] C.-W. Chung. Dataplex: an access to heterogeneous distributed databases. Communications of the ACM, 33:70–80, January 1990. ISSN 0001-0782. → pages 130 [24] V. Chvatal. A greedy heuristic for the set-covering problem. MOR, 4(3): 233–235, 1979. doi:10.2307/3689577. → pages 114 [25] Climate Canada. Canada climate data. http://climate.weatheroffice.gc.ca/climateData/canada e.html, 2010. → pages 154 [26] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using views. In PODS, pages 155–166, 1999. → pages 130 [27] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using views. In Symposium on Principles of Database Systems (PODS), pages 155–166, 1999. → pages 124 [28] S. Cohen, W. Nutt, and Y. Sagiv. Rewriting queries with arbitrary aggregation functions using views. TODS, 31(2):672–715, 2006. → pages 19, 20, 81 [29] Considine, Hadjieleftheriou, Li, Byers, and Kollios. Robust approximate aggregation in sensor data management systems. TODS, 34(1), 2009. ISSN 0362-5915. doi:http://doi.acm.org/10.1145/1508857.1508863. → pages 91 [30] J. Considine, F. Li, G. Kollios, and J. Byers. Approximate aggregation techniques for sensor databases. In ICDE, pages 449–460, 2004. → pages 82 [31] C. Cramer and T. Fuhrmann:. Proximity neighbor selection for a dht in wireless multi-hop networks. In Peer-to-Peer Computing, 2005. → pages 29 [32] M. Cygan, L. Kowalik, and M. Wykurz. Exponential-time approximation of weighted set cover. Information Processing Letters, 109(16), 2009. ISSN 0020-0190. doi:DOI:10.1016/j.ipl.2009.05.003. → pages 114 [33] E. Davis. Inference in datalog. http://cs.nyu.edu/faculty/davise/ai/datalog.html. → pages 98 [34] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51:107–113, January 2008. ISSN 0001-0782. → pages 110 167 [35] A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588–599, 2004. ISBN 0-12-088469-0. → pages 114 [36] A. Doan and A. Y. Halevy. Semantic integration research in the database community: A brief survey. AI Magazine, 26(1):83–94, 2005. → pages 24 [37] A. Doan and R. McCann. Building data integration systems: A mass collaboration approach. In IIWeb, pages 183–188, 2003. → pages 130 [38] P. S. Dodds, R. Muhamad, and D. J. Watts. An experimental study of search in global social networks. Science, 301:827–829, 2003. → pages 63 [39] X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB ’07, pages 687–698. VLDB Endowment, 2007. ISBN 978-1-59593-649-3. → pages 33 [40] X. L. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global detection of complex copying relationships between sources. In PVLDB, 2010. → pages 64 [41] S. Dorogovtsev, J. Mendes, and A. Samukhin. Metric structure of random networks. Nucl. Phys. B, 653, 2003. → pages 58 [42] S. Dubnov, R. El-Yaniv, Y. Gdalyahu, E. Schneidman, N. Tishby, and G. Yona. A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles. Machine Learning, 47(1):35–61, 2002. → pages 30 [43] G. Elidan, M. Ninio, N. Friedman, and D. Shuurmans. Data perturbation for escaping local maxima in learning. In AAAI/IAAI, pages 132–139, 2002. → pages 53 [44] R. Fagin, P. G. Kolatis, L. Popa, and W. C. Tan. Composing schema mappings: Second-order dependencies to the rescue. In PODS, pages 83–94, 2004. → pages 26, 37 [45] M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4): 27–33, 2005. → pages 1 [46] Freebase Datasets. Freebase datasets. http://download.freebase.com/datadumps/. → pages 1 168 [47] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. SIGMOD Rec., 25(2), 1996. ISSN 0163-5808. → pages 114 [48] G. Gottlob and A. Nash. Data exchange: computing cores in polynomial time. In PODS, pages 40–49, 2006. → pages 82 [49] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami. Fixed-precision estimation of join selectivity. In PODS, pages 190–201, 1993. → pages 114, 122 [50] P. Haase and R. Siebes. Peer selection in peer-to-peer networks with semantic topologies. In ICSNW’04, pages 108–125, 2004. → pages 29 [51] A. Halevy. Answering queries using views: A survey. VLDB Journal, 10 (4):270–294, 2001. → pages 1, 20, 98 [52] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In VLDB, 2006. → pages 1 [53] A. Y. Halevy, Z. G. Ives, D. Suciu, and I. Tatarinov. Piazza: Data management infrastructure for semantic web applications. In ICDE, pages 505–516, 2003. → pages 1, 11, 82, 130 [54] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics). Springer, 2nd edition, 2009. → pages 112 [55] H. Hoos and T. Sttzle. Stochastic Local Search: Foundations & App. Morgan Kaufmann, 2004. ISBN 1558608729. → pages 125 [56] Y. Ioannidis. The history of histograms (abridged). In VLDB, 2003. → pages 114, 123 [57] Z. Ives, T. Green, G. Karvounarakis, N. Taylor, V. Tannen, P. Talukdar, M. Jacob, and F. Pereira. The orchestra collaborative data sharing system. SIGMOD Record, 37(3):26–32, 2008. → pages 12, 82 [58] H. J.A., M. J.R., J. J., and S. K.D. Dynamic islanding of critical infrastructures, a suitable strategy to survive and mitigate critical events. In CNIP, Rome, 2006. → pages 2 [59] L. W. Jacobs and M. J. Brusco. Note: A local-search heuristic for large set-covering problems. Naval Research Logistics, 42:1129–1140, 1995. → pages 114, 126, 127 169 [60] H. V. Jagadish, V. Poosala, N. Koudas, K. Sevcik, S. Muthukrishnan, and T. Suel. Optimal histograms with quality guarantees. In VLDB, 1998. → pages 122 [61] A. Jagota and L. A. Sanchis. Adaptive, restart, randomized greedy heuristics for maximum clique. Journal of Heuristics, 7(6):565–585, 2001. ISSN 1381-1231. doi:http://dx.doi.org/10.1023/A:1011925109392. → pages 114 [62] N. Jain, D. Kit, P. Mahajan, P. Yalagandula, M. Dahlin, and Y. Zhang. Star: Self-tuning aggregation for scalable monitoring. In IN VLDB, 2007. → pages 114 [63] T. S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA, pages 346–355, 2007. → pages 115 [64] JIIRP. Public safety canada joint infrastructure interdependencies research program. http://www.ece.ubc.ca/∼jiirp/. → pages 2, 105 [65] M. J.R., H. J.A., V. C., and J. J. Design for survival real-time infrastructures coordination. In CNIP, Rome, 2006. → pages 2 [66] K. Katayama, M. Sadamatsu, and H. Narihisa. Iterated k-opt local search for the maximum clique problem. In EvoCOP, pages 84–95, 2007. ISBN 978-3-540-71614-3. → pages 114 [67] A. Kementsietsidis, M. Arenas, and R. Miller. Mapping data in peer-to-peer systems: Semantics and algorithmic issues. In SIGMOD, 2003. → pages 80, 85 [68] I. A. Klampanos and J. M. Jose. An architecture for information retrieval over semi-collaborating peer-to-peer networks. In SAC, 2004. → pages 29 [69] P. G. Kolaitis. Schema mappings, data exchange, and metadata management. In PODS, pages 61–75, 2005. → pages 18 [70] A. Kosowski, M. Malafiejski, and P. Zylinski. On bounded load routings for modeling k-regular connection topologies. In ISAAC, pages 614–623, 2005. → pages 63 [71] E. Levina and P. J. Bickel. The Earth Mover’s Distance is the Mallows distance: Some insights from statistics. In International Conference on Computer Vision, pages 251–256, 2001. → pages 113 170 [72] W.-S. Li and C. Clifton. SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks. DKE, 33(1):49 – 84, 2000. ISSN 0169-023X. doi:DOI:10.1016/S0169-023X(99)00044-0. → pages 87 [73] L. Libkin. Data exchange and incomplete information. In PODS, pages 60–69, 2006. → pages 18, 82 [74] L. Libkin and C. Sirangelo. Data exchange and schema mappings in open and closed worlds. In PODS, 2008. → pages 82 [75] L. Liu. Prototyping and cells modelling of the infrastructures interdependencies simulator i2sim. Master’s thesis, The University of Bristish Columbia, 2007. → pages 2 [76] Z. Liu, K. C. Sia, and J. Cho. Cost-efficient processing of min/max queries over distributed sensors with uncertainty. In ACM SAC, 2005. → pages 82 [77] A. Loser, F. Naumann, W. Siberski, W. Nejdl, and U. Thaden. Semantic overlay clusters within super-peer networks. In DBISP2P, pages 33–47, 2003. → pages 29 [78] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. Tag: a tiny aggregation service for ad-hoc sensor networks. In IN OSDI, 2002. → pages 114 [79] S. Madden, R. Szewczyk, M. J. Franklin, and D. Culler. Supporting aggregate queries over ad-hoc wireless sensor networks. In Workshop on Mobile Computing and Systems Applications, pages 49–58, 2002. → pages 82, 110, 114 [80] J. Madhavan and A. Y. Halevy. Composing mappings among data sources. In VLDB 2003, 2003. → pages 18, 26, 35 [81] H. A. Mahmoud and A. Aboulnaga. Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data, pages 411–422, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0032-2. doi:http://doi.acm.org/10.1145/1807167.1807213. → pages 64 [82] F. Mandreoli, R. Martoglia, S. Sassatelli, P. Tiberio, and W. Penzo. Using semantic mappings for query routing in a pdms environment. In SEBD, 2006. → pages 29 171 [83] A. Manjhi. Tributaries and deltas: Efficient and robust aggregation in sensor network streams. In In SIGMOD, pages 287–298, 2005. → pages 82 [84] C. Martel and V. Nguyen. Analyzing kleinberg’s (and other) small-world models. In PODC ’04, pages 179–188, New York, NY, USA, 2004. ACM. ISBN 1-58113-802-4. doi:http://doi.acm.org/10.1145/1011767.1011794. → pages 63 [85] J. Marti, J. Hollman, C. Ventura, and J. Jatskevich. Dynamic recovery of critical infrastructures: Real-time temporal coordination. International Journal of Critical Infrastructures, 4(1/2), 2008. → pages 2 [86] J. Marti, C. Ventura, J. Hollman, K. Srivastava, and H. Jurez. I2sim modelling and simulation framework for scenario development, training, and real-time decision support of multiple interdependent critical infrastructures during large emergencies. In NATO (OTAN) MSG-060 Symposium on ”How is Modelling and Simulation Meeting the Defence Challenges out to 2015?”, 2008. → pages 2 [87] R. B. Messaoud, O. Boussaid, and S. Rabas´eda. A new olap aggregation based on the ahc technique. In DOLAP, pages 65–72, New York, NY, USA, 2004. ACM. ISBN 1-58113-977-2. → pages 130 [88] T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In A. Gupta, O. Shmueli, and J. Widom, editors, VLDB, pages 122–133. Morgan Kaufmann, 1998. ISBN 1-55860-566-5. → pages 130 [89] R. Murthy and J. Widom. Making aggregation work in uncertain and probabilistic databases. TKDE, 2010. → pages 115, 138 [90] C. H. Ng, K. C. Sia, and C.-H. Chan. Advanced peer clustering and firework query model in the peer-to-peer network. In WWW(Poster), 2003. → pages 29 [91] H. Nottelmann and N. Fuhr. Evaluating different methods of estimating retrieval quality for resource selection. In SIGIR, pages 290–297, 2003. ISBN 1-58113-646-3. → pages 113 [92] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams, 2003. → pages 114 [93] B. C. Ooi, Y. Shu, and K.-L. Tan. Db-enabled peers for managing distributed data. In APWeb, pages 10–21, 2003. → pages 130 172 ¨ [94] P. R. J. Osterg˚ ad. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, 120(1-3):197 – 207, 2002. ISSN 0166-218X. doi:DOI:10.1016/S0166-218X(01)00290-6. → pages 114 [95] P. M. Pardalos and J. Xue. The maximum clique problem. Journal of Global Optimization, 4, 1993. → pages 129 [96] V. T. Paschos. A survey of approximately optimal solutions to some covering and packing problems. ACM Comput. Surv., 29(2):171–209, 1997. ISSN 0360-0300. doi:http://doi.acm.org/10.1145/254180.254190. → pages 113, 114 [97] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29 (1):167–172, 2007. → pages 30 [98] W. Penzo, S. Lodi, F. Mandreoli, R. Martoglia, and S. Sassatelli. Semantic peer, here are the neighbors you want! In EDBT ’08: Proceedings of the 11th international conference on Extending database technology, pages 26–37, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-926-5. doi:http://doi.acm.org/10.1145/1353343.1353351. → pages 30 [99] V. Poosala, V. Ganti, and Y. Ioannidis. Approximate query answering using histograms. IEEE Data Engineering Bulletin, 22:5–14, 1999. → pages 122 [100] R. A. Pottinger and A. Y. Halevy. Minicon: A scalable algorithm for answering queries using views. VLDB Journal, 10(2-3):182–198, 2001. → pages 22, 98 [101] P. Raftopoulou and E. Petrakis. icluster: A self-organizing overlay network for p2p information retrieval. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, and R. White, editors, Advances in Information Retrieval, volume 4956 of Lecture Notes in Computer Science, pages 65–76. Springer Berlin / Heidelberg, 2008. URL http://dx.doi.org/10.1007/978-3-540-78646-7. 10.1007/978-3-540-78646-7. → pages 30 [102] P. Raftopoulou and E. G. Petrakis. A measure for cluster cohesion in semantic overlay networks. In LSDS-IR ’08: Proceeding of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval, pages 59–66, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-254-2. doi:http://doi.acm.org/10.1145/1458469.1458480. → pages 30 173 [103] P. Raftopoulou, E. Petrakis, and C. Tryfonopoulos. Rewiring strategies for semantic overlay networks. Distributed and Parallel Databases, 26: 181–205, 2009. ISSN 0926-8782. URL http://dx.doi.org/10.1007/s10619-009-7046-7. 10.1007/s10619-009-7046-7. → pages 30 [104] M. K. Ramanathan, V. Kalogeraki, and J. Pruyne. Finding good peers in peer-to-peer networks. In IPDPS, 2002. → pages 29 [105] J.-C. R´egin. Solving the maximum clique problem with constraint programming. In CPAIOR, 2003. → pages 114 [106] A. Robles-Kelly and E. R. Hancock. Pairwise clustering with matrix factorisation and the em algorithm. In Conference on Computer Vision, pages 63–77, London, UK, 2002. Springer-Verlag. ISBN 3-540-43744-4. → pages 30 [107] E. Rundensteiner and L. Bic. Evaluating aggregates in possibilistic relational databases. Data Knowl. Eng, 7:239–267, 1992. → pages 115 [108] A. D. Sarma, X. Dong, and A. Y. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD, pages 861–874, 2008. → pages 13 [109] N. Shental, A. Zomet, T. Hertz, and Y. Weiss. Pairwise clustering and graphical models. In Neural Information Processing Systems, 2003. → pages 30 [110] L. Si and Callan. Relevant document dist. estimation method for resource selection. In SIGIR, 2003. → pages 113 [111] L. Sidirourgos, G. Kokkinidis, T. Dalamagas, V. Christophides, and T. Sellis. Indexing views to route queries in a pdms. Distributed and Parallel Databases, 23:45–68, 2008. ISSN 0926-8782. URL http://dx.doi.org/10.1007/s10619-007-7021-0. 10.1007/s10619-007-7021-0. → pages 29 [112] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3): 425–440, 1955. → pages 63 [113] K. Smith, C. Bonaceto, C. Wolf, B. Yost, M. Morse, P. Mork, and D. Burdick. Exploring schema similarity at multiple resolutions. In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data, pages 1179–1182, New York, NY, USA, 2010. ACM. 174 ISBN 978-1-4503-0032-2. doi:http://doi.acm.org/10.1145/1807167.1807313. → pages 64 [114] K. Sripanidkulchai, B. M. Maggs, and H. Zhang:. Efficient content location using interest-based locality in peer-to-peer systems. In INFOCOM 2003, 2003. → pages 29 [115] M. Steyvers and J. B. Tenenbaum. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science: A Multidisciplinary Journal, 29(1):41–78, 2005. → pages 63 [116] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003. → pages 29 [117] I. Tatarinov and A. Y. Halevy. Efficient query reformulation in peer data management systems. In SIGMOD, pages 539–550, 2004. → pages 26 [118] N. E. Taylor and Z. G. Ives. Reliable storage and querying for collaborative data sharing systems. In ICDE, pages 40–51, 2010. → pages 82 [119] C. Tempich, A. Loser, and J. Heizmann:. Community based ranking in peer-to-peer networks. In ODBASE, 2005. → pages 29 [120] D. Tsoumakos and N. Roussopoulos:. Agno: An adaptive group communication scheme for unstructured p2p networks. In Euro-Par, 2005. → pages 29 [121] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley Publishing, 1977. ISBN 0201076160. → pages 72 [122] J. D. Ullman. Principles of Database and Knowledge - Base Systems, volume II, chapter 12, pages 760–766. Computer Science Press, 1988. → pages 21, 94, 98 [123] C. Ventura, K. Thibert, and H. Juarez. Seismic assessment. In JIIRP Industry Symposium, Vancouver, 2007. → pages 3 [124] X. Wang and S. Ju. A set-covering-based approach for overlapping resource selection in distributed information retrieval. WCCSIE, 4, 2009. doi:http://doi.ieeecomputersociety.org/10.1109/CSIE.2009.702. → pages 113 175 [125] Westgrid. Western canada research grid. http://westgrid.ca. → pages 105 [126] Z. Y. Xi Tong, Dalu Zhang. Efficient content location based on interest-cluster in peer-to-peer system. In ICEBE, 2005. → pages 29 [127] J. Xu and R. Pottinger. Optimizing acquaintance selection in a pdms. International Journal Of Cooperative Information Systems, 20(1):39–81, 2011. → pages 24 176
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Supporting domain heterogeneous data sources for semantic...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Supporting domain heterogeneous data sources for semantic integration Xu, Jian 2011
pdf
Page Metadata
Item Metadata
Title | Supporting domain heterogeneous data sources for semantic integration |
Creator |
Xu, Jian |
Publisher | University of British Columbia |
Date Issued | 2011 |
Description | A SEMantic Integration System (SemIS) allows a query over one database to be answered using the knowledge managed in multiple databases in the system. It does so by translating a query across the collaborative databases in which data is autonomously managed in heterogeneous schemas. In this thesis, we investigate the challenges that arise in enabling domain heterogeneous (DH) databases to collaborate in a SemIS. In such a setting, distributed databases modeled as independent data sources are pairwise mapped to form the semantic overlay network (SON) of the SemIS. We study two problems we believe are foremost to allow a SemIS to integrate DH data sources. The first problem tackled in this thesis is to efficiently organize data sources so that query answering is efficient despite the increased level of source heterogeneity. This problem is modeled as an “Acquaintance Selection” problem and our solution helps data sources to choose appropriate acquaintances to create schema mappings with and therefore allows a SemIS to have a single-layered and flexible SON. The second problem tackled in this thesis is to allow aggregate queries to be translated across domain heterogeneous (DH) data sources where objects are usually represented and managed at different granularity. We focus our study on relational databases and propose novel techniques that allow a (non-aggregate) query to be answered by aggregations over objects at a finer granularity. The new query answering framework, named “decomposition aggregation query (DAQ)” processing, integrates data sources holding information in different domains and different granularity. New challenges are identified and tackled in a systematic way. We studied query optimizations for DAQ to provide efficient and scalable query processing. The solutions for both problems are evaluated empirically using real-life data and synthetic data sets. The empirical studies verified our theoretical claims and showed the feasibility, applicability (for real-life applications) and scalability of the techniques and solutions. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-08-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-ShareAlike 3.0 Unported |
DOI | 10.14288/1.0052113 |
URI | http://hdl.handle.net/2429/36583 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2011-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-sa/3.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2011_fall_xu_jian.pdf [ 1.21MB ]
- Metadata
- JSON: 24-1.0052113.json
- JSON-LD: 24-1.0052113-ld.json
- RDF/XML (Pretty): 24-1.0052113-rdf.xml
- RDF/JSON: 24-1.0052113-rdf.json
- Turtle: 24-1.0052113-turtle.txt
- N-Triples: 24-1.0052113-rdf-ntriples.txt
- Original Record: 24-1.0052113-source.json
- Full Text
- 24-1.0052113-fulltext.txt
- Citation
- 24-1.0052113.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0052113/manifest