Data Coordination

by

Michael Lawrence

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

February 2013

© Michael Lawrence 2012

Abstract

Data coordination is the problem of updating a contingent database C as a result of changes to a database B on which it depends. For example, a general contractor's construction cost estimate (C) needs to be updated in response to changes made by an architect to a building design (B). Although these two databases are related in a very specific way, they contain information about fundamentally different types of objects: the cost estimate is composed of items which represent work results, and the building design is composed of physical objects. Motivated by scenarios such as design-cost coordination, we propose an approach to coordinating data between autonomous, heterogeneous sources which have a base-contingent relationship. We propose the use of declarative mappings to express exact relationships between the two. Using materialized views to maintain state, we give an overall approach for coordinating sets of updates from B to C through view differencing and view update translation. We adopt ideas from data exchange and incomplete information to generate the set of all possible updates which satisfy a mapping. We propose methods for assisting a user (C's administrator) in choosing amongst the possible updates, and experimentally evaluate these methods, as well as the overall benefit of semiautomatic data coordination, in a usability study. We then discuss practical challenges in applying our general techniques to the domain of architecture, engineering and construction, by interviewing practitioners and analyzing data from two construction projects.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 Data Coordination
  1.2 Cost Estimating in Architecture, Engineering and Construction
  1.3 Goals of This Thesis
  1.4 Thesis Outline
2 Data Coordination
  2.1 Motivating Example
  2.2 Data Coordination
  2.3 Our Approach
  2.4 Related Work
3 View Differencing
  3.1 Related Work
  3.2 Materialize and Compare
  3.3 Incremental View Maintenance
  3.4 Experimental Evaluation
    3.4.1 Instances
    3.4.2 Methods
    3.4.3 Results
4 Update Translation
  4.1 Related Work
  4.2 Insertions
    4.2.1 Chase Step
    4.2.2 Constrain Step
  4.3 Deletions
  4.4 Combining Insertions and Deletions
  4.5 Experimental Evaluation
5 Update Completion
  5.1 Assistance Strategies
  5.2 A Data Coordination User Study
    5.2.1 Experimental Prototype
    5.2.2 Experimental Instance
    5.2.3 Experimental Procedure
  5.3 Experimental Results
6 A Case Study in AEC
  6.1 AEC Data
  6.2 Current Practice: On-Screen Takeoff
  6.3 A BIM Solution: Innovaya
  6.4 Mapping Types
  6.5 A System for AEC
7 Summary and Future Research
  7.1 Summary
  7.2 Future Research
    7.2.1 Arithmetic Constraints
    7.2.2 Multiple Mappings
    7.2.3 Multiple Base Sources
    7.2.4 Using Constraints
Bibliography
Appendices
A Subset of UBCO XML Data for Case Study
B Datasets for Usability Study
C Introductory Script for Usability Study
  C.1 About this Study
  C.2 Introduction: Cost Estimating at Foo Construction
  C.3 The Building Database 'B'
  C.4 The Cost Estimate Database 'C'
  C.5 Familiarization Tasks
    C.5.1 Layer Dependencies
    C.5.2 Project Costs
  C.6 B-C Mappings
    C.6.1 Mapping 1: Concrete Blocks
    C.6.2 Mapping 2: Doors
  C.7 SuperCoordinator: A Data Coordination System
    C.7.1 Overview
    C.7.2 Example Coordination Tasks
  C.8 Completing This Study
  C.9 Appendix
    C.9.1 Building database 'B' documentation
    C.9.2 Cost estimate database 'C' documentation
D Usability Study Update Cases

List of Tables

2.1 Example building design B
2.2 Example cost estimate C
2.3 The result of evaluating qB(B), which is equal to the result of evaluating qC(C).
2.4 Example building design B′ after applying the update from Example 2.2.
2.5 The result of evaluating qB(B′), which differs from the result of evaluating qC(C) as shown in Table 2.3.
5.1 Initial view corresponding to Mapping 1 from Figure 5.8. It consists of a single aggregate value. We use the column names from query qB,1.
5.2 Initial view corresponding to Mapping 2 from Figure 5.9. It consists of a set of door types and the number of doors of each type. We use the column names from query qB,2.
5.3 Results of an independent samples t-test comparing the mean update completion time of Assistance Levels NONE and LOW for each update case.
5.4 Results of an independent samples t-test comparing the mean update completion time of Assistance Levels LOW and HIGH for each update case.
6.1 A snippet of the UBCO project Cost Plan #2 showing items relating to metal railings.
6.2 A snippet of the UBCO project Cost Plan #2 showing items relating to concrete footings.
6.3 A snippet of the UBCO project Cost Plan #2 showing items for wood doors.
6.4 A snippet of the UBCO project Cost Plan #1 showing lump sum items for millwork.

List of Figures

1.1 Coordinating two data sources B and C. Over time, B is updated to B′, and C must be updated to C′.
1.2 How standards proliferate (from xkcd.com, by Randall Munroe)
2.1 The data coordination scenario between B and C. The top row shows instances of relational schema B at different times, while the bottom row shows instances of C at different times. The initial instance C is coordinated with the instance B. At some point B is updated to state B′, so C must be updated to a state C′ which is consistent with B′. Eventually, B′ is updated to become B″, so C′ must be updated again. This may continue with additional instances over time.
2.2 Data coordination solution overview. We first find the difference Q±B between the materialized view QB defined by qB on instance B and the materialized view Q′B defined by qB on instance B′ (1 - view differencing). We then use the result to find all possible C± (2 - update translation), and apply one of them (3 - update completion) to C to obtain the final solution.
3.1 TPC-H Query Q16
3.2 Performance of view differencing on the simple instance.
3.3 Performance of view differencing on the chain instance.
3.4 Performance of view differencing as a function of size for TPC-H.
3.5 Execution time for view differencing as a function of update size for TPC-H.
4.1 TPC-H query Q2
4.2 Performance of update translation as a function of the number of predicates (k) in qC.
4.3 Performance of update translation as a function of the size (N) of C.
4.4 Performance of update translation as a function of update size (n).
5.1 A screen shot of the B Tables screen of the data coordination prototype. Each table in the current instance of B is shown in a separate tab.
5.2 A screen shot of the Mapped Data screen of the data coordination prototype. Here, a mapping is shown side-by-side, with each side displaying the SQL query and resulting view instance.
5.3 A screen shot of the C Tables screen of the data coordination prototype. In this case, two tables of C are shown side-by-side. The area below is used to display the update translations.
5.4 A screen shot of the C Tables screen of the data coordination prototype. A pair of inserted tuples are shown, highlighted in green. At the bottom of the screen (I), a table shows the set of labeled nulls along with the values currently chosen. As values are entered, they are displayed in the tuples which appear in the tables (II).
5.5 A screen shot of the C Tables screen of the data coordination prototype, with four deletion sets shown at the bottom of the screen. The second deletion set is selected, causing the corresponding tuples to be highlighted in the tables above.
5.6 The building design B schema.
5.7 The cost estimate C schema.
5.8 Mapping 1 between B and C.
5.9 Mapping 2 between B and C.
5.10 The time taken for subjects to complete each update case, based on assistance level. Each bar is centered on the mean, and has height of twice the standard deviation. The whiskers indicate the minimum and maximum values.
5.11 The average time per update case spent on each screen of the prototype, for each assistance level.
6.1 A small portion of the transformed XML extracted from an Autodesk Revit model for a subset of the UBCO building design.
6.2 Screen shot of On-Screen Takeoff conditions used to takeoff footing concrete from a structural drawing.
6.3 Quantities calculated automatically from the visual conditions shown in Figure 6.2.
6.4 Using Innovaya to map a managed quantity (corresponding to a particular type of interior partition wall) to a Timberline assembly. Here, the length and height attributes are mapped, as indicated by the Mapping column in the top-right section of the screen.
6.5 An IDEF0 diagram illustrating our approach for updating cost estimates using the help of mappings.
Activities A0 through A2 are performed once for each new building to be estimated, and activities B0, C0 and C1 are performed each time the building design is updated.
C.1 A floor plan of the building designed by Bar Architects
C.2 The schema for Bar Architects' building database 'B'
C.3 The schema for Foo Construction's cost estimate database 'C'

Acknowledgments

I would like to thank a number of people who have helped immensely during the course of this research, both directly and indirectly. My supervisor Rachel Pottinger has provided invaluable guidance during this process. Looking back, I easily took for granted the great commitment made by Rachel in agreeing to supervise my PhD. Rachel has done more than help me focus on the important research questions; she has provided care, support, and humour during many tumultuous years of study. I thank my committee members, Sheryl Staub-French and Raymond Ng. Sheryl's practitioner perspective was critical during the course of this work; she helped me understand the processes, challenges, and research results in Civil Engineering. Raymond helped keep my work grounded in database fundamentals, and helped determine the key questions from a historical database perspective. I would like to thank my master's advisor, Andrew Rau-Chaplin at Dalhousie University, who originally encouraged me to consider a PhD at the University of British Columbia, and saw me through a successful application there. Andrew also inspired my interest in the area of databases and taught me invaluable research skills. The final draft of this dissertation was much improved based on feedback from my university examiners. Wang-Chiew Tan, Giuseppe Carenini, Alan Russell, and Nancy Heckman have all given excellent suggestions for improving its presentation and content.
I am very grateful to them. I have had many fruitful interactions with other students, professors and postdoctoral researchers in the Data Management and Mining Lab at the University of British Columbia. Jian Xu, Doug Bateman and I had many inspired discussions about relevant papers in our field; Jian helped me understand the importance of domain heterogeneity. Solmaz Kolahi and Laks V.S. Lakshmanan provided instrumental feedback on my early material. Many other labmates took part in critiquing my talks and sharing ideas: Hongrae Lee, Min Xie, Andrew Carbonetto, April Webster, Dibesh Shakya, Amit Goyal, Ali Moosavi and Jamila Salari, to name a few. The data coordination prototype described in Chapter 5 was due in large part to the programming abilities of Joyce Zhu. Joyce was a pleasure to work with and understood my vision (however vague) with ease. Thanks to Kellogg Booth and Melanie Tory for their suggestions on conducting a usability study. Thanks also to the many students who participated in pilot runs of that prototype: Melsa Smith, Alexandru Totolici, Pirooz Chubak, Arianne Dee, Simona Radu and Yun Lou; all of you provided critical suggestions and feedback. I am extremely grateful to the students and practitioners of Civil Engineering for patiently explaining features of their domain and processes. Madhav Nepal has been extremely helpful, as have Norman Peterson, Omar Sharif, and Jon Perkins. The CIRS project group as well as Stuart Olson Construction were also extremely generous in providing project data, building designs and cost estimates. After my first year of study, I was very fortunate to have an opportunity to intern as a Software Engineer at Google. I would like to thank Jon Nowitz for taking me in, helping break many of my poor programming habits, and teaching me how to design, write and test good software. The following year Carl Hamilton and Dale Woodford did the same.
I am thankful to Peter Wood, who acted as an industry mentor near the end of my study, and helped me understand what the world is like outside of academia, and how to prepare for it. The administrative staff at the Department of Computer Science have been a pleasure to work with. I am grateful for the help and guidance of Holly Kwan, Lara Hall, Michele Ng, Joyce Poon, and Colleen Diamond in navigating the murky waters of university administrative procedures. My wife, Jennifer Lee, deserves tremendous credit for her patience during the course of this degree. From the perspective of someone who had been in a stable career for a decade, entering a relationship with a student surely must have seemed like a step backwards. Jennifer and her parents Ted and Linda have been extremely generous and supportive, and perhaps most importantly made sure I made time for fun and family, and that I maintained a healthy work-life balance. I owe a great amount to my parents Kerry and Victoria, and my sister Kimberley. They encouraged and supported me unconditionally in pursuing interests in sports, music, and academics. My parents gave me the desire to succeed, and taught me to think for myself; they had confidence in me, and never made me feel like there were expectations to be met. Thank you.

Dedication

To my soon-to-be-born first child. May you love your pursuits, and pursue what you love.

Chapter 1

Introduction

1.1 Data Coordination

There is an increasing need to share, combine and otherwise exchange information from different sources and applications. This is due to many factors, such as public open data initiatives, data journalism, data-driven decision making in business, the use of computer aided design (CAD) in industrial processes, and the increasing sharing of information online. In many cases there are dependencies between the real-world objects represented by these data sources.
For example, a real estate listings database may have a boolean attribute expressing whether or not there are transit routes near a given property. This information is publicly available on the local transit authority's website. Ideally, the real estate listings would be up to date with the latest transit route information, but given the size of each of these sources, coordinating them manually would be practically impossible. The problem of data coordination in general occurs when the contents of one data source depend on other, related yet independently maintained data sources. The administrator of this contingent source needs to ensure it is up to date and consistent with the latest data provided by these base sources. This is illustrated in Figure 1.1, which shows that when a base source B (e.g. the transit authority) is updated, resulting in B′, the contingent source C (e.g. the real estate database) may need to be updated as well. The problem of data coordination is similar to synchronization of database replicas as studied in distributed databases [4]. The difference is that in database replication, all sites have the same schema and data items are considered atomic. As such, questions of how the modification of a data item at one site manifests at another are largely trivial. The distinctive challenge in data coordination stems from the heterogeneous structure of, and sophisticated mappings between, different sources. Data coordination problems typically occur between two sources that are domain heterogeneous in that they store data about different real-world things that are related in some way. Our earlier example described coordination between information about bus routes stored in a transit authority's database, and the "transit nearby" attribute of listings in a real estate database.
These two sources are domain heterogeneous because the types of entities in one database (bus routes) are different from the types of entities in the other database (real estate listings). As another example, a ferry service operator may need to coordinate its schedule information with data from a marine forecast website; or a national department of finance may need to coordinate economic indicators with data provided by local statistics gathering bodies.

Figure 1.1: Coordinating two data sources B and C. Over time, B is updated to B′, and C must be updated to C′.

Recent research in sharing heterogeneous data sources primarily focuses on two problems: (1) data integration, where the goal is to provide a unified view of data sources with different schemas, and (2) data exchange, which translates data structured by one schema to that of another. The fundamental difference is that data coordination involves updating an existing data source due to changes in, and relationships to, an independently maintained data source. This presents problems distinct from those commonly found in data integration and exchange because of the need to map updates, and the effect of those updates on the global state of the coordinated instances.

1.2 Cost Estimating in Architecture, Engineering and Construction

This thesis is motivated by our work with practitioners in the domain of Architecture, Engineering and Construction (AEC). The design, construction and operation of buildings is a large and collaborative task, involving people from many different organizations, with different goals and expertise.
The building owners care about the functional requirements of the building and its cost; the architects produce a design to satisfy the owner's functional requirements; engineering consultants from various disciplines work with the architect to produce designs for specific building systems such as structural, electrical, and mechanical; a general contractor must estimate the cost of constructing the building, is responsible for its construction, and typically hires sub-contractors to construct specific building features. The efforts of these groups are dependent on one another, and they must frequently exchange information to ensure they are adequately coordinated. Minimizing conflicts and inconsistencies is critical to ensuring a building project is completed on time and on budget, and to the owner's satisfaction. Traditionally, coordination in AEC has been document-based, by faxing letters, drawings and specifications between groups that need to share information. Building Information Modeling (BIM) is an increasingly popular approach for "the digital representation of the physical and functional characteristics of a facility and also the process of creating, using, and maintaining such a shared knowledge resource as a tool for decision making throughout the lifecycle of a facility" [15]. Colloquially, BIM is also used as a noun referring to such a shared repository of building information. The idea is that a single BIM containing all information relevant to a building project will be used and shared between the project's various collaborators and stakeholders. A significant academic and industrial effort has been underway to define standards for BIM and produce software with BIM capabilities. While some progress has been made in the design domains (architecture and engineering), adoption of BIM has been slow due to the diversity of stakeholders and resistance to changing traditional work processes.
We have extensively studied two large building projects at the University of British Columbia (UBC) with various levels of BIM adoption. The Centre for Interactive Research on Sustainability (CIRS) project is a multidisciplinary building incorporating cutting-edge ideas in sustainable design. The design process made extensive use of electronic information sharing to minimize paper use, including a shared document repository and BIM-based design. The UBC Okanagan (UBCO) Arts and Sciences Building is a 72,000 sq. ft., $21M building which includes a lecture theatre, office space, wet and dry laboratories, and animal care facilities. For the CIRS project we have obtained a copy of the shared project repository, consisting of specifications, design plans, meeting minutes, cost estimates and so on. For the UBCO project, we have obtained two detailed cost estimates (CP1 and CP2), dated 16 weeks apart, as well as the architectural and structural drawings corresponding to CP2. Upon surveying this data and interviewing project participants, we identified that one of their major challenges is updating the building's cost estimate in response to changes in the design. An accurate construction cost estimate is critical for making decisions during a building project's design phase. In these and other projects on which these participants worked, the building design underwent multiple revisions before construction started. The designs for each revision are issued without any indication of what has changed, resulting in a lot of manual work reviewing the new designs looking for changes, and determining impacts on the cost estimate. Unfortunately, BIM is not well utilized for cost estimating. Most state-of-the-art tools support the task of extracting material quantities from a building design. We believe that a contributing factor is domain heterogeneity: the building design consists of components, which are objects composing the building (e.g.
cabinets, columns, doors, stairways); the cost estimate consists of items representing work results, which are defined in [22] as "permanent or temporary aspects of construction projects achieved in the production stage by subsequent alteration, maintenance, or demolition processes, through the application of a particular skill or trade to construction resources" (e.g. "build concrete forms for structural columns"). Some studies [46, 47] point to the lack of software interoperability and standards, and focus their efforts on developing such standards. Although we believe standards are an important mechanism for the exchange of information, and we support the idea of expanding BIM to include design details sufficient for cost estimating, our approach is guided by a skeptical attitude towards the success of large standards in supporting complicated tasks for a broad variety of participants, as facetiously illustrated in Figure 1.2. Venugopal and Eastman criticize standardization efforts in AEC as being overly general, resulting in substantial redundancy and ambiguity in how they are implemented [16, 57]. Kraus et al. point out that this ambiguity permits inconsistency in how BIM is used across different design firms [40]. Hartmann et al. argue that the loosely coupled structure of the construction industry limits the success of top-down technology pushing, and that a "pull" approach that aligns existing BIM tools with current work practices is more likely to be successful [31]. From examining cost estimates and speaking with cost estimators, we learned that there is little standardization of cost data, other than a rough categorization of items. Cost data is proprietary, and approaches to cost estimating vary. As such, we need an approach which layers on top of existing work practices.
Figure 1.2: How standards proliferate (from xkcd.com, by Randall Munroe)

1.3 Goals of This Thesis

Our main goal for this thesis is to address the problem of coordinating heterogeneous databases in a base-contingent relationship like that of the design-estimate problem. Specifically, we consider the problem from the standpoint of the administrator of a contingent database, C (e.g. the cost estimator), whose data depends on an externally managed base database, B (e.g. the building architect). Our aim is an approach which will allow the administrator of C to specify precise relationships to B, and have a system which automatically or semiautomatically determines when B is updated and how C should be updated in response. We aim to be able to represent complex relationships such as aggregation without imposing too high a burden on the administrator of C. Our approach is aimed towards data coordination in situations like the cost estimating domain, and is characterized by the following features:

• There is a base-contingent relationship between B and C, in that B affects C but not vice versa.
• Both B and C are autonomous and may be updated independently at any time. This is a characteristic of sources which are maintained by independent organizations (e.g. the architects/engineers and general contractor).
• The relationships between the two sources are complex, involving aggregations (one is not simply a differently structured version of the other), and may call for exact matches.
• The sources in general may describe different but related real-world objects (e.g. the building architect thinks in terms of physical building components, while the contractor thinks in terms of materials, procedures, and equipment), and hence semantic integration may not be possible.
• The parties involved are unwilling to use a centralized system that disrupts their existing work practices.
• The coordination is implemented entirely by C (e.g.
the contractor); we do not assume that B (e.g. the building architect) actively participates in the process beyond providing access to its data. This is relevant for coordination in two types of scenarios: where B is indifferent to C’s goal of coordination, and where B’s data is managed by some proprietary system. In the latter case, B’s data may only be available by export to a relational-like format, and we cannot rely on the normal faculties possessed by a relational database management system. This is the case in building design/cost estimate coordination, and also when B is data publicly available on the web.

❼ The process needs to be repeated and refined as new mappings are discovered over time.

1.4 Thesis Outline Chapter 2 gives an overview of our approach to data coordination. We propose a mapping formalism, bidirectional dependencies (bdds), which are a variant of the source-to-target tuple generating dependencies (s-t tgds) used in data exchange. These are declarative mappings which permit the relationships between B and C to be expressed compactly. Using bdds, we give a formal statement of the data coordination problem, and describe an approach which breaks it into two stages: view differencing and view update translation. In Chapter 3 we propose an adaptation of incremental view maintenance to the view differencing problem. We perform an experimental evaluation which compares its benefits to a brute-force symmetric difference approach under various circumstances. Chapter 4 addresses the update translation problem. This problem is related to data exchange, but poses additional challenges: our mappings are exact (as opposed to subset inclusions), there is a base-contingent relationship between B and C, and we require a human to resolve ambiguities while respecting these exact mappings. We translate inserted and deleted tuples separately. The result of update translation is a set of incomplete updates.
Section 4.2 describes our approach to translating insertions, where ambiguity arises in the form of labeled nulls, as in past work on incomplete information. We propose novel structures and an optimized algorithm for deriving a formula on these labeled nulls which respects the exact mapping constraint. Section 4.3 describes our approach to translating deletions. We describe our approach for combining the insertion and deletion translations in Section 4.4, and perform an experimental evaluation in Section 4.5. Chapter 5 explores the problem of what a human user does with the incomplete update translations, and how they are applied to C. We describe a data coordination prototype and a user study which aims to determine the benefits (if any) of semiautomatic data coordination, and under what conditions. We also implement and experiment with some additional techniques to assist the user in disambiguating the incomplete update translations. Chapter 6 revisits the cost estimating problem in the AEC domain. We survey state-of-the-art methods and recent research on using BIM for cost estimating. We describe the benefits of our general approach to data coordination as described in Chapters 2 through 5, and how to implement it in the context of real-world cost estimating projects. In Chapter 7 we give conclusions and directions for future work.

Chapter 2 Data Coordination

2.1 Motivating Example Throughout this chapter we use the following example to illustrate our version of data coordination. Example 2.1 A Building Designer’s database B consists of relations Component and Material. An instance of B at time t is shown in Table 2.1. A Cost Estimator’s database C consists of relations ProjectItems and ItemRates; the Cost Estimator has created an instance C, shown in Table 2.2, for the building design B.
The Cost Estimator has also created a declarative mapping constraint using the Structured Query Language (SQL); this mapping is qB = qC , where qB is:

SELECT Material.name, Component.area AS qty
FROM Component JOIN Material ON Material.cid = Component.id;

and qC is:

SELECT ItemRates.description AS name, ProjectItems.qty
FROM ProjectItems JOIN ItemRates ON ProjectItems.code = ItemRates.code;

expressing the relationship between the walls in the building design and the cost items pertaining to walls in the cost estimate. The data of Tables 2.1 and 2.2 satisfy this mapping because evaluating qB on B yields the same result as evaluating qC on C (Table 2.3).

(a) Component
id  type    length  height  area
0   Column  1       1       1
1   Wall    10      3       27
2   Wall    4       3       12

(b) Material
cid  name            thickness
1    light concrete  300
1    drywall         15
2    heavy concrete  200

Table 2.1: Example building design B

(a) ProjectItems
code  qty
12    27
14    12
02    27

(b) ItemRates
code  desc            rate
12    light concrete  25.00
14    heavy concrete  35.00
02    drywall         3.50
03    190mm studs     12.00

Table 2.2: Example cost estimate C

name            qty
light concrete  27
heavy concrete  12
drywall         27

Table 2.3: The result of evaluating qB(B), which is equal to the result of evaluating qC(C).

2.2 Data Coordination We begin with some necessary definitions, which are then used to define the data coordination problem addressed by our work. A database schema A consists of a sequence of relation symbols, A1 . . . Ak, each having a sequence of attribute names. A database instance A over A is a sequence of relations of the same length and arities as A. We will often use relation symbols and relation instances interchangeably, and distinguish different instances of the same schema using superscripts such as A′. A tuple u in Ai maps each attribute name in Ai to a value in some domain const of constants.
We use both the algebraic notation u ∈ Ai, and the logic notation Ai(u[X]), where Ai is a predicate symbol, X is the sequence of Ai’s attribute names and u[X] is the sequence of values which results from applying u to each element of X. Similarly, we use u[X, Y] for the sequence of values which results from applying u to the concatenation of X and Y. We consider conjunctive queries without arithmetic comparisons. A query q over A is a statement of the form:

Q(X) :− ψ1(X1, Y1), . . . , ψk(Xk, Yk)

where each predicate ψi refers to some relation in A. Each of the Xi, Yi are lists of variables, where the concatenation Xi, Yi has the same length as the number of attributes in the relation to which ψi refers. The X variables are called the distinguished variables, and it is the case that X = X1 ∪ . . . ∪ Xk. The Y variables, which do not intersect with the X variables, are called the free variables. The conjunction on the right-hand side of the query is called its body, and Q is the head; it is a new predicate defined to be true for all values of X satisfying the body, where the Y variables are existentially quantified. We may also use a single symbol in place of a conjunction, as in

Q(X) :− ψ(X, Y)

with Y = Y1 ∪ . . . ∪ Yk. For any tuple u which satisfies Q, we call u′ a witness to u if it is an extension of u onto the Y variables that satisfies ψ(X, Y). Every u satisfying Q has at least one witness, and may have more, each having different Y values. We also use algebraic notation, where q(A) denotes the set of u such that Q(u[X]) is satisfied. Similarly to relation instances, we also use the symbol Q to refer to a materialized view (a set of tuples), with superscripts to denote different instances of a view with the same definition. We will use notation such as u ∈ Q or u ∈ q(A) to indicate that a tuple belongs to a view or is the answer to a query, respectively.
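To make the notation concrete, the following is a minimal sketch (the function and variable names are ours, not from the thesis) of evaluating a conjunctive query over relations stored as sets of tuples: it enumerates witnesses as full variable bindings and projects them onto the distinguished variables. It is run here on qB of Example 2.1 written as a conjunctive query.

```python
from itertools import product

def evaluate(atoms, distinguished, db):
    """Evaluate Q(X) :- psi_1(...), ..., psi_k(...).

    atoms: list of (relation name, variable list) pairs; db maps relation
    names to sets of tuples. Returns the answer set q(A) and the list of
    witnesses (bindings of all X and Y variables satisfying the body)."""
    witnesses = []
    for rows in product(*(db[rel] for rel, _ in atoms)):
        binding, ok = {}, True
        for (rel, vars_), row in zip(atoms, rows):
            for v, val in zip(vars_, row):
                if binding.setdefault(v, val) != val:
                    ok = False  # two atoms bind the same variable inconsistently
        if ok:
            witnesses.append(binding)
    answers = {tuple(w[v] for v in distinguished) for w in witnesses}
    return answers, witnesses

# q_B of Example 2.1: QB(x2, x1) :- Component(z1,z2,z3,z4,x1), Material(z1,x2,z5)
db = {"Component": {(0, "Column", 1, 1, 1), (1, "Wall", 10, 3, 27), (2, "Wall", 4, 3, 12)},
      "Material": {(1, "light concrete", 300), (1, "drywall", 15), (2, "heavy concrete", 200)}}
atoms = [("Component", ["z1", "z2", "z3", "z4", "x1"]),
         ("Material", ["z1", "x2", "z5"])]
answers, witnesses = evaluate(atoms, ["x2", "x1"], db)
assert answers == {("light concrete", 27), ("drywall", 27), ("heavy concrete", 12)}
assert len(witnesses) == 3  # here each answer tuple has exactly one witness
```

The repeated variable z1 across the two atoms enforces the join condition Material.cid = Component.id of the SQL formulation.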
We propose declarative mapping constraints similar to both the s-t tgds used in data exchange and the Global and Local as View (GLAV) mappings used in data integration [21]. The difference is that our mapping constraints specify implication in both directions, i.e. an equality relationship between views rather than an inclusion relationship. We call these bidirectional dependencies (bdds).

Definition 2.1 (bidirectional dependency). A bidirectional dependency (bdd) between schemas B and C is a formula ψ(X, Z) ↔ φ(X, Y) with ψ and φ being conjunctions over the relations of B and C respectively. The variables in X are universally quantified and the variables in Y and Z are existentially quantified. By convention we define queries qB and qC as

QB(X) :− ψ(X, Z)
QC(X) :− φ(X, Y)

If we let σ be a bdd that equates queries qB and qC, we say that σ is satisfied with respect to instances B and C if qB(B) = qC(C). The bdd equivalent of the mapping from Example 2.1 is

Component(z1, z2, z3, z4, x1), Material(z1, x2, z5) ↔ ProjectItems(y1, x1), ItemRates(y1, x2, y2)

where z1, z2, z3 and z4 represent values for a component’s id, type, length and height respectively; z5 refers to the thickness of some material used in that component. These z variables are not relevant to the right-hand side of this mapping. The variables x1 and x2, which represent that component’s area and material name, are also present in the right half of the mapping, which has its own variables (y1 and y2) for the values not relevant to the left-hand side. For simplicity of exposition here we consider conjunctive queries without arithmetic comparison. Our methods as presented in Chapter 3 discuss using a broader class of queries for qB. In Section 7.2.1 we discuss the possibility of future work extending the query class of qC to include arithmetic comparisons. The problem of creating mappings is a topic of future research.
Although there has been plenty of work on the creation of schema mappings [53], it is unlikely that the precisely engineered mappings needed in the applications we consider will be created automatically. We leave this responsibility to a team which contains both a domain expert (e.g. the Cost Estimator) and someone knowledgeable about databases and query languages (this is discussed more in Chapter 6). There are tools such as Clio [48] or ++Spicy [44] which may be of assistance here. Most of the core problems addressed in this thesis assume that a correct set of mappings is given. A data coordination scenario is a triple (B, C, σ) where B and C are relational schemas and σ is a single bdd between B and C. Unlike the data exchange problem as defined in [17], we do not consider integrity constraints on either B or C. Data coordination involves updates as follows:

Definition 2.2 (Update) An update A± to a database instance A consists of a sequence of pairs of inserted and deleted tuples (A+1, A−1), . . . , (A+k, A−k), one pair for each relation of A. The result of applying A± to A is an instance A′, where for each i

A′i = (Ai ∪ A+i) − A−i

We assume that ∀i A−i ⊆ Ai (i.e. that we cannot delete a tuple which is not present in the database) and that ∀i A+i ∩ Ai = ∅ (i.e. that we cannot insert a tuple which is already present in the database). As discussed in the introduction, our focus is on data coordination scenarios where B and C are autonomous. For this reason we do not assume that B pushes its changes to C; instead, C must monitor B to detect when it has changed. The monitoring policy depends on the freshness requirements of C’s administrator. The central problem addressed by our system occurs each time a change is detected in B. We define the data coordination problem as follows:

Definition 2.3 (Data Coordination Problem) Given

1. Database instance B′ which is the result of applying some (unknown) update B± to some (unknown) instance B

2.
Database instance C

3. A mapping constraint σ such that B, C satisfy σ

Find all possible updates C± (if any) such that, when applied to C, the result is a C′ that satisfies σ with B′.

This is illustrated in Figure 2.1, where B is updated resulting in B′, and our goal is to find C′. Later on, B′ may be updated to B′′, and we desire C′′, and so on. Our focus is on solving a single iteration of the data coordination problem. This is because updating C so that σ is satisfied brings us back to the initial state from which we began. If B is updated multiple times before C is coordinated with it, these updates can be composed into a single set of updates. We illustrate data coordination by continuing Example 2.1:

Figure 2.1: The data coordination scenario between B and C. The top row shows instances of relational schema B at different times, while the bottom row shows instances of C at different times. The initial instance C is coordinated with the instance B. At some point B is updated to state B′, so C must be updated to a state C′ which is consistent with B′. Eventually, B′ is updated to become B′′, so C′ must again be updated. This may continue with additional instances over time.

Example 2.2 At time t0, the data stored in B (the architect’s building data) is shown in Table 2.1, while the data stored in C is shown in Table 2.2. The mapping σ is satisfied since both qB(B) and qC(C) are equal to the result shown in Table 2.3. Now, at time t1 the building architect has updated the building design by changing the type of concrete used in Wall 1 from light to medium. This is specified as the update

Material+(1, medium concrete, 250)
Material−(1, light concrete, 300)

The updated building design is shown in Table 2.4, and the resulting view is shown in Table 2.5. The mapping of qB and qC in Example 2.1 is no longer satisfied with respect to B′ and C, because the view resulting from evaluating qC on C (Table 2.3) differs from Table 2.5.
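The example can be traced end to end in code. The following sketch (using sqlite3; table contents are those of Tables 2.1 and 2.2, and the update is the Material± of Example 2.2) checks that σ holds at time t0 and is violated after the architect’s change:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Component (id INT, type TEXT, length INT, height INT, area INT);
CREATE TABLE Material (cid INT, name TEXT, thickness INT);
CREATE TABLE ProjectItems (code TEXT, qty INT);
CREATE TABLE ItemRates (code TEXT, description TEXT, rate REAL);
INSERT INTO Component VALUES (0,'Column',1,1,1),(1,'Wall',10,3,27),(2,'Wall',4,3,12);
INSERT INTO Material VALUES (1,'light concrete',300),(1,'drywall',15),(2,'heavy concrete',200);
INSERT INTO ProjectItems VALUES ('12',27),('14',12),('02',27);
INSERT INTO ItemRates VALUES ('12','light concrete',25.0),('14','heavy concrete',35.0),
                             ('02','drywall',3.5),('03','190mm studs',12.0);
""")
q_B = """SELECT Material.name, Component.area AS qty
         FROM Component JOIN Material ON Material.cid = Component.id"""
q_C = """SELECT ItemRates.description AS name, ProjectItems.qty
         FROM ProjectItems JOIN ItemRates ON ProjectItems.code = ItemRates.code"""

# At time t0 the mapping sigma is satisfied: both views equal Table 2.3.
assert set(conn.execute(q_B)) == set(conn.execute(q_C))

# The architect's update Material± from Example 2.2, applied to B.
conn.execute("DELETE FROM Material WHERE cid=1 AND name='light concrete'")
conn.execute("INSERT INTO Material VALUES (1,'medium concrete',250)")

# Now q_B(B') differs from q_C(C): sigma is violated, so C must be coordinated.
assert set(conn.execute(q_B)) != set(conn.execute(q_C))
```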
(a) Component
id  type    length  height  area
0   Column  1       1       1
1   Wall    10      3       27
2   Wall    4       3       12

(b) Material
cid  name             thickness
1    medium concrete  250
1    drywall          15
2    heavy concrete   200

Table 2.4: Example building design B′ after applying the update from Example 2.2.

name             qty
medium concrete  27
heavy concrete   12
drywall          27

Table 2.5: The result of evaluating qB(B′), which differs from the result of evaluating qC(C) as shown in Table 2.3.

Given Material±, it would be fairly clear that updating the cost estimate requires deleting a tuple with a desc value of “light concrete” and inserting one with a value of “medium concrete”. In this case, the challenge remains that we may not know exactly which tuple should be deleted (e.g. if desc is not a key of the ItemRates table), and we also do not know which values to use for the unmapped attributes (namely, code and rate) of the inserted tuple. However, because B and C are autonomous, in the worst case we do not even know how B has changed (i.e. Material± is unknown); we only have access to B′. Nor can we attempt to determine how B has changed: the only way to do so would be to replicate the entire contents of B at C, which would be infeasible under most circumstances.

2.3 Our Approach In our approach to data coordination, the mapping σ captures the subset of B’s data which is relevant to C. We implement coordination by storing the materialized view QB on B. The contingent source C is coordinated with B when QB is identical to the materialized view QC on C. Hence, if QB is unavailable, it can be recomputed from C. With respect to the data coordination problem defined above, at time t + 1 the contingent source C will have a materialized view QB on B from the time t when it was last coordinated, thus providing a record of the exact changes to B which are pertinent to C. We use QB to break data coordination into the following three stages (Figure 2.2):

1.
(View Differencing) Given QB and B′, find updates Q±B to QB such that the result of applying them yields Q′B.

2. (Update Translation) Given Q±B, find all C± such that the result of applying C± to C satisfies σ with B′.

3. (Update Completion) Given the set of all possible updates C±, a user (i.e. C’s administrator) chooses the correct or most appropriate updates, based on their understanding of the data, and these are applied to C.

Following Example 2.2, the Cost Estimator would store the view QB as shown in Table 2.3. At time t1, when B has been updated, we would find the following solution Q±B to view differencing:

Q+B(medium concrete, 27)   (2.1)
Q−B(light concrete, 27)   (2.2)

Figure 2.2: Data coordination solution overview. We first find the difference Q±B between the materialized view QB defined by qB on instance B and the materialized view Q′B defined by qB on instance B′ (1 - view differencing). We then use the result to find all possible C± (2 - update translation), and apply one of them (3 - update completion) to C to obtain the final solution.

In the second step (update translation), we would translate Q±B into updates to ProjectItems and ItemRates:

ProjectItems+(x, 27)
ItemRates+(x, medium concrete, y)
ProjectItems−(12, 27)

And in the third step (update completion), the cost estimator would determine that “medium concrete” is a new item and assign a new code value for x (say, 13) and a rate value for y (say, 35.00). In Chapter 3 we apply a well-known incremental view maintenance algorithm to view differencing, and experimentally compare this with a brute-force approach which simply materializes Q′B and takes the symmetric difference with QB. Update translation (Chapter 4) is the more challenging stage and represents the bulk of our technical contributions. The challenge with update translation is ambiguity: there may be many possible translations C± for a given Q±B.
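The first two stages can be sketched on the running example (helper names are ours): stage 1 recovers Q±B as the two sides of a symmetric difference, and stage 2’s ambiguity becomes visible when candidate values for the unknown code x are tested against the exact-equality constraint.

```python
def qC(project_items, item_rates):
    # The join view of Example 2.1 over the Cost Estimator's relations.
    return {(desc, qty) for (code, qty) in project_items
            for (c2, desc, rate) in item_rates if code == c2}

QB_old = {("light concrete", 27), ("heavy concrete", 12), ("drywall", 27)}
QB_new = {("medium concrete", 27), ("heavy concrete", 12), ("drywall", 27)}

# Stage 1 (view differencing), cf. (2.1) and (2.2).
QB_plus, QB_minus = QB_new - QB_old, QB_old - QB_new
assert QB_plus == {("medium concrete", 27)}
assert QB_minus == {("light concrete", 27)}

# Stage 2 (update translation): try candidate values for the unknown code x.
item_rates = {("12", "light concrete", 25.00), ("14", "heavy concrete", 35.00),
              ("02", "drywall", 3.50), ("03", "190mm studs", 12.00)}
ok = []
for x in ["03", "13"]:
    pi = {("14", 12), ("02", 27), (x, 27)}          # ProjectItems after C± applied
    ir = item_rates | {(x, "medium concrete", 35.00)}
    if qC(pi, ir) == QB_new:                         # exact equality required by the bdd
        ok.append(x)
# x = "03" is rejected: it would make the view contain ("190mm studs", 27).
assert ok == ["13"]
```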
This challenge is compounded by the bdd specification of exact equality: how do we find a translation which avoids side-effects? For example, choosing x = 03 makes the (false) assertion that the view contains (190mm studs, 27). Chapter 5 proposes some methods to assist in update completion, and performs a usability study with a prototype data coordination system.

2.4 Related Work While detailed discussions of research relevant to specific topics are provided in the chapters in which they are discussed, this section gives a brief overview of some research whose overall goal is similar. Hyperion [6] coordinates data amongst autonomous peers through event-condition-action (ECA) rules, which dictate how events at one source trigger actions at other sources. This approach is complementary to our approach of using declarative mapping constraints. The benefits of ECA rules are that they allow the relationships and coordination behaviour to be controlled with precision, they support powerful update semantics, and actions can be unambiguously selected for a given event. An example used in [6] expresses that when inserting a tuple into a flight booking database, if the flight is oversold (i.e. the new booking causes the relation size to exceed the flight’s capacity), then a new flight should be scheduled with the same source and destination containing the overbooked passengers. Mappings such as this, which depend on the size of the relation as well as the ordering of its tuples, are not possible in our approach. We liken the ECA rule approach to an imperative language, where the mappings specify the exact operations to be performed. Other relationships may be more difficult to express using ECA rules.
Following the example in Section 2.1, if the quantity of a ProjectItem should be equal to the sum of all Component areas, then an ECA rule would be needed for each of the insertion, deletion, and update actions on the Component table; the body of each rule would need to calculate the delta of the sum in order to apply it to the ProjectItems table. By contrast, only a single declarative mapping relating the sum to the quantity is needed (assuming aggregate queries are allowed for qB, a possibility discussed in Section 3.2 and exemplified in Section 5.2.2). As emphasized in [37], using declarative mappings allows the mapping designer to focus on the question of what the relationship between B and C is, leaving the much harder question of how that relationship is maintained as B changes to be answered automatically by a data coordination system. In addition, declarative mappings permit greater autonomy, since using ECA rules would necessitate the ability to monitor B for updates and detect when an update meets an event definition’s criteria. More recent work on Hyperion [33, 34, 45] focuses on coordination with mapping tables, which are used to describe pairs of associated values between two data sources. Hyperion allows either open- or closed-world semantics to be used with each mapping table, which provides a way to specify the completeness of the knowledge about which values are mapped. By combining multiple mapping tables into formulas, Hyperion provides a powerful means for specifying pair-wise associations between a tuple from each source. Our approach is complementary, since we focus on coordinating data sources which are related by more general views (i.e. involving conjunctions), and our mapping language specifies strict equality (which is comparable to a closed-world approach). Youtopia [38] is a collaborative data integration system between a set of peers, where each peer has its own set of relations.
Mappings in the form of s-t tgds are used to express relationships between the peers’ data. Coordination of updates amongst peers is done incrementally; each inserted or deleted tuple is translated individually into inserted or deleted tuples in mapped peers’ relations. These translations may then result in additional insertions/deletions on transitively mapped peers. Each time there is an ambiguous translation, the process blocks waiting for a user to specify how it should be resolved. The update translation problem in Youtopia is similar to our data coordination problem as discussed in Chapter 4. A key difference is that our methods operate on sets of inserted and deleted tuples, which is useful because the autonomous nature of B and C means that C may only see updates from B in batch form. A new instance of B may be given without insight into the individual edits that resulted in that instance. ORCHESTRA [24, 25] is a collaborative data sharing system (CDSS), which differs from peer-to-peer data integration in that querying is local and updates are exchanged between neighboring peers. It is motivated by the life sciences community, where there is a great need to share data, but a lack of agreement on the instance and sometimes the schema of this data. ORCHESTRA tolerates differing instances of the same relation by allowing each peer to selectively reject tuples inserted by its peers. Mappings between peers in ORCHESTRA are represented using a set of s-t tgds, which is later extended to bdds in [32].
The update exchange problem is formulated as a Datalog program by 1) partitioning each relation into four relations representing local tuples, rejected peer tuples, incoming peer tuples, and outgoing peer tuples; 2) augmenting the set of mappings with constraints that disallow rejected tuples in a peer’s local instance; and 3) transforming each mapping into a Datalog rule whose head is the mapping’s consequent and whose body is its antecedent, with Skolem functions used for existentially quantified variables. For example, a mapping

A(x1, z1) → B(y1, x1)

would be transformed into the Datalog rule

B(f(x1), x1) :− A(x1, z1)

where f is a Skolem function used in place of the (unknown) value y1. The least fixpoint of this transformed Datalog program is a solution to the ORCHESTRA coordination problem. In order to ensure deleted tuples are translated, information (provenance) about how each tuple is derived is used to check for derivations of a given target tuple from non-rejected source tuples. When a tuple is derived via a rule whose body involves a conjunction, they rely on mapping annotations (created manually) to specify which of the conjuncts should be deleted. For example, the asterisk in

P(x1, x2), Q∗(x2, x3) → R(x1, x2, x3)

specifies that a deleted tuple from R should cause a deletion from Q.

The main difference of our work from the ORCHESTRA project is the base-contingent relationship between B and C. The inability to change B imposes an additional constraint on our problem, while at the same time permitting a broader class of queries for qB (which we discuss in the following chapter). Also, our goal is to arrive at a final and unambiguous solution for C.
A universal solution containing labeled nulls or Skolem functions is appropriate if the desired functionality is to find the certain answers to queries on C, but in order to allow constant values to be assigned to these variables, additional effort must be taken to ensure that the chosen values do not cause further mapping violations. We also permit a deleted tuple to be translated as a deletion from any conjunct in the mapping which generated it. This approach is computationally more expensive but also permits more flexible mapping interpretation. In [20], Foster et al. define a language for mapping between tree-structured databases (e.g. XML) to allow the propagation of updates. As in our approach, their language specifies views as a means of encapsulating commonalities between these databases. They introduce the concept of lenses, which are similar to our queries qB and qC, except each lens must include a function for its inverse. Their language is aimed towards the projection and rearrangement operations commonly used in the instances they study, and unlike our approach does not consider lenses involving joins.

Chapter 3 View Differencing

3.1 Related Work View differencing is very similar to incremental view maintenance [28], which finds an updated view given changes to the base relations. A small difference, which we address in Section 3.3, is that our goal is to determine the updates to the view (rather than the updated view itself). A larger difference is that we consider data coordination in scenarios where B is external to and autonomous from C. Past incremental view maintenance approaches are well suited to scenarios where the view to be maintained is managed by the same system as the underlying data. As such, the requirement of being able to access the old and new base instances is reasonable, whereas in some data coordination scenarios it may not be possible.
Initial approaches to incremental view maintenance such as [26, 27] used the idea of algebraic differencing: deriving closed-form expressions by using equivalence-preserving transformation operations to simplify the relational algebra expressions

qB(B ∪ B+) − qB(B)   (Insertions)
qB(B) − qB(B − B−)   (Deletions)

where B+ and B− are the sets of insertions and deletions respectively. The main advantage of algebraic differencing is that it can be performed using the updates and either the old base data (B) or the new base data (B′), but does not require both. Two counting algorithms [12, 29] were developed independently and simultaneously, and improved upon algebraic differencing in that they can also be applied to views containing some forms of union, negation, aggregation, and linear recursion. These counting algorithms require both the old and new base data, and the set of updates. Despite the past focus on developing algorithms for incremental view maintenance, the scope of the benefits of incremental view maintenance has largely gone unexplored. In [59], the authors experimentally analyze the number of bytes transferred and I/O operations performed when incrementally maintaining a set of data warehouse views defined over remote databases. In [13], the authors find the counting algorithm to be more efficient than view recomputation when the number of updates is small, for a simple select-project-join view over two base relations.

3.2 Materialize and Compare Our first approach to view differencing is to materialize Q′B by evaluating qB on B′, and to generate Q±B by directly comparing Q′B with QB. We call this Materialize and Compare (MAC). MAC takes as inputs the updated instance B′ and the materialized QB. Because our system stores QB, it will already have QB available for use when coordination needs to happen. MAC computes Q±B by taking the symmetric difference of QB and Q′B. This can be done by sorting and scanning.
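The sort-and-scan computation just described can be sketched as follows (a Python sketch with our own names; the thesis does not prescribe an implementation language). Both views are sorted under any total order on tuples, and a single simultaneous scan emits tuples present on only one side:

```python
def mac_diff(QB_old, QB_new):
    """Symmetric difference of two materialized views by sort and scan."""
    a, b = sorted(QB_old), sorted(QB_new)
    i = j = 0
    deleted, inserted = [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # tuple unchanged: advance both scans
            i += 1; j += 1
        elif a[i] < b[j]:         # present only in the old view
            deleted.append(a[i]); i += 1
        else:                     # present only in the new view
            inserted.append(b[j]); j += 1
    deleted.extend(a[i:])         # leftovers exist only in the old view
    inserted.extend(b[j:])        # leftovers exist only in the new view
    return inserted, deleted

ins, dels = mac_diff({("light concrete", 27), ("heavy concrete", 12), ("drywall", 27)},
                     {("medium concrete", 27), ("heavy concrete", 12), ("drywall", 27)})
assert set(ins) == {("medium concrete", 27)}
assert set(dels) == {("light concrete", 27)}
```

The scan itself is linear; the O(n log n) term in the complexity below comes from the sorting step.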
Any total ordering on the view tuples will suffice for sorting. If the mapping designer knows how the data in B is stored and how qB is evaluated, a comparison operator may be chosen such that sorting does not require moving around a significant number of tuples. Given n = max(|QB|, |Q′B|), the complexity is

O(n |qB| + n log n)

where |qB| is the number of relations in qB. This analysis makes no assumptions about the manner in which relations are joined in qB, nor about the presence of indexes for fast join computation. A lower complexity bound is possible under some circumstances. Following Example 2.1, we would materialize Q′B, resulting in

Q′B(medium concrete, 27)
Q′B(heavy concrete, 12)
Q′B(drywall, 27)

Comparing this with QB as given in Table 2.3 leads us to the solution for Q±B given in (2.1) and (2.2). MAC for view differencing has several benefits:

1. Since we are simply evaluating qB, there is no limit to the types of queries which can be used. This allows for mappings expressing very sophisticated relationships, e.g. aggregation, conditionals, negation, etc. As will be shown in Chapter 6, these features are important.

2. Minimal support is required from source B. As stated in Chapter 1, one goal is to support autonomous sources. Since C is the source which is interested in coordination, it is imperative that coordination be done with minimal requirements on B. For the MAC approach, we only require B to evaluate queries on its current data instance.

3. There is no cost to obtaining QB, since a copy is kept by our system from the last coordination point. Another way of looking at it is that the cost of materializing QB is amortized with the (Bt−1, B) data coordination instance, and the cost of materializing Q′B is amortized with the (B, B′) data coordination instance.

3.3 Incremental View Maintenance Our second approach to view differencing is based on an application of incremental view maintenance (IVM) techniques; hence we call it IVM for View Differencing (IVM-VD).
IVM updates a materialized view given changes to the base relations, without recomputing that view from scratch. This problem has received a significant amount of attention [28] due to the utility of materialized views in a variety of applications. The underlying assumption is that it is cheaper to maintain views incrementally than to recompute them, which is called the heuristic of inertia in Chapter 11 of [28]:

    In most cases it is wasteful to maintain a view by recomputing it from scratch. Often it is cheaper to use the heuristic of inertia (only a part of the view changes in response to changes in the base relations) and thus compute only the changes in the view to update its materialization.

We stress that the above is only a heuristic. While this is likely the case if the changes are small compared to the size of the view, to our knowledge the boundaries of this heuristic have not been tested in the literature. We consider this a contribution of our work. Incremental view maintenance as applied to view differencing in data coordination raises a few challenges. First, in addition to the new base relations (B′) and the existing materialized view (QB), IVM also requires access to the old base relations (B) and the updates B±. Second, the goal of incremental view maintenance is to produce the updated view Q′B, rather than isolating the updates themselves. While these could be obtained by performing the symmetric difference, as in the MAC approach, we give a method of isolating the updates during the computation of Q′B. The third challenge is in implementing incremental view maintenance externally from the system which is managing B (since C is responsible for ensuring its data is coordinated with B). In this section, we describe an adaptation of the counting algorithm of [29] for view differencing. The counting algorithm associates with each tuple t in the view a count of the number of distinct ways in which t can be derived.
For example, if our view is a projection onto some set X of attributes, then the count of a tuple t would be the number of tuples in the base relation having the value t for the attributes in X. We use count(t) to denote this count; it can be negative, in the case that the view is defined as a query involving negation, or when evaluating the view on deleted tuples (more on this below). An additive union operator ⊎ is defined over relations of counted tuples. Specifically, for relations R and S, let count_R(t) and count_S(t) denote the count of t in R and S respectively (which is 0 if t is not present). Then the result of R ⊎ S has count(t) = count_R(t) + count_S(t). The updated view Q′B is found by performing the additive union of QB with the result of 2k delta rules, where k is the number of base relations accessed by qB. Each delta rule replaces one of these base relations with the set of inserted/deleted tuples into/from that relation. Specifically, for a conjunctive query

QB(X) :− ψ1(X1, Z1), . . . , ψk(Xk, Zk)

the delta rules are D+1 . . . D+k and D−1 . . . D−k. The i-th delta rule uses the updated relations for each predicate preceding i, the insertions (or deletions) for the i-th predicate, and the original relations for each predicate following i. Formally,

D+i(X) :− ψ′1(X1, Z1), . . . , ψ′i−1(Xi−1, Zi−1), ψ+i(Xi, Zi), ψi+1(Xi+1, Zi+1), . . . , ψk(Xk, Zk)

(and correspondingly for D−i, with ψ−i in place of ψ+i). We illustrate by borrowing the following example from [29].

Example 3.1 A binary relation link stores pairs of URLs where one has a link to the other.
A view hop has been defined as pairs of URLs connected by a path of length two, as follows:

hop(x1, x2) :− link(x1, y), link(y, x2) (3.1)

The delta rules for updating hop, given changes to link, are

D1+(x1, x2) :− link+(x1, y), link(y, x2)
D1−(x1, x2) :− link−(x1, y), link(y, x2)
D2+(x1, x2) :− link′(x1, y), link+(y, x2)
D2−(x1, x2) :− link′(x1, y), link−(y, x2)

The updated view is computed by performing the additive union of the existing view with the results of all 2k delta rules:

Q′B = QB ⊎ (⊎ i=1..k (Di+ ⊎ Di−)) (3.2)

Thus far we have required very little of the base source B throughout the coordination process: only that it is able to respond to structured queries on its current database instance. The rationale is that, since the direction of dependency is from B to the contingent source C, ensuring consistency is solely C's concern, and hence C's responsibility. In order to apply this approach to view differencing, we must assume that B provides a mechanism to uniformly query previous versions of its data, along with the tuples inserted into and deleted from its data. We also need a mechanism for source C to obtain tuple counts, an implementation of counted tuple sets that can efficiently perform insertion, deletion and additive union, and a mechanism for extracting the final set Q±B of updates during the view maintenance process.

The authors of [29] note that in a system using bag (duplicate) semantics, duplicates will be internally represented by B with multiple copies of a tuple or with tuple counts. Unfortunately, relying on this internal representation is not an option in an autonomous setting. Our solution is for C to rewrite each of the delta rules as an aggregate query which groups tuples on all of their attributes and counts the number of tuples within each group. Using SQL views, this rewriting is simple and incurs little additional cost for B.
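To make the delta rules and the ⊎ operator concrete, the hop example can be replayed in a few lines of Python. This is only an illustrative toy sketch, with counted relations represented as Counter maps over a simplified link instance using single-letter URLs; it is not the implementation evaluated in Section 3.4:

```python
from collections import Counter

def join_counted(r, s):
    # Join two counted binary relations on r's second / s's first column.
    # Counts multiply, so each result tuple's count is its number of derivations.
    out = Counter()
    for (a, b), cr in r.items():
        for (c, d), cs in s.items():
            if b == c:
                out[(a, d)] += cr * cs
    return Counter({t: c for t, c in out.items() if c != 0})

def additive_union(*rels):
    # The additive union operator: counts add, and tuples whose counts sum to 0 vanish.
    out = Counter()
    for r in rels:
        for t, c in r.items():
            out[t] += c
    return Counter({t: c for t, c in out.items() if c != 0})

link     = Counter({("a", "b"): 1, ("b", "c"): 1})
link_ins = Counter({("c", "d"): 1})       # link+
link_del = Counter({("a", "b"): -1})      # link- (deletions carry count -1)
link_new = additive_union(link, link_ins, link_del)   # link'

deltas = [
    join_counted(link_ins, link),      # D1+ : link+(x1,y), link(y,x2)
    join_counted(link_del, link),      # D1- : link-(x1,y), link(y,x2)
    join_counted(link_new, link_ins),  # D2+ : link'(x1,y), link+(y,x2)
    join_counted(link_new, link_del),  # D2- : link'(x1,y), link-(y,x2)
]
hop = join_counted(link, link)                      # old view
hop_new = additive_union(hop, *deltas)              # equation (3.2)
assert hop_new == join_counted(link_new, link_new)  # matches full recomputation
print(dict(hop_new))  # -> {('b', 'd'): 1}
```

The final assertion checks the point of the counting algorithm: the additive union of the old view with the delta rule results coincides with recomputing the view from the updated base relations.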
Recall the query qB from Example 2.1:

SELECT Material.name, Component.area AS qty
FROM Component JOIN Material ON Material.cid = Component.id;

This query would be rewritten as:

SELECT Material.name, Component.area AS qty, COUNT(*)
FROM Component JOIN Material ON Material.cid = Component.id
GROUP BY Material.name, Component.area;

The second challenge is how to store relations with tuple counts so that the ⊎ operator can be implemented efficiently. Our solution is to store (t, count(t)) pairs in a balanced binary search tree indexed by t, where a lexical comparison determines the ordering of tuples. The algorithm for ⊎ performs a simultaneous in-order scan of its operands, adding the counts of each tuple and discarding those which sum to 0. We track the resulting insertions and deletions and perform them after completing the scan. Since in-order traversal of a balanced binary search tree can be performed in linear time, the overall complexity is linear in the size of the operands, with additional terms of O(n+ log n) and O(n− log n) to perform n+ insertions and n− deletions to the tree (which we expect to be small numbers). This is consistent with the O(n log n) complexity noted in [29].

Lastly, the counting algorithm accumulates updates as a set of tuples with counts: the part of (3.2) in parentheses. Let us denote this set as D±. The final step is to perform QB ⊎ D±. However, the results required by view differencing are sets of inserted and deleted tuples. For example, if D± contains a tuple t with count(t) = 2, then t is not necessarily a member of Q+B, since it would also need to be absent from QB (count 0) or present with a count of −1. We compute Q+B and Q−B during the evaluation of QB ⊎ D± by examining the transition of each tuple's count value. This can be done with a constant amount of overhead.
Specifically, a tuple t is an insertion if:

• t is in D± with some positive count c1, and either
• t is not in QB, or
• t is in QB with a negative count which is smaller in magnitude than c1.

Similarly, a tuple t is a deletion if:

• t is in D± with some negative count c1, and
• t is in QB with some positive count c2, and
• c1 is equal to or larger in magnitude than c2.

Consider the following example.

Example 3.2 A view QB (with counts) consists of the following set of tuples

QB = {(t1, 1), (t2, −1), (t3, 2), (t5, −1), (t7, 1), (t8, 3)}

while D± consists of the following set of tuples/counts

D± = {(t1, −1), (t4, 1), (t5, 2), (t6, −2), (t7, 2), (t8, −2)}

The end result is

Q+B = {t4, t5}    Q−B = {t1}

The tuple t4 is inserted because it appears in D± with a positive count and is not in QB. The tuple t5 is inserted because it appears in D± with a positive count (2) greater in magnitude than the negative count of t5 (−1) as it appears in QB. Note that t7 is not inserted because it is already present in QB with a positive count. The tuple t1 is deleted because it appears in D± with a negative count equal in magnitude to its positive count in QB. The tuple t6, however, is not deleted because it does not appear in QB, and t8 is not deleted because its count has a greater magnitude in QB.

3.4 Experimental Evaluation

We experimentally evaluated both MAC and IVM-VD in order to assess their performance in a variety of circumstances. Our main goals were to determine whether both are feasible under reasonable conditions, and to which features they are most sensitive. Our experiments therefore measured the total execution times of both approaches while varying the amount of data, the update size, the schema, and the view definitions. IVM-VD should be more efficient than MAC when the updates are small and recomputing the view from scratch is expensive.
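The transition rules of Section 3.3 can be checked against Example 3.2 with a short Python sketch. A plain dictionary stands in for the balanced-tree representation, and the tuple names t1, . . . t8 are opaque keys; this is a toy verification, not the C++ implementation used in our experiments:

```python
from collections import Counter

def merge_and_extract(qb, delta):
    # Compute QB' = QB (+) D-plus-minus while classifying each tuple by the
    # transition of its count: nonpositive -> positive is an insertion,
    # positive -> nonpositive is a deletion.
    inserted, deleted = set(), set()
    merged = Counter(qb)
    for t, c in delta.items():
        old = merged.get(t, 0)
        new = old + c
        if old <= 0 and new > 0:
            inserted.add(t)
        elif old > 0 and new <= 0:
            deleted.add(t)
        if new == 0:
            merged.pop(t, None)   # counts summing to 0 are discarded
        else:
            merged[t] = new
    return merged, inserted, deleted

# Example 3.2:
QB = {"t1": 1, "t2": -1, "t3": 2, "t5": -1, "t7": 1, "t8": 3}
D  = {"t1": -1, "t4": 1, "t5": 2, "t6": -2, "t7": 2, "t8": -2}
merged, ins, dels = merge_and_extract(QB, D)
print(sorted(ins))   # -> ['t4', 't5']
print(sorted(dels))  # -> ['t1']
```

The output matches the example: t4 and t5 are insertions, t1 is the sole deletion, and t6, t7 and t8 change counts without crossing zero in the relevant direction.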
However, there are several possibly limiting factors. IVM-VD makes a trade-off between the number of queries evaluated and the cost of each evaluation: it should be more efficient to evaluate a large number of delta rules, each involving a term (Bi+ or Bi−) significantly smaller than the base relations. Of course, the performance advantage gained by this trade-off depends on the number of relations in the query qB and the number of inserted/deleted tuples for each of these relations. We explore the scope of this trade-off in depth in the following section.

3.4.1 Instances

We evaluated our algorithms on three test instances for qB: 1) a simple, single-relation query, 2) a chain-structured query for varying the number of relations in the conjunction, and 3) an instance using the TPC-H schema and data generator [56].

Simple: Simple contains a single relation B(id, a, b). id is a key; a is an integral value on the half-open interval [0, 100); and b is a double precision floating point value. We used the Simple instance to obtain baseline performance for a very simple case, measuring the effect of view size and update size on each method. Our view definition was

QB(b) :− B(id, a, b) ∧ a < c

where c is a constant between 0 and 100 which defines the view size as a percentage of the database size. We vary c in one of our experiments to examine performance as view size varies. Since b values are randomly distributed, MAC was forced to sort QB after materialization; naturally, IVM-VD was not required to sort. Data was generated uniformly.

Chain: Recall from Equation (3.2) that for a query with k predicates, IVM-VD evaluates 2k delta rules and performs 2k additive union operations. Our second test instance varied the number of relations in the schema and view definition to determine how well each method scales on progressively more complex queries. A chain query is of the form

Q(x) :− B1(x, id2), B2(id2, id3), . . .
Bk−1(idk−1, 0), Bk(0)

where x is a unique key for B1 and idi is a unique key for Bi. As such, there is a many-to-one relationship between each pair Bi and Bi+1, and the data we generate forms a forest with each tree rooted at a tuple of Bk and the leaves formed from tuples of B1. The view on the k-relation chain schema contains the ids of all tuples in B1 that are joined to the tuple in Bk with idk = 0.

TPC-H: The TPC-H schema is a benchmark decision support schema [56]; it consists of eight relations used to track the part ordering system of a retail database. We used the TPC-H schema to evaluate performance under a realistic database scenario. In our experiments we used three different queries: Q2, Q5 and Q16. The results were similar for all three, so we show only the results for Q16, which is given in Figure 3.1. We modified this query by replacing subqueries with finite value sets and removing aggregations, which preserves the original purpose of the query while meeting the limitations imposed on views maintainable by the counting algorithm. Our data was generated with the TPC-H data generator, which uses a uniform distribution.

SELECT P_BRAND, P_TYPE, P_SIZE, PS_SUPPKEY
FROM PARTSUPP JOIN PART ON P_PARTKEY=PS_PARTKEY
WHERE P_BRAND <> 'Brand#4'
AND P_TYPE NOT LIKE 'SMALL%'
AND P_SIZE IN (1,6,11,16,21,26,31,36,41,46,50)
AND PS_SUPPKEY NOT IN (1, 100, 200, 1000, 4000, 4800, 5200, 6400, 8800, 9300, 9700, 9999);

Figure 3.1: TPC-H Query Q16

3.4.2 Methods

For the Simple instance, we varied the view size as a percentage of the data set size, as well as the size of the updates to the relation. To vary the view size, we first generated an instance of B; we then performed 30 independent trials, where each trial generated a set of insertions B+ and deletions B− by randomly selecting equally sized, non-intersecting subsets of B. These insertions and deletions were applied to B, resulting in instances B and B′, on which each algorithm was evaluated.
For the experiments varying the update size, 30 independent trials were performed for each update size. Our plots show the mean and standard deviation.

For the Chain test instance, we varied the number of relations k, and generated a single instance of each size-k schema. The instance for a k-relation schema contained 2,000,000 tuples in B1, and 2^(k−i+1) tuples for each relation instance Bi with i ≠ 1. The foreign key references were "striped", so that each tuple of Bi (i > 2) had exactly 2 tuples in Bi−1 referring to it, and the tuples in B1 referred uniformly to the tuples in B2. Updates were generated for relations B1 to Bk−3 inclusive, and were fixed in size as a percentage of the size of each of those relations. As in the Simple test instance, we used 30 independent trials. Our choice of methods for generating data and updates was motivated by a desire to fix (1) the size of the updates as a proportion of the total data generated, and (2) the size of the view, while varying both the number of relations in the database and the number of relations which are updated. We found that this data and update generation scheme resulted in a slight decrease in view size as the number of relations increased, and we account for this decrease in the analysis of our results.

For the TPC-H instance, we conducted experiments varying the database size (and hence the view size) and varying the size of updates, similarly to the single-relation instance. For each data set size or update size, we performed 10 trials using independently generated updates. We used the TPC-H data generator, which uniformly generates data for all eight relations in proportion to a size factor; this size factor roughly corresponds to the database's size in GB, at approximately 8,660,000 tuples per unit of size factor. Our implementations of both approaches were written in C++, using the MySQL++ library and a MySQL database engine, version 5.0.
The implementation and engine both ran on the same host, an Intel Xeon 2.93GHz system with 64GB of RAM, running OpenSUSE Linux. Upon generating instances B, B′ and updates B±, each test performed the following, for both methods:

1. Materialize QB by executing qB
2. Force the server to flush its caches and close its tables
3. Compute Q±B and Q′B

We measured the total time of step (3). Step (2) was intended to minimize the effects of caching and to allow performance to be measured from a relatively cold state. We note that in the case of IVM-VD, step (1) involved executing the rewritten aggregate query described in Section 3.3 to obtain tuple counts.

3.4.3 Results

We begin with a general summary of our findings. We found both MAC and IVM-VD to offer acceptable performance under realistic conditions. MAC appears to be preferable to IVM-VD when qB contains more than a few joins and the size of the updates is greater than 2.5% of the database instance. If the update size is small, or if only a few of the relations in qB are updated, then IVM-VD is preferable. IVM-VD's execution time grows exponentially with the number of joined relations in qB, but only if most of these relations are updated. MAC is fairly resilient to changes in the problem parameters; its performance depends mostly on the sizes of QB and Q′B. Both algorithms appear to be linear in the size of the instance.

We also point out that the performance of IVM-VD depends on the cost of obtaining tuple counts. Since our implementation used a database which does not support tuple counts, we were required to rewrite each delta rule as an aggregate query. The cost of executing the rewritten queries depends on whether the variables X uniquely determine the Y variables in ψ, and on whether there is an index on X. For our Chain instance it is the case that X → Y, but the database implementation we used did not appear to take advantage of that fact.
Our first experiment varied c (the size of the view as a percentage of the data set size) on the Simple instance (Figure 3.2). As can be seen, both methods perform reasonably well, with IVM-VD slightly outperforming MAC. In this case, the sizes of both insertions and deletions were fixed at 5% of the size of the unified data set (B ∪ B′), which is 2M tuples. We also found that the sorting step occupied 75% of MAC's execution time, meaning that MAC could be made significantly faster if QB is already sorted; this would be the case if the view includes any attribute for which there is an index. Our tests on TPC-H highlight the performance advantage of sorting.

Figure 3.2: Performance of view differencing on the simple instance (execution time in seconds vs. view size as a percentage of relation size).

Our tests on the Simple instance address in part the question of performance under fairly simplistic circumstances. Both methods appear to be feasible, and both scale linearly with view size (MAC at a slope of 1 and IVM-VD at a slope of less than 1). IVM-VD appears to be mildly sensitive to update size, but this sensitivity is not prohibitive for reasonably sized updates.

Next we tested on the Chain instance to determine the algorithms' sensitivity to join size (as a proxy for query complexity). Figure 3.3 shows the run time (plotted on a logarithmic scale) of each algorithm as the number of joins is varied. The run time of IVM-VD increases exponentially, reaching around 1400 seconds for a 26-way join. While a 26-relation join may seem implausible, it is seen in practice; these results indicate that some caution is necessary when considering the application of IVM-VD for large numbers of joins. These results also require the number of updated relations to be large, regardless of the number of relations in the view definition.
In preliminary trials, we found IVM-VD to be efficient for views with a large number of joins when the number of updated relations is small. This is because the majority of delta rules perform a join with an empty relation; modern query optimizers can detect this case and quickly return an empty result. MAC scales quite well in this case. Its run time decreases at first due to a decrease in the size of QB, and then increases as the decreased view size is overwhelmed by the time to materialize QB. The sharp increase at 23 joins is due to MySQL's execution time of qB growing from less than 10 seconds to more than 45. Due to the data generation method for this instance, the view size gradually decreases from a mean of about 950,000 tuples for the k = 3 instance down to about 245,000 tuples for the k = 29 instance. As a result, the execution time of qB accounts for roughly 96% of the total time of MAC for the k = 23 instance, and this proportion increases with k.

Figure 3.3: Performance of view differencing on the chain instance (execution time in seconds, log scale, vs. join size).

We note that these results are due partly to the performance characteristics of our choice of relational database engine and configuration. While replacing MySQL with an implementation having better performance for large joins may change the results' scale, it should not affect the relative ordering of the two methods, as the execution time of MAC for views involving large joins is most likely to be bound by the time to materialize QB.

Finally, we present our TPC-H results. In Figure 3.4 we fixed the update size at 10% of the data set size and varied the size of the database. Both methods scale linearly with the size of the data, with MAC outperforming IVM-VD by a greater margin as the database size increases. This latter trend is largely due to our choice of update size: in Figure 3.5 we fixed the database size at 10GB and varied the update size.
In this case, the run time of IVM-VD increased with update size, while that of MAC decreased slightly due to decreasing view size. It should be noted that our experimental range extended to insertions and deletions of 20% of the tuples in the test database, which would be considered an extreme case. Our view definition (Figure 3.1) was a query from TPC-H; it joins two relations and uses four selection conditions involving tests of inequality, set membership and substring containment. The size of the resulting view is about 0.64% of the total database size, ranging from 0.11M to 2.37M tuples over the range of the experiments. We experimented with other TPC-H queries, and in all cases the point at which MAC outperforms IVM-VD is around 2.5%, despite the fact that the three queries used have varying numbers of joins and result in varying view sizes.

Figure 3.4: Performance of view differencing as a function of database size (GB) for TPC-H.

Figure 3.5: Execution time for view differencing as a function of update size (% of instance size) for TPC-H.

Chapter 4

Update Translation

4.1 Related Work

The view differencing stage has given us the set of insertions and deletions Q±B. Since our constraint is QB = QC, these insertions and deletions are the same as those which should be applied to QC; from here forward we will use Q±C to denote these updates. After finding Q±C, we must translate these into a set of updates on C. A satisfactory translation requires that evaluating query qC on the updated C results in the updated view. We continue Example 3.1 in the following:

Example 4.1 Recall the view defined in Equation (3.1) from Example 3.1. Now suppose that in this case the link relation is a member of the database C, and that the hop view represents the qC half of our mapping, i.e.
QC(x1, x2) :− link(x1, y), link(y, x2) (4.1)

If the contents of link are

link(a.com, b.co.uk)
link(a.com, c.edu)
link(c.edu, d.net)
link(b.co.uk, d.net)
link(b.co.uk, e.org)

the contents of QC are then

QC(a.com, d.net), QC(a.com, e.org)

Consider the deletion Q−C(a.com, d.net). This deletion could potentially be translated as deleting two tuples

link−(a.com, b.co.uk), link−(a.com, c.edu)

However, this translation has the side effect of also deleting QC(a.com, e.org). Or consider an insertion Q+C(f.com, g.ca); we need to insert two tuples into link. Not only do we not know the connecting URL y, but choosing particular values for y may result in additional insertions into QC.

There has been early work on translating a set of updates on a view to a set of updates on the base relations. Bancilhon and Spyratos [7] have shown that an update Q± on a view Q (defined by a query q on database C) has an unambiguous translation C± if there is a complement q̄ of q such that:

1. q̄ retains the information in C which is lost by q (i.e. the attributes/tuples which are not selected by q), and
2. Q± does not affect any of the information in q̄.

As it turns out, even the query (4.1) has no such complement. For a given pair Q+C(x1, x2), we know that there should be tuples inserted into link, but we cannot determine which y value to use. In the case that there is no satisfactory complement q̄, Bancilhon and Spyratos' result does not comment on whether or not translation is possible.

The problem of translating a single tuple t deleted from a view into a set of tuples T deleted from the base relations has received recent theoretical attention in [11]. The authors consider two variants of the problem: one which aims to minimize the side effects of deleting T, and another which minimizes the size of T. They find that both are NP-hard in the general case of views involving selection, projection, join and union (SPJU).
For the case of SPU or SJ queries, the problem can be solved in polynomial time. This work is extended by Kimelfeld et al. in [35, 36], who focus on optimal and approximately optimal algorithms for various subclasses of conjunctive queries, as well as the theoretical constraints inherent in these problem variants. Our focus in Section 4.3 is on developing a straightforward exhaustive algorithm that finds all possible minimal, side-effect-free translations of a set of tuples deleted from an SPJ view (if any exist). Our solution is expressed as a formula in disjunctive normal form. Although our algorithm is exponential, we demonstrate experimentally in Section 4.5 that, through careful pruning of the search space, it is possible to find a solution in considerably less time than the asymptotic bound would predict.

A related problem is that of data exchange, whose goal is to translate an instance of a source schema S to an instance of a target schema T, given a set Σ of source-to-target tuple generating dependencies (s-t tgds) between S and T [5, 8, 18]. These mappings are similar to our bdds, but only specify implication in the direction from S to T:

∀X, Y ϕ(X, Y) → ∃Z ψ(X, Z) (4.2)

where ϕ is a conjunction of relational predicates over S and ψ is a conjunction of relational predicates over T. A solution to a data exchange problem (S, S, T, Σ), consisting of a source schema and instance, a target schema, and the dependencies, is a target instance T such that S, T satisfy Σ. Tuples in a solution may contain values or labeled nulls (i.e. variables). A universal solution is one which can be mapped into every solution via homomorphism, and one can be found in polynomial time by chasing S on Σ. The tuple generating dependency (tgd) chase works by picking any set of values X, Y which violates a tgd of the general form in Equation (4.2) (i.e. ϕ(X, Y) is true, but there is no Z such that ψ(X, Z) is true), and inserting (X, Z) into the relations in ψ, where Z consists of distinct variables.
The chase ends when there are no more X, Y that violate any tgd. Data exchange settings usually also consider tgds and equality generating dependencies (egds) on T. We do not consider dependencies on C, but note that when target constraints are considered, the existence of a universal solution depends on the graph defined by Σ being weakly acyclic [5]. Roughly speaking, the graph is weakly acyclic if there are no cycles involving existentially quantified variables.

In Youtopia [38], updates are translated between peers using s-t tgds to express constraints amongst data instances. Each insertion or deletion is cascaded individually with forward and backward chase steps. Whereas this incremental cascading is appropriate in a peer network, we advocate a batch process for base/contingent coordination scenarios.

Update translation for data coordination brings about new challenges for the following reasons:

1. We desire an exact solution; our mapping constraints demand equality (bidirectional logical implications), as opposed to the more common subset constraints. This means that special attention must be paid to side effects in the form of spurious deletions from C or insertions into B.
2. Because the base-contingent relationship dictates the directional relationship from B to C, we are constrained to updating only C.
3. We need to consider sets of insertions and deletions, and the interactions between them; translating each insertion or deletion separately and taking the union will produce incorrect results.

Our methods for translation of view insertions (Section 4.2) make use of the tuple generating dependency (tgd) chase [1]. In doing so, we generate an incomplete translation containing labeled nulls, which is similar to a universal solution in data exchange.
However, whereas data exchange concerns itself with the class of queries which can be answered on T, we are interested in eventually completing our update translation by having a user choose values for the labeled nulls. As mentioned in Example 4.1, the bidirectional relationship of our mappings means that some choices for these values could result in mapping violations. We use techniques from incomplete information [23] to derive a constraint on the labeled nulls which guarantees mapping satisfaction. In Section 4.3, we translate deletions separately, by using the contrapositive of our bdd constraint. We give algorithms which find a minimal expression yielding all feasible deletion translations, while allowing the user ultimate flexibility in choosing how each deletion is translated. We discuss combining insertion translation and deletion translation in Section 4.4.

4.2 Insertions

We begin by reformulating the insertion translation problem using first order logic. Recall that our data coordination scenario uses a bdd ψ(X, Z) ↔ φ(X, Y) with query predicates defined as QB(X) :− ψ(X, Z) and QC(X) :− φ(X, Y). Also recall that φ(X, Y) is a conjunction of k predicates, with each φi(Xi, Yi) referring to one of the relations of C. We can restate our bdd as two tgds:

QC(X) → φ(X, Y)    (insertion tgd) (4.3)
φ(X, Y) → QC(X)    (deletion tgd) (4.4)

We call Equation (4.3) the insertion tgd because it is violated by the insertions Q+C. Similarly, Equation (4.4) is called the deletion tgd because it is violated by the deletions Q−C. The main challenge is to determine the possible states of C such that both tgds are satisfied with respect to the updated instance Q′C. What distinguishes this constraint problem from prior data exchange work is that the tgds point in both directions: from s-t and from t-s. Also, when dealing with deletions we have the assertion of negated predicates, meaning that t ∈ Q−C is equivalent to the fact ¬Q′C(t[X]).
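To see concretely why exactness (the deletion tgd) constrains translations, the side effect from Example 4.1 can be replayed in a few lines. The following toy Python sketch, not part of our system, evaluates qC directly over the link instance and shows that the candidate deletion translation damages the view beyond the intended deletion:

```python
def hop(link):
    # Evaluate qC: pairs of URLs connected by a path of length two.
    return {(x1, x2) for (x1, y1) in link for (y2, x2) in link if y1 == y2}

link = {("a.com", "b.co.uk"), ("a.com", "c.edu"), ("c.edu", "d.net"),
        ("b.co.uk", "d.net"), ("b.co.uk", "e.org")}
assert hop(link) == {("a.com", "d.net"), ("a.com", "e.org")}

# Candidate translation of the deletion Q-C(a.com, d.net): delete both
# link tuples leaving a.com.
candidate = link - {("a.com", "b.co.uk"), ("a.com", "c.edu")}
target = hop(link) - {("a.com", "d.net")}     # the intended updated view
print(target - hop(candidate))  # -> {('a.com', 'e.org')}: a spurious deletion
```

The tuple QC(a.com, e.org) disappears even though only QC(a.com, d.net) was to be deleted, so this candidate violates the deletion tgd's requirement of exact correspondence between C and the updated view.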
Our method for translation of insertions has two steps: chase and constrain. In the chase step, we use the tgd chase to generate a universal solution that satisfies the insertion tgd. The result is an incomplete instance containing labeled nulls for the unknown values. We expect that the user will eventually want to specify values for some or all of these labeled nulls. Following Example 2.1 from Section 2.1, the mapping allows tuples inserted into QC to dictate the quantity and description of items in the ProjectItems and ItemRates tables, but the code and rate (price) are unknown. The cost estimator has sufficient knowledge or external resources to be able to specify values for these attributes; for example, they may call a supplier in order to determine a new item's rate. Since we want the mapping constraint to still hold with respect to any values specified by the user, we need to determine which values cannot be chosen. In the constrain step, we use methods for query evaluation on incomplete information to find such a constraint, yielding an incomplete instance which satisfies both the insertion and deletion tgds. We continue Example 4.1 as follows:

Example 4.2 A tuple Q+C(f.com, g.ca) is inserted into QC, corresponding to a two-link hop between URLs f.com and g.ca. The tgd chase on Q+C results in

link+(f.com, z), link+(z, g.ca)

where z is a labeled null representing the (currently unknown) connecting URL between f.com and g.ca. This is acceptable if our goal is to find the certain answers to a class of queries over C [18, 24]. However, since the user's goal is to eventually resolve a ground instance of C, we need to be careful about additional side effects. In this case, choosing z = d.net falsely asserts that there is a two-link hop between c.edu and g.ca.

Past approaches to data coordination such as [38] would enforce satisfaction of both tgds given in Equations (4.3) and (4.4) via a chase sequence which inserts into both C and QC.
This sequence would terminate in our case because these tgds are weakly acyclic [5]. However, this approach is not an option for coordination between sources having a base-contingent relationship, because the chase rule on (4.4) would result in insertions into QC (which must remain fixed). The challenge in translation of insertions is to satisfy (4.3) without violating (4.4).

4.2.1 Chase Step

The chase step translates the set of insertions by chasing [5, 18] to find a universal solution C+ to the data exchange problem given Q+C and the insertion tgd (4.3). This universal solution is a sequence of relations C1+, . . . Cm+ in which, in addition to constants in const, tuples may also contain values over some domain nulls of labeled nulls. We use nulls(C+) to refer to the set of labeled nulls present in an instance. Recall that each predicate φi in φ refers to some relation Cj, so we will also use φ+ to denote the conjunctive formula whose relational atoms refer to the relations in the universal solution C+. This universal solution is not necessarily a "core" (smallest solution), although a core could be computed in polynomial time if desired, using the methods introduced in [19].

The following corollary establishes that, after adding this universal solution to our existing instance of C, the insertion tgd is satisfied. It follows from the definition of the chase and the fact that φ is a conjunctive query and hence monotonic.

Corollary 4.1 Let φ1+, . . . φk+ be the result of chasing Q+C on (4.3). Let Q′C = QC ∪ Q+C and φ′i = φi ∪ φi+. It follows that

Q′C(X) → φ′1(X1, Y1), . . . φ′k(Xk, Yk) (4.5)

In data exchange, universal solutions are desirable because they can be queried by treating labeled nulls as unique values; the set of answers to a query over a universal solution to a data exchange problem is common to all possible solutions for that data exchange problem. As such, querying a universal solution in this way gives the set of certain answers [18].
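For the hop mapping of Example 4.2, the chase step specializes to a few lines. This toy Python sketch represents fresh labeled nulls as strings such as _z1; it is a sketch for this one mapping, not a general chase implementation:

```python
import itertools

fresh = itertools.count(1)

def chase_hop_insertions(view_insertions):
    # Chase the insertion tgd QC(x1,x2) -> link(x1,y), link(y,x2):
    # each inserted view tuple fires the tgd once, generating two link
    # facts that share a fresh labeled null for the existential variable y.
    link_plus = []
    for x1, x2 in view_insertions:
        z = f"_z{next(fresh)}"   # labeled null: the unknown connecting URL
        link_plus += [(x1, z), (z, x2)]
    return link_plus

print(chase_hop_insertions([("f.com", "g.ca")]))
# -> [('f.com', '_z1'), ('_z1', 'g.ca')]
```

The result is the incomplete instance of Example 4.2: two link facts joined on a labeled null whose eventual value the user must supply, subject to the constraint derived in the constrain step.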
A natural question is whether or not the converse of (4.5) holds; that is, whether there are certain answers to φ′ which are not present in Q′C. We refer to these as spurious tuples. If there are, then (4.4) is necessarily violated, and exact translation is not possible. Consider the following example:

Example 4.3 B contains a single relation and C contains two relations C1 and C2. They are related by the following bdd:

B(x1, x2) ↔ C1(x1), C2(x2)

If both B and C are initially empty, and there are two insertions B+(1, 1) and B+(2, 2), then the result of the chase step is C1+(1), C1+(2), C2+(1) and C2+(2). However, the bdd is violated by the tuple t = {x1: 1, x2: 2}, which satisfies the right hand side but not the left.

We define a necessary condition for the violation of (4.4) by introducing the concept of a Y-disjoint query. We say a query QC(X, Y) :− φ(X, Y) is Y-disjoint if there exist nonempty subformulas φa(Xa, Ya) and φb(Xb, Yb) such that

φ(X, Y) = φa(Xa, Ya), φb(Xb, Yb) (4.6)

with Ya ∩ Yb = ∅; i.e. φ can be expressed as a conjunction of two formulas which do not share any existentially quantified variables. We call (4.6) a Y-partitioning of φ. If a query is not Y-disjoint then we say it is Y-connected.

Theorem 4.1 The deletion tgd is violated with respect to Q′C and φ′ only if φ is Y-disjoint.

Proof 4.1 Assume for a contradiction that φ is Y-connected and there is a witness s to a spurious tuple s[X]¹. We first prove the proposition that there is some i such that φi+(s[Xi, Yi]). Assume for a contradiction that there is no such i. Since s satisfies φ′, it must be the case that s satisfies φi for all i. In this case, QC(s[X]) must also be satisfied, contradicting the assumption that s is spurious. Now let i be such that φi+(s[Xi, Yi]).
Since φ is Y-connected (by our assumption), it must be the case that Y_i is nonempty (unless k = 1, in which case we have Q_C^+(s[X]), contradicting our assumption that it is spurious). Due to the way in which the chase operates, s[Y_i] is a sequence of values in nulls which are unique to some u ∈ Q_C^+ which caused their generation. Hence, s[X_i] = u[X_i] for some u ∈ Q_C^+. Since φ is Y-connected, there must be a φ_j(X_j, Y_j) such that Y_i ∩ Y_j ≠ ∅. Since s[Y_i] is unique to u, s[Y_i ∩ Y_j] is also unique to u, and hence s[X_j] = u[X_j]. Taking the transitive closure, we eventually reach all predicates in φ. Hence s[X] = u[X], which contradicts the assumption that s is spurious.

The condition of Theorem 4.1 is not sufficient; the existence of certain answers which are spurious tuples depends on the data. In Example 4.3, had only B^+(1, 1) been inserted, the bdd would be satisfied with respect to C^+. Corollary 4.2 follows from Theorem 4.1 and the definition of a spurious tuple.

¹ Recall from Section 2.2 that t[X] represents the sequence of values which results from applying t to each variable name in X, and that a witness to a tuple t in a query result is an extension of t onto the variables in the query's body such that the body is satisfied.

Corollary 4.2 The deletion tgd is violated iff for some Y-partitioning φ_a, φ_b of φ, there exist t, u such that φ_a(t[X_a, Y_a]) ∧ φ_b(u[X_b, Y_b]) and t[X_a ∩ X_b] = u[X_a ∩ X_b] and ¬Q_C'((t · u)[X]), where t · u is defined as

    (t · u)[x] = t[x] for x ∈ X_a, u[x] otherwise

In order to check whether (4.4) is violated with respect to Q_C' and φ', we first check if φ is Y-disjoint. If so, we find the certain answers to φ'(X, Y) and compare the result with Q_C'. In the following section we give a method for finding a constraint on the labeled nulls of C^+ (such as z ≠ d.net, from Example 4.2) so that the deletion tgd is satisfied.
Since this method involves querying incomplete information, which is a worst-case exponential operation, we also take advantage of the problem structure to directly find the constraint while avoiding redundant evaluation of tuples.

4.2.2 Constrain Step

The goal of the constrain step is to derive a constraint on the result of the chase step in order to enforce strict equality, i.e. to satisfy the insertion tgd without violating the deletion tgd. In an earlier approach [41], we expressed the universal solution C^+ as a conditional table (c-table) and used c-table semantics for relational operators [23] to calculate all possible answers and formulate a constraint which explicitly disallows any possible answers which would be spurious. Here we define possible answers using a formalization based in logic, permitting an optimized algorithm which reduces the amount of redundant work (although its asymptotic worst case is similar).

Our formalization captures the set of possible answers for a query and the condition under which each is an answer. We will construct a global constraint Φ on the labeled nulls which excludes any possible answer which is not in the updated view. Recall Example 4.2, where link^+(f.com, z), link^+(z, g.ca) were inserted as a result of inserting Q_C^+(f.com, g.ca). In that context, we talked about the certain answers to q_C. In order to examine the possible answers, we need to consider the case where z binds to other values or labeled nulls. We do this by defining a Possible Answers Generalization which relaxes the original query to allow such bindings:

Definition 4.1 (Possible Answers Generalization) Given a conjunctive query

    Q(X, Y) :− φ_1(X_1, Y_1), ..., φ_k(X_k, Y_k)

a possible answers generalization is a pair (Q*(Z), π) where:

• Q*(Z) is a conjunctive query with the same body as Q, except using a different set of variables Z.

    Q*(Z) :− φ_1(Z_1), ...,
φ_k(Z_k)

• π : Z → XY is a mapping such that ∀i, π(Z_i) = X_i, Y_i

Intuitively a PAG is a relaxation of a query Q which takes multiple occurrences of a single variable and replaces them with different variables, thus allowing them to bind to different values. By definition, π is also a homomorphism from Q* to Q; the homomorphism theorem [1] establishes that the answers to Q are a subset of those to Q*. We are specifically interested in finding the tuples that are answers to Q* but not Q.

For a PAG P = (Q*(Z), π), we define the relaxations of a variable v ∈ X, Y as the set of variables in Z that π maps to v.

    Relaxations(P, v) = {z ∈ Z : π(z) = v}

We also define Relaxed(P) as the set of variables relaxed.

    Relaxed(P) = {v ∈ X, Y : |Relaxations(P, v)| > 1}

Recall the query (4.1) from Example 4.1

    Q_C(x_1, x_2) :− link(x_1, y), link(y, x_2)

One possible PAG relaxes y by replacing each of its occurrences with a different variable, as in

    Q*_C(x_1, x_2) :− link(x_1, v), link(w, x_2)    (4.7)

where π(v) = π(w) = y, and hence Relaxed(P) = {y} with Relaxations(P, y) = {v, w}.

Recall the possibly spurious tuple (c.edu, g.ca) under the condition z = d.net from Example 4.2. We can capture the condition of a possible answer by requiring that all values bound to Z-variables mapping to the same X, Y variable be equal. Continuing with our PAG (4.7), we would require each satisfying tuple to have the condition v = w, since these are both mapped to y in the original query. For v = z and w = d.net, we say that the condition z = d.net yields the possible answer (c.edu, g.ca). We formally define this as follows:

Definition 4.2 (Possible Answer) Given a PAG P = (Q*(Z), π), we define a condition c as a mapping from each v ∈ Relaxed(P) to a set of |Relaxations(P, v)| distinct members of const ∪ nulls. A condition requires that, for each v, all values in c(v) be equal (and hence it follows that in order for c(v) to be consistent, it can contain no more than one member of const).
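The definitions of Relaxations and Relaxed translate directly into code. In this sketch, a PAG is represented only by its mapping π, encoded as a dict from Z-variables to the original variables; the encoding is an illustrative assumption.

```python
def relaxations(pi, v):
    """Relaxations(P, v): the Z-variables that pi maps back to v."""
    return {z for z, orig in pi.items() if orig == v}

def relaxed(pi):
    """Relaxed(P): original variables with more than one relaxation."""
    return {v for v in set(pi.values()) if len(relaxations(pi, v)) > 1}

# The PAG (4.7): Q*_C(x1, x2) :- link(x1, v), link(w, x2), where pi maps
# v and w back to y (and the head variables to themselves).
pi = {"x1": "x1", "x2": "x2", "v": "y", "w": "y"}
```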
We can represent a condition as a logical formula, Formula(c), by equating all values assigned to a particular v. For example, if c(v) = {3, x, y} and c(w) = {2, z}, then Formula(c) is

    3 = x ∧ x = y ∧ 2 = z

We say that a tuple t over Z is a possible answer to P under condition c if

1. the sequence t[Relaxations(P, v)] is an ordering of c(v) for each v ∈ Relaxed(P), and
2. the query Q*(t[Z]) is satisfied, and
3. Formula(c) is satisfied by t.

We use PossibleAnswers(P, c) to denote the set of all such t for a given PAG P and condition c. We can compute PossibleAnswers(P, c) as the union of all queries which replace each variable in Relaxations(P, v) with a unique value from c(v).

Continuing our example, a condition c binds two values to the variable y, since our PAG P defined in (4.7) has Relaxations(P, y) = {v, w}. A tuple t is a possible answer to P under c if it maps v and w to the two values in c(y) and satisfies P. Given a condition c(y) = {b.co.uk, z}, we can formulate two refinements of the query (4.7) by replacing v and w with these two values

    link(x_1, z), link(b.co.uk, x_2)
    link(x_1, b.co.uk), link(z, x_2)

Given the tuples link(a.com, b.co.uk) and link(z, g.ca), the second refinement yields the possible answer t[x_1, v, w, x_2] = (a.com, b.co.uk, z, g.ca).

Our goal for the constrain step is to find all conditions c such that there is a PAG P with some t ∈ PossibleAnswers(P, c) which is spurious (i.e. t[X] is not in Q_C'). Our method from [41] found all possible answers by using a brute-force execution of the incomplete information semantics of relational operators as described in [23]. We then subtracted Q_C' from the set of possible answers, yielding the set of spurious tuples and the condition for each. We took the conjunction of the complements of these conditions, forming an overall global constraint Φ. For example, if (a.com, g.ca) is not a member of Q_C', then a clause in Φ would be z ≠ b.co.uk.
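The refinement-based computation of PossibleAnswers(P, c) can be sketched for the PAG (4.7): each assignment of the values in c(y) to the relaxed variables v and w gives one refinement, and the union over refinements gives the possible answers. The link instance and the labeled null "z" follow Example 4.2; the flat-list representation is an illustrative assumption.

```python
from itertools import permutations

def possible_answers(link, c_y):
    """Union of query refinements of Q*(x1, x2) :- link(x1, v), link(w, x2),
    over all orderings of the condition values c_y assigned to (v, w)."""
    answers = set()
    for v_val, w_val in permutations(c_y, 2):   # one refinement per ordering
        for (a1, b1) in link:                   # candidate for link(x1, v)
            for (a2, b2) in link:               # candidate for link(w, x2)
                if b1 == v_val and a2 == w_val:
                    answers.add((a1, b2))
    return answers

link = [("a.com", "b.co.uk"), ("z", "g.ca")]    # "z" stands for a labeled null
ans = possible_answers(link, ["z", "b.co.uk"])
# The refinement link(x1, b.co.uk), link(z, x2) yields (a.com, g.ca),
# a possible answer under the condition z = b.co.uk.
```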
A practical challenge is that the number of possible answers is polynomial in the number of inserted tuples and exponential in the number of predicates in the query which join on a Y variable [41]. In particular, if N is the total number of tuples in our database (including inserted tuples), then the number of clauses in Φ (and also the execution complexity of our algorithm) is loosely bounded from above by

    O(N^k)    (4.8)

where k is the number of predicates in φ. However, in practice a number of the clauses in Φ will be implied by other clauses. Our contribution here is computing Φ while avoiding the computation of tuples which correspond to these clauses. The following example illustrates.

Example 4.4 We have a view defined over three relations A, B, and C as

    Q_C(x_1, x_2, x_3) :− A(x_1, y_1), B(y_1, x_2, y_2), C(y_2, x_3)

Suppose that the insertions we wish to translate are Q_C^+(0, 1, 2) and Q_C^+(3, 4, 5), and that after chasing and inserting the results into an existing instance we have

    A(0, a), B(a, 1, b), C(b, 2)
    A(3, c), B(c, 4, d), C(d, 5)
    A(6, 7)

where a to d are labeled nulls. Let our PAGs P_1 and P_2 be such that Relaxed(P_1) = {y_1} and Relaxed(P_2) = {y_1, y_2}. The set of possible answers to P_1 which are spurious is

    (0, 4, 5) under a = c
    (3, 1, 2) under a = c
    (6, 1, 2) under a = 7
    (6, 4, 5) under c = 7

while P_2 yields, in addition to these,

    (0, 1, 5) under b = d
    (3, 4, 2) under b = d
    (0, 4, 2) under a = c ∧ b = d
    (3, 1, 5) under a = c ∧ b = d
    (6, 1, 5) under a = 7 ∧ b = d
    (6, 4, 2) under c = 7 ∧ b = d

If we first determine that (0, 4, 5) is not in the updated view, then we know that Φ will contain a clause a ≠ c. Given this knowledge, we need not compute and check any of the tuples whose condition contains a = c (and similarly for conditions a = 7 and c = 7). This last observation illustrates the following key idea of our optimization: for a condition c, if any possible answer implied by c is not a member of Q_C', then we need not consider any conditions whose formula contains Formula(c).
In [41] we showed how to efficiently minimize Φ by removing any redundant conditions. Our improvement here is altogether avoiding the computation of such conditions, and in particular avoiding the computation of possible answers for such conditions. In the best case, we obtain a Φ which is polynomial in size.

Our algorithm proceeds through a lattice of PAGs, where P_2 is an ancestor of P_1 if there is a homomorphism from P_2 to P_1. It begins with PAGs which relax the original query as little as possible (e.g. P_1 in Example 4.4). These PAGs will have conditions c such that Formula(c) contains only a single equality comparison. If any tuple which is a possible answer for c is spurious, we can eliminate from consideration any stronger condition which requires c. As such, we only need to examine conditions which are composed of weaker conditions which do not imply spurious tuples. First we define the terms stronger and weaker.

Definition 4.3 A condition c_1 is said to be stronger than a condition c_2 if for every v in c_1's domain, either c_2(v) ⊆ c_1(v) or c_2(v) is not defined. Conversely, c_2 is said to be weaker than c_1. Note that a valuation of labeled nulls satisfying c_1 necessarily satisfies c_2.

Our algorithm ensures that for every spurious t under condition c, either ¬Formula(c) ∈ Φ, or there is some c' weaker than c such that ¬Formula(c') ∈ Φ. We establish correctness with the following theorem.

Theorem 4.2 For every PAG P = (Q*(Z), π) and condition c defined over P, and for every condition c' weaker than c, there is a PAG P' over which c' is defined, to which P maps via homomorphism.

Since our algorithm searches through a lattice of PAGs defined by homomorphism, Theorem 4.2 establishes that when we examine a condition c for some PAG P, we are assured that every weaker condition c' has already been considered.

Proof 4.2 (Theorem 4.2) We construct a homomorphism h by examining three cases for each v ∈ X, Y. We then define a PAG P' by applying h to P.

1.
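Definition 4.3 translates directly into a small predicate. In this sketch a condition is modeled as a dict from variable names to sets of values, an illustrative encoding not taken from the thesis implementation.

```python
def is_stronger(c1, c2):
    """c1 is stronger than c2: for every v in c1's domain, either
    c2(v) is a subset of c1(v) or c2(v) is undefined (Definition 4.3)."""
    return all(v not in c2 or c2[v] <= c1[v] for v in c1)

# A condition equating {a, b, c} is stronger than one equating only {a, b};
# conditions over disjoint value sets are incomparable.
```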
For each v ∈ Relaxed(P) such that c'(v) ⊂ c(v), choose any subset S of Relaxations(P, v) having size |c'(v)|. Let h be the identity for elements in S, and let h map each element of Relaxations(P, v) − S to some arbitrary element of S. E.g., if Relaxations(P, y) = {y, y', y''} and |c'(y)| = 2, then we might choose h to be the identity for y and y' and set h(y'') = y'.

2. For any v ∈ Relaxed(P) such that c'(v) is undefined, h maps the set Relaxations(P, v) to v.

3. For all other v, h is the identity.

We define P' = (Q*'(Z'), π') where

    Q*'(Z') = φ_1(h(Z_1)), ..., φ_k(h(Z_k))

and

    π'(u) = u if c'(u) is undefined, π(u) otherwise

The condition c' is defined over P' since it maps each v ∈ Relaxed(P') to a set of size |Relaxations(P', v)|. To show that P' is a PAG and that h is a homomorphism from P to P', we need to show that π'(h(Z_i)) = X_i, Y_i. Since P is a PAG, we can do this by showing for all u ∈ Z that π(u) = π'(h(u)). For v as in case 1, if u ∈ Relaxations(P, v), then since h(u) ∈ Relaxations(P, v), we have π(u) = π'(h(u)) = v. For v as in case 2, for all u ∈ Relaxations(P, v), we have π(u) = v by definition of Relaxations(P, v), and π'(h(u)) = v by definition of h and π'. All other v are covered by case 3, for which h(v) = v and π'(v) = π(v).

Algorithm 4.1 outlines our solution. It proceeds through the lattice in levels. A PAG's level represents its degree of relaxation, and is given by the expression

    Level(P) = Σ_{v ∈ Relaxed(P)} |Relaxations(P, v)|

The first loop bootstraps the process by examining all PAGs at level two (i.e. those whose conditions have only a single equality atom). For each PAG, it stores a set of Ambivalent conditions, which do not generate any spurious tuples and hence are not necessarily violated. The second loop proceeds to higher levels of the lattice, attempting to combine ambivalent conditions from the children of each PAG. In the remainder of this section we describe how each operation in this algorithm is performed.
We can find all level-two PAGs by the following enumeration:

1. Choose a variable v which appears in more than one predicate (the pivot variable).
2. Partition the occurrences of v into two non-empty sets.
3. Assign a new variable to each set, and use π to map from the new variables to v.

Since each condition either equates two labeled nulls or equates a labeled null with a constant, we can be somewhat restrictive in our choice of pivot without excluding any possible answers. We can safely exclude variables which cannot bind to labeled nulls, or whose value does not affect the outcome of the query. This leads to the following requirements for a choice of pivot v:

1. There are distinct i, j such that v ∈ Y_i and v ∈ Y_j; or
2. there is some φ_i such that v ∈ X_i, another predicate in φ refers to the same relation as φ_i, and that predicate has a Y-variable at a position where v occurs in φ_i.

For example, only x_2 or y_2 could be chosen as pivots in the following query.

    Q(x_1, x_2, x_3) :− A(x_1, x_2), A(x_3, y_1), B(x_2, y_2, y_2, x_3)

We can find the set of conditions c on step 5 by scanning the (at most) two relations which contain the relaxed variable. We can then find all conditions by taking all pairs from this set such that at least one member of the pair is a labeled null.

The loops of lines 13-15 and 30-32 are necessary because, for a given (P, c) pair, there may be a c' defined over a P' which is neither a descendant nor an ancestor of P, and for which Formula(c') = Formula(c). Since c and c' will be on the same level of the lattice, we check if one of them is ambivalent and the other is not. They must both be ambivalent in order to be passed to the next level of the search.

On line 21, we rely on the ability to determine whether a set of conditions C formulated over a PAG P's children can be combined into a coherent condition c.
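The level-two enumeration above can be sketched as follows: partition the occurrences of a chosen pivot into two nonempty sets and assign a fresh variable to each, with π mapping both fresh variables back to the pivot. Occurrences are encoded here as (atom index, position) pairs; the encoding and the fresh-variable naming scheme are illustrative assumptions.

```python
from itertools import combinations

def level_two_pags(occurrences, pivot):
    """Yield (assignment, pi) pairs: assignment maps each occurrence of the
    pivot to one of two fresh variables, and pi maps both fresh variables
    back to the pivot. Splits appear once per labeling of the two sides."""
    occs = list(occurrences)
    pags = []
    for r in range(1, len(occs)):          # sizes of the first group
        for group in combinations(occs, r):
            assign = {o: (pivot + "_1" if o in group else pivot + "_2")
                      for o in occs}
            pi = {pivot + "_1": pivot, pivot + "_2": pivot}
            pags.append((assign, pi))
    return pags

# y occurs twice in Q_C(x1, x2) :- link(x1, y), link(y, x2): occurrence
# (0, 1) is the second position of the first atom, (1, 0) the first
# position of the second atom. The single split appears in two labelings.
pags = level_two_pags([(0, 1), (1, 0)], "y")
```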
For example, consider a query

    A(x, y, y, y)

and PAGs P_1 and P_2 on level 2, with

    Q*_1(x) :− A(x, v, w, w)
    Q*_2(x) :− A(x, a, a, b)

We want to determine if two conditions c_1 and c_2 over P_1 and P_2 can be combined into a coherent condition over their parent PAG P, with

    Q*(x) :− A(x, v, w, u)

Our condition c will be such that c(y) combines the values from both c_1(y) and c_2(y). In order for c to be satisfiable, c(y) must contain three distinct values, at least two of which are labeled nulls. If c_1(y) = {z, 2} and c_2(y) = {z, 3}, then c(y) cannot be satisfied because z cannot be both 2 and 3. If, on the other hand, c_1(y) and c_2(y) contain no values in common, we should not combine them into the condition c (for example, by choosing any 3 values from c_1(y) ∪ c_2(y)), because we would obtain a condition which is stronger than some condition which is not weaker than either c_1 or c_2, and hence may not be ambivalent.

We formalize this beginning with the observation that if P' is a child of P, then exactly one of the following conditions holds:

1. P relaxes only one variable v which is not relaxed by P', with |Relaxations(P, v)| = 2.
2. P relaxes the same set of variables as P', but there is one v which is relaxed one additional degree.

Formally, let c_1, ..., c_g be a collection of conditions over P_1, ..., P_g, the children of P. Then c_1, ..., c_g can be combined iff for every P_i, P_j such that Relaxed(P_i) intersects Relaxed(P_j), for all v ∈ Relaxed(P_i) ∩ Relaxed(P_j),

1. |c_i(v) ∪ c_j(v)| = |Relaxations(P, v)|, and
2. there is at most one constant in c_i(v) ∪ c_j(v).

Note that there need not necessarily be such a v, in which case every pair of conditions may combine. The combined condition c sets c(v) as the union of all c_i(v).

Lastly, recall that the set PossibleAnswers(P, c) can be found as the union of a set of query refinements. We point out that we can quickly conclude that some of these refinements are unsatisfiable.
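The pairwise combinability test can be sketched directly from the two criteria above: the union of values for each shared relaxed variable must have exactly |Relaxations(P, v)| members, with at most one constant. Conditions are modeled as dicts, and labeled nulls as strings prefixed with "_"; both encodings are illustrative assumptions.

```python
def is_constant(x):
    """Labeled nulls are modeled as strings starting with '_'."""
    return not (isinstance(x, str) and x.startswith("_"))

def can_combine(c1, c2, parent_relaxations):
    """Check the two combinability criteria for every relaxed variable
    shared by c1 and c2; parent_relaxations[v] is |Relaxations(P, v)|."""
    for v in set(c1) & set(c2):
        merged = c1[v] | c2[v]
        if len(merged) != parent_relaxations[v]:
            return False                       # criterion 1 fails
        if sum(1 for x in merged if is_constant(x)) > 1:
            return False                       # criterion 2 fails
    return True

# c1(y) = {z, 2} and c2(y) = {z, 3} cannot combine (two constants);
# c1(y) = {z, 2} and c2(y) = {z, w} combine into a 3-value condition;
# conditions sharing no values yield a union of the wrong size.
```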
Due to the way in which the chase operates, we can associate each element of nulls(C^+) with the t ∈ Q_C^+ which created it during the chase step. If any query refinement uses labeled nulls originating from different inserted tuples in a single relational predicate, then it will not be satisfiable.

4.3 Deletions

As with insertions, translating deletions also presents challenges in the form of ambiguities and side-effects. There is ambiguity because any deleted tuple t ∈ Q_C^- may have multiple sets of tuples which can be deleted from C, each being a valid translation. The side-effects are that any or perhaps all of these sets may result in the deletion of other tuples from Q_C. In [39], deletions are translated by making physical duplicates of joining tuples in the base relations and using a modified join semantics, yielding a fully automatic approach which is free from side-effects. We use a semiautomatic approach which leaves the operator semantics intact and relies on a user to make decisions in the face of ambiguity.

In this section, we use the contrapositive of the deletion tgd in Equation (4.4) to formulate the set of deletions as a boolean formula. We give an exhaustive algorithm using heuristics which finds the set of minimal solutions to this problem that are feasible (i.e. also satisfy Equation (4.3)). Consider the following example:

Example 4.5 In Example 4.2, suppose that Q_C^-(a.com, d.net) is deleted. Among the many possible translations, we could delete both link^-(a.com, b.co.uk) and link^-(a.com, c.edu), or both link^-(b.co.uk, d.net) and link^-(c.edu, d.net). However, the deletion link^-(a.com, b.co.uk) has the unintended side-effect of also asserting that the tuple Q_C(a.com, e.org) is not present (which is false).

Note that deletions from Q_C can only cause violations of Equation (4.4) — when there is no Q_C(X) for a corresponding φ(X, Y).
Performing a tgd chase in this instance is not a viable solution, because we cannot fix these violations by inserting into Q_C. Our approach is based on transforming the deletion tgd (4.4) into its contrapositive (quantifiers are shown for clarity)

    ∀X ¬Q_C(X) → ¬∃Y φ(X, Y)    (4.9)

which we know is satisfied with respect to Q_C and C. With respect to Equation (4.9), deleted tuples amount to assertions of ¬Q_C(t[X]) for all t ∈ Q_C^-. The challenges in satisfying this contrapositive are:

1. The right hand side asserts nonexistence of X, Y, and so we need to find all Y which are in violation for a given X.
2. Finding a solution which is feasible in the sense that the insertion tgd (4.3) is not violated.

Given φ, the current instance C, and the set of deleted tuples Q_C^-, we describe how to derive a boolean formula, whose terms are complemented relational predicates of C, which expresses the set of deletion translations. This is more general than past approaches such as [24], which limits the possible translations, and [38], which deletes more tuples than necessary. We then give an algorithm which attempts to find all minimal translations satisfying this formula without violating Equation (4.3).

It is possible that there is no feasible translation. Following Example 4.1, if we have link(a.com, b.co.uk) and link(b.co.uk, a.com), and therefore Q_C(a.com, a.com) and Q_C(b.co.uk, b.co.uk), it is impossible to delete either tuple in Q_C without deleting the other. We do not give necessary/sufficient conditions for the existence of a feasible solution; our algorithm reports failure if none can be found.

In more detail, the right hand side of Equation (4.9) is equivalent to the following formula, which must hold for each deleted X:

    ∀Y  φ_1^-(X_1, Y_1) ∨ ... ∨ φ_k^-(X_k, Y_k)    (4.10)

In other words, for each witness u of t, we must choose at least one φ_i^-(u[X_i, Y_i]) in order to satisfy (4.10). The following example illustrates this point.
Example 4.6 Recall the database C of links between URLs from Example 4.2, and the view in Equation (4.1) which specified all URL pairs which could be reached in two hops. The contrapositive of the deletion tgd for this view is:

    Q_C^-(x_1, x_2) → ∀y  link^-(x_1, y) ∨ link^-(y, x_2)

where y represents the (unknown) connecting URL. Given Q_C^-(a.com, d.net), we have two witnesses

    u_1[x_1, y, x_2] = (a.com, b.co.uk, d.net)
    u_2[x_1, y, x_2] = (a.com, c.edu, d.net)

This means that in order to delete t, we must choose link^- tuples to satisfy the following formula

    (link^-(a.com, b.co.uk) ∨ link^-(b.co.uk, d.net)) ∧ (link^-(a.com, c.edu) ∨ link^-(c.edu, d.net))

There are four minimal satisfying assignments:

    link^-(a.com, b.co.uk), link^-(a.com, c.edu)
    link^-(a.com, b.co.uk), link^-(c.edu, d.net)
    link^-(b.co.uk, d.net), link^-(a.com, c.edu)
    link^-(b.co.uk, d.net), link^-(c.edu, d.net)

Only the latter two are feasible, because deleting link(a.com, b.co.uk) has the unwanted side-effect of deleting Q_C(a.com, e.org).

Our approach works by building a formula in conjunctive normal form (i.e. a conjunction of disjunctive clauses). Each disjunctive clause represents the choice of which predicate in φ_1, ..., φ_k to delete for a given witness u to a deleted tuple t. The conjunction is over all t and all u which are witnesses of t. Specifically,

    ⋀_{t ∈ Q_C^-}  ⋀_{u witness to t}  ( φ_1^-(u[X_1, Y_1]) ∨ ... ∨ φ_k^-(u[X_k, Y_k]) )    (4.11)

A deletion translation C^- consists of a selection of at least one i for each witness u of each deleted tuple t. A deletion translation is feasible if it also does not violate (4.3). Our approach is to find all such C^- by transforming (4.11) into a formula, Γ, in disjunctive normal form (i.e. a disjunction of conjunctive clauses). Each conjunctive clause corresponds to a translation C^-.
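Example 4.6 can be checked end to end with a small sketch: each witness contributes a disjunctive clause of candidate link deletions, and a naive CNF-to-DNF expansion with a side-effect check reproduces the two feasible translations. The link instance is assumed to contain the tuples needed by Examples 4.2 and 4.6; the exhaustive expansion stands in for the pruned search of Algorithm 4.2.

```python
from itertools import product

link = {("a.com", "b.co.uk"), ("a.com", "c.edu"),
        ("b.co.uk", "d.net"), ("c.edu", "d.net"),
        ("b.co.uk", "e.org")}

def view(link):
    """The two-hop view (4.1): all pairs reachable via one connecting URL."""
    return {(x1, x2) for (x1, y1) in link for (y2, x2) in link if y1 == y2}

def deletion_translations(link, deleted):
    # One disjunctive clause per witness of each deleted view tuple.
    clauses = [[(x1, y), (y, x2)]
               for (x1, x2) in deleted
               for (xa, y) in link if xa == x1
               for (yb, xb) in link if yb == y and xb == x2]
    feasible = set()
    for choice in product(*clauses):           # naive CNF -> DNF expansion
        c_minus = set(choice)
        # Feasible: the deletions remove exactly the requested view tuples.
        if view(link - c_minus) == view(link) - set(deleted):
            feasible.add(frozenset(c_minus))
    return feasible

sols = deletion_translations(link, [("a.com", "d.net")])
# Only the two translations containing link^-(b.co.uk, d.net) survive;
# the two containing link^-(a.com, b.co.uk) also delete Q_C(a.com, e.org).
```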
During this transformation, we eliminate conjunctions which correspond to infeasible translations, and also minimize Γ by selectively excluding conjunctive clauses which are implied (by absorption) by the current expanded form of Γ.

This approach is outlined in Algorithm 4.2. The TranslateDeletions routine constructs a formula in conjunctive normal form, and the RecursiveSearch routine builds Γ by trying combinations of atoms in each disjunctive clause. The fact that Q_C is monotonic allows our algorithm to prune its search space by not considering any deletion translation which is a superset of an infeasible translation. This is done by the condition on step 6 of RecursiveSearch. Step 8 dictates that all clauses containing a particular term γ should be removed, since choosing that term will satisfy all of them. We can efficiently check the condition on step 6 of RecursiveSearch by maintaining a list of witnesses for all s ∈ Q_C. Each additional deletion on step 6 of RecursiveSearch may remove one or more witnesses, and the constraint will be violated if any s ∈ Q_C has no remaining witnesses.

Our search forms a tree, with each level corresponding to one or more disjunctive clauses, and each branch corresponding to a term (deleted tuple) in that clause. A path from root to leaf represents a solution, and our algorithm prunes all subtrees which are attached to the root by a path which is infeasible. Our algorithm has a worst-case run time of O(s·k^{Mn}), where n = |Q_C^-|, M is the maximum number of witnesses for any deleted tuple, and s is the time to determine if a solution yields spurious deletions. In Section 4.5, we demonstrate that our algorithm performs quite well on realistic cases, where relational normalization permits a number of practical optimizations.

In the following section, we discuss how to combine the techniques presented thus far to allow the user to arrive at complementary insertion and deletion translations which form an overall solution.
4.4 Combining Insertions and Deletions

Sections 4.2 and 4.3 described our solutions for individually translating view insertions and deletions into sets of possible insertions/deletions on C. The user must then select from these possibilities what they believe is the correct set of updates for their local data. Given a universal solution C^+ to the insertion translation problem and a constraint Φ over nulls(C^+) ∪ const, the user must choose a mapping ρ : nulls(C^+) → const which gives a value for each labeled null such that ρ(Φ) is true. The user must also select one of the conjunctive clauses in Γ, which corresponds to the appropriate set C^- of deleted tuples.

An important consideration is whether or not a given ρ and C^- together produce a feasible translation. This depends on whether the insertions or the deletions are performed first. If the insertions are performed first, then it is possible that one of the deletion translations either is no longer sufficient to satisfy the deletion tgd (i.e. it fails to achieve the deletion of some t ∈ Q_C^-), or causes the deletion of some u ∈ Q_C^+. If the deletions are performed first, then it is possible that one of the insertions causes some t ∈ Q_C^- to be accidentally reinserted; it is not possible, however, that the insertions fail to achieve the insertion of some u ∈ Q_C^+, since q_C is monotonic and ρ(C^+) is a solution to the insertion translation problem. Consider the following example.

Example 4.7 We have a query

    Q_C(x) :− A(x), B(y)

with an initial state A(0), B(1) and Q_C(0). We are given a deletion Q_C^-(0), which can be translated as either A^-(0) or B^-(1). We are also given an insertion Q_C^+(2), which implies A^+(2) and B^+(a), where a is a labeled null.
If B^-(1) is chosen, then there is a conflict no matter what value is chosen for a: if B(1) is deleted before inserting B^+(a), then Q_C(0) remains; if B^+(a) is inserted before deleting B(1), then either Q_C(0) remains (if a ≠ 1) or Q_C(2) does not (if a = 1).

In the above example, we would like to automatically make the prior determination that B^-(1) is not a feasible deletion translation, no matter what concrete insertion translation is chosen. Our approach is to first insert the universal solution C^+ into C, obtaining a partially updated instance which we denote as C^*. We use Q_C^* = Q_C ∪ Q_C^+ to denote the partially updated view. We then perform deletion translation with respect to Q_C^*, C^* and Q_C^-, except adjusting Algorithm 4.2 to exclude any C^- which contains a tuple in C^+. Assuming that C^* and Q_C^* satisfy both the insertion tgd (4.3) and deletion tgd (4.4), the following two statements establish that by excluding all C^- which intersect with C^+, we eliminate all deletions which cannot be valid for any ρ, and do not eliminate any deletions which could be valid for some ρ.

Theorem 4.3 Any result C^- of Algorithm 4.2 with inputs C^* and Q_C^* that intersects with C^+ will necessarily violate (4.3) with respect to C^* − C^- and Q_C^* − Q_C^-.

Lemma 4.1 For any result C^- of Algorithm 4.2 with inputs C^* and Q_C^* which does not intersect with C^+, there exists a ρ such that (4.3) and (4.4) are satisfied with respect to Q_C^* − Q_C^- and C ∪ ρ(C^+) − C^-.

Proof 4.3 (Theorem 4.3) Let i, s be such that s ∈ C_i^+ ∩ C_i^-. Since all deletion translations are minimal (i.e. no subset of C^- is a feasible translation), deleting s is necessary to delete some t ∈ Q_C^-. Let u be the tuple in Q_C^+ which caused s to be generated. Assume for a contradiction that s can be deleted without causing u to be deleted. Let φ_j be the predicate in φ which refers to C_i, and which caused the generation of s during the chase step.
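The conflict in Example 4.7 can be verified mechanically: for the schema Q_C(x) :− A(x), B(y), choosing B^-(1) fails for every concrete value of the labeled null a, in either order of applying the updates. The representative valuations of a below are an illustrative assumption.

```python
def q(A, B):
    """Evaluate Q_C(x) :- A(x), B(y): every A value, provided B is nonempty."""
    return set(A) if B else set()

A = {0, 2}                            # A(0) plus the insertion A^+(2)
for a in (1, 7):                      # representative valuations of the null a
    del_first = ({1} - {1}) | {a}     # delete B(1), then insert B(a)
    ins_first = ({1} | {a}) - {1}     # insert B(a), then delete B(1)
    for B in (del_first, ins_first):
        # A correct outcome would delete Q_C(0) and insert Q_C(2).
        correct = (0 not in q(A, B)) and (2 in q(A, B))
        assert not correct            # some conflict always remains
```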
Let v be the witness for t from which s was chosen, and let r be the extension of s such that r is in the universal solution and r is a witness for u. Since s comes from a minimal set of deletions, we can conclude that φ_j is the only predicate of v which is deleted. Let φ_a(X_a, Y_a) be the largest Y-connected component of φ containing φ_j. Due to the way in which the chase operates, s[Y_j] is a (possibly empty) set of labeled nulls unique to u (if it is empty then φ_a = φ_j). This implies that any extension s' of s which satisfies φ_a will have s'[Y_a] = u[Y_a]. Therefore, we have v[Y_a] = u[Y_a] and hence t[X_a] = u[X_a]. If φ is not Y-disjoint, this implies that X = X_a and hence t[X] = u[X], which is a contradiction because Q_C^+ and Q_C^- do not intersect. Therefore, φ must be Y-disjoint. Since we assumed that u is not deleted, there is an r' such that r' satisfies φ_a. Since φ_j is the only component of v that is deleted, it follows that r' · v satisfies φ and is a witness for t, contradicting the assumption that t can be deleted.

Proof 4.4 (Lemma 4.1) The semantics of querying instances with labeled nulls behave as if each labeled null were a distinct value. If C^- is a deletion translation which is feasible with regard to C^* (which contains labeled nulls), then choosing a ρ which replaces these labeled nulls with distinct values does not change the result of querying ρ(C^*), and hence C^- is a feasible deletion translation for the concrete instance as well.

Notice that in the proof of Theorem 4.3 we have shown that φ must be Y-disjoint. We separate this conclusion as a corollary.

Corollary 4.3 If φ is not Y-disjoint, then no C^- will contain any tuple in C^+, and hence every deletion translation will be feasible for some choice of ρ.

We have now restricted the set of deletions to exactly those which do not necessarily cause a conflict for all possible concrete insertions.
However, the remaining solutions may still conflict with particular choices of ρ, as demonstrated in the following example.

Example 4.8 There is a view

    Q_C(x_1, x_2) :− A(x_1, y), B(y, x_2)

with an initial state of A(1, 2), B(2, 1) and Q_C(1, 1). We are given a deletion Q_C^-(1, 1), which is translated into the two possibilities A^-(1, 2) or B^-(2, 1). We are given an insertion Q_C^+(1, 3), which is translated into the universal solution A^+(1, z), B^+(z, 3). If the user chooses A^-(1, 2), then ρ(z) = 2 cannot be chosen, and vice-versa.

We require the user to choose the deletion set first, which we apply to the instance C^*. We then perform the constrain step of insertion translation, using Q_C as the updated view. In this way we guarantee that we obtain a constraint which does not re-insert any tuples in Q_C^-.

4.5 Experimental Evaluation

We implemented our update translation algorithms and experimentally evaluated their performance using the TPC-H database benchmark specification [56] and a query (Figure 4.1) from that specification which joins five tables and results in a large view relative to the size of the base relations. In order to perform experiments varying the number of joins, we removed Region, Nation, Supplier, and Part, in that order. We chose this test instance to be comparable with the scale and complexity of previous experiments on view updating [39]. Our implementation is a C++ program which uses the Berkeley DB external storage API [51] and B-Tree index structures. All experiments were run on a 2.93GHz Intel Xeon based system with 64GB of memory running OpenSUSE Linux.

    SELECT S_ACCTBAL, S_NAME, N_NAME, R_NAME, P_PARTKEY, P_MFGR,
           S_ADDRESS, S_PHONE, S_COMMENT
    FROM SUPPLIER
    JOIN PARTSUPP ON PS_SUPPKEY = S_SUPPKEY
    JOIN PART ON P_PARTKEY = PS_PARTKEY
    JOIN NATION ON N_NATIONKEY = S_NATIONKEY
    JOIN REGION ON R_REGIONKEY = N_REGIONKEY;

    Figure 4.1: TPC-H query Q2

In total, our dataset had ≈1.9M tuples, and the view size is 0.8M tuples.
We generated view insertions by uniformly choosing random values from the database; we also randomly generated new values. We generated view deletions by choosing tuples from our materialized view at random. We implemented two methods for deriving the constraint Φ that is part of the result of insertion translation:

• Nested Loop, the method from [41], which uses a nested loop to perform a join of relations containing labeled nulls. This method then computes Φ by determining which tuples in this join are spurious.

• Optimized, which uses Algorithm 4.1 to find the minimal Φ without computing any unnecessary joins.

In [41], we also made use of a static tables heuristic, which improved the performance of insertion translation by constraining the set of solutions to modify only certain relations. We have omitted those results here, since we observed that the Optimized method, which was not in [41], performs better. The Nested Loop method makes use of a number of optimizations based on a lexicographical indexing mechanism for disjunctions of inequalities. We define a total ordering on inequalities, and represent Φ as a balanced binary search tree of disjunctions of inequalities, where disjunctions are ordered lexicographically. During the Nested Loop join, we can skip inner loops if Φ contains any of the inequalities in the disjunction required at the current state of the outer loops. This can be determined by performing multiple lookups. We also take advantage of this indexing to periodically reduce some of the redundant terms in Φ.

Our first experiment analyzes performance as the number of predicates k in φ is varied (by projecting our test query onto k ≤ 5 relations). The results (Figure 4.2) indicate exponential growth of the Nested Loop method for insertion translation, agreeing with the analysis in Section 4.2.2.
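The clause index just described can be sketched in Python. This is a simplification of the idea, not the C++ implementation: we encode each inequality "X ≠ v" as an assumed hashable pair (X, v), each disjunction as a frozenset of such pairs, and use subset tests in place of the lexicographic B-tree lookups.

```python
# Sketch of indexing Phi, a conjunction (set) of disjunctive clauses of
# inequalities. A clause D' implies a clause D whenever D' is a subset
# of D, which drives both absorption and loop skipping.

def add_clause(phi, clause):
    """Add a clause unless an existing clause absorbs it; drop existing
    clauses that the new one absorbs."""
    clause = frozenset(clause)
    if any(c <= clause for c in phi):
        return phi                               # already implied: skip
    return {c for c in phi if not clause <= c} | {clause}

def can_skip(phi, partial_clause):
    """The inner loops of the nested-loop join can be skipped when Phi
    already contains a clause lying inside the disjunction the current
    outer-loop state would generate."""
    partial = frozenset(partial_clause)
    return any(c <= partial for c in phi)

phi = set()
phi = add_clause(phi, {("X1", 10), ("X2", 20)})
phi = add_clause(phi, {("X1", 10)})   # absorbs the two-term clause
assert phi == {frozenset({("X1", 10)})}
assert can_skip(phi, {("X1", 10), ("X3", 30)})
```

The periodic reduction of redundant terms mentioned above corresponds to the absorption step inside add_clause.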
[Figure 4.2: Performance of update translation as a function of the number of predicates (k) in q_C. Wall time (sec) versus k ∈ {1, ..., 5}; series: Insertions - Nested Loop, Insertions - Optimized, Deletions.]

The difference in performance between k = 1 and k = 2 is statistically insignificant because the Y-variables do not intersect between the first and second predicates. The effect of increasing k dominates relation size, since the fourth relation contains 25 tuples and the fifth only 5. The Optimized method for insertion translation scales linearly because all conditions which relax variables joining these additional relations result in spurious tuples. As a result, the number of PAGs above level 3 of the lattice remains the same in each case. This supports our intuition that the Nested Loop method spends a majority of its effort examining spurious tuples for conditions which are implied by other conditions in Φ.

The performance of deletions scales roughly linearly. There is a jump from k = 2 to 3, as the number of deletions required for each t increases. The observed scalability is better than would be expected from the worst-case analysis due to the hierarchy of many-to-one relationships in the test query, which causes α(t) = 1 and also yields a large pruning benefit.

Our second experiment (Figure 4.3) varies the size of C. Each algorithm scales linearly, due to the chain structure of the test query. The number of tuples in our test query grows linearly with the size of the data. Two relations (Nation and Region) in our test dataset have a fixed number of tuples. Because of this, the number of possible conditions grows linearly for only one free variable, and remains the same for all others, resulting in linear growth of insertion translation.

Our final experiment (Figure 4.4) varies the number of inserted or deleted tuples, n.
The performance of the Nested Loop method for insertion translation suffers due to the high-degree polynomial in n. The scalability of the Optimized method is again much better, as the majority of conditions yielding spurious tuples are for PAGs on level 2. The performance of deletions is largely unaffected, and appears to scale linearly with a factor roughly equal to 1. The deletion translation performance exceeds the expectations set in Section 4.3, because the normalized form of the TPC-H schema permits a number of optimizations. In particular, primary key/foreign key relationships between relations allow the set of witnesses for each tuple to be found efficiently using primary and foreign key indexes, and also allow solution feasibility to be determined easily.

[Figure 4.3: Performance of update translation as a function of the size (N) of C. Wall time (sec) versus database size (GB), 1-10; series: Insertions - Nested Loop, Insertions - Optimized, Deletions.]

[Figure 4.4: Performance of update translation as a function of update size (n). Wall time (sec) versus number of insertions/deletions (n), 10-100; series: Insertions - Nested Loop, Insertions - Optimized, Deletions.]

Algorithm 4.1 Overview of our optimized algorithm for finding Φ.
1: inputs
   Query Q(X, Y)
   Updated view instance Q_C
2: Φ ← ∅
3: for all P s.t. Level(P) = 2 do
4:   Ambivalent(P) ← ∅
5:   for all conditions c over P which are consistent do
6:     if PossibleAnswers(P, c) ⊆ Q_C then
7:       Add c to Ambivalent(P)
8:     else
9:       Add ¬Formula(c) to Φ.
10:    end if
11:  end for
12: end for
13: for all P s.t. Level(P) = 2 do
14:   Remove from Ambivalent(P) any conditions c for which ¬Formula(c) ∈ Φ.
15: end for
16: m ← 3
17: while there are P on level m such that all children have ambivalent conditions do
18: for all P s.t.
Level(P) = m do
19:   Ambivalent(P) ← ∅
20:   for all collections C of 1 condition from each child of P do
21:     if C can be combined into a condition c then
22:       if PossibleAnswers(P, c) ⊆ Q_C then
23:         Add c to Ambivalent(P)
24:       else
25:         Add ¬Formula(c) to Φ.
26:       end if
27:     end if
28:   end for
29: end for
30: for all PAGs P s.t. Level(P) = m do
31:   Remove from Ambivalent(P) any conditions c for which ¬Formula(c) ∈ Φ.
32: end for
33: m ← m + 1
34: end while

Algorithm 4.2 Deletion translation.
TranslateDeletions(Q−_C)
1: inputs
   Set Q−_C of tuples to be deleted.
2: outputs
   A formula Γ in disjunctive normal form representing all feasible deletion translations.
3: CNF ← ∅
4: for all t ∈ Q−_C do
5:   for all u which is a witness to t do
6:     CNF ← CNF ∧ (φ1(u[X1, Y1]) ∨ ... ∨ φk(u[Xk, Yk]))
7:   end for
8: end for
9: Γ ← RecursiveSearch(∅, CNF)
10: if Γ is empty then
11:   Report failure
12: end if
13: Minimize Γ using the rule of absorption.

RecursiveSearch(C−, CNF)
1: if CNF is empty then
2:   Output C− as a conjunctive clause
3: else
4:   D ← any disjunctive clause in CNF.
5:   for all terms γ in D do
6:     if (4.3) is satisfied w.r.t. C− ∧ γ then
7:       C− ← C− ∧ γ
8:       Remove from CNF any disjunction containing γ.
9:       RecursiveSearch(C−, CNF)
10:    end if
11:  end for
12: end if

Chapter 5

Update Completion

The final step in the data coordination process is for the user (C's administrator) to choose which insertions/deletions, of those generated using the methods from the previous two chapters, should be applied to C. Specifically, we call update completion the task of choosing a disjunctive term in Γ (i.e. a set of deleted tuples) and a mapping ρ : nulls(C+) → const such that ρ(Φ) is true, given Γ, C+, and Φ as computed in Chapter 4. In this chapter we are primarily concerned with two questions:

1. What is the benefit of generating a set of possible updates, versus manual coordination?

2.
How can we provide additional assistance to the user in selecting amongst the possible updates?

The question of which (if any) of the possible updates corresponding to B± will be chosen is subjective and context-specific. Our work is guided by examples similar to the AEC domain described in Section 1.2. We expect in these examples, which require accurate data and precisely engineered mappings, that there will be a unique, correct, and feasible update amongst the set of possible updates generated, and that it can be identified using knowledge held by the user. This assumption may not hold in other data coordination scenarios. In particular, the user may want to select an update which is infeasible (either a conflicting set of insertions and deletions, or one using values for the labeled nulls which violate Φ). This implies the mapping constraint is incorrect and may need either refinement or relaxation. Some recent work uses examples [3] and data provenance [43] to create and refine mappings and information extraction rules, respectively. In another scenario, there may be many possible completions, with varying levels of correctness.

In this chapter we describe several approaches intended to assist the user in performing update completion, depending on assumptions about the underlying data. We propose some methods for refining the set of choices, and for managing the task's complexity. We describe a user study which attempts to answer the two questions at the beginning of this chapter, introducing an experimental prototype data coordination system. Our experimental instance consists of a simplified building design and cost estimate, based on data obtained from a real building project (the UBCO project described in Section 1.2). We give both quantitative and qualitative results showing that semiautomatic data coordination is significantly more efficient than manual coordination, while the benefit of some assistance strategies is highly situational.
5.1 Assistance Strategies

The number of completions for a given set of possible updates is large; the number of possible deletions is exponential in the number of deleted tuples (where the base is the number of joins in the query q_C), and the number of possible insertions is the size of the Cartesian product of the labeled nulls' domains. We propose a number of strategies to help manage this complexity and to guide the user in selecting a completion. These are given as concepts and directions for future work. A few are evaluated within the context of our user study.

Deletion Ranking The conjunctions in Γ can be ranked so that those which are more likely to be chosen are near the top. We propose two ranking methods which can be used depending on the application scenario: 1) Size-based deletion ranking prefers deletion sets which remove a smaller number of tuples. 2) Table preference deletion ranking is similar to the static tables heuristic discussed in the context of update translation. Rather than explicitly disallowing modifications to particular relations, table preference deletion ranking uses a preference order over the relations in C, and favours deletion translations which remove more tuples from high-ranked relations and fewer tuples from low-ranked relations.

Insertion Value Suggestion Since the number of choices for ρ is possibly infinite, ranking is not a realistic option. For this reason, we suggest an approach which gives suggested values for some number (not necessarily all) of the elements of nulls(C+) based on incidence with C. Specifically, we suggest values for ρ based on maximizing

Σ_i |ρ(C_i+) ∩ C_i|

This can be done by, for each tuple t ∈ C_i+, looking for a matching u ∈ C_i that has the same values for the attributes of t which are non-null. We would suggest the values of u for the attributes of t which are labeled nulls.
For example, if there is an inserted tuple t = ⟨z, "123 Foley St."⟩ which matches an existing tuple u = ⟨"Jane Doe", "123 Foley St."⟩, we would suggest z = "Jane Doe". This is related to the record matching problem (also called the merge/purge problem, entity matching, deduplication, and object reconciliation) which occurs when constructing a data warehouse [54]. Whereas the record matching problem often deals with textual similarity measures to resolve typos, inconsistent terminology and abbreviations (e.g. "J. Doe"), our problem is simplified because a labeled null can match any value. This approach would work well in situations where the labeled nulls are functionally dependent on a subset of the non-null attributes, and hence there is at most one possible match for each t. The general case is more complex: it is possible that there are multiple matches for a given t, or that there is a labeled null shared by some t1 and t2, with matches u1 and u2 respectively which give conflicting suggestions. We leave these problems as future work.

Incremental Translation If the number of inserted and deleted tuples is large, then the size of each deletion set, the size of C+, and the number of labeled nulls will also be large. In such cases, we propose incremental translation, where the user makes a number of individual choices for each inserted or deleted tuple. For the inserted tuples, an individual choice would consist of choosing values for the set of labeled nulls corresponding to a particular inserted tuple (i.e. generated during a single chase rule application). For the deletions, the user would have to choose, for a given t ∈ Q−_C and witness u for t, from which relation in φ to delete. Since we expect the number of predicates k in φ to be relatively small, it is reasonable to assume that presenting them all at once for a given t and u will not overwhelm the user.
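The value-suggestion heuristic described earlier in this section can be sketched as follows, under the simplifying assumption that the labeled nulls are functionally dependent on the non-null attributes (so at most one match exists). None stands in for a labeled null, and the data echoes the Jane Doe example; the schema and helper name are hypothetical.

```python
# Sketch of insertion value suggestion: an inserted tuple containing
# labeled nulls is matched against existing tuples that agree on all of
# its concrete (non-null) attributes, and the matching tuple's values
# are suggested for the nulls.

def suggest(inserted, existing):
    """Return {position: suggested value} for the nulls of `inserted`,
    or {} if no existing tuple matches on every concrete attribute."""
    for u in existing:
        if all(t_v is None or t_v == u_v for t_v, u_v in zip(inserted, u)):
            return {i: u[i] for i, v in enumerate(inserted) if v is None}
    return {}

people = [("Jane Doe", "123 Foley St."), ("Bob Roe", "9 Elm Ave.")]
t = (None, "123 Foley St.")          # z, "123 Foley St." from the text
assert suggest(t, people) == {0: "Jane Doe"}
```

Because a labeled null matches any value, no similarity measure is needed; handling multiple or conflicting matches is left as future work, as noted above.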
This is similar to the approach taken in the Youtopia project [38], where the user intervenes to resolve ambiguity in each individual update. A benefit in our work is that by computing the set of updates in a batch, we can take advantage of relationships between those updates. For example, consider a database C which has relations Person(id, name) and PhoneNumber(number, person_id, type), and a view Q_C(name, number) which joins Person and PhoneNumber. If we delete all of the name, number pairs for a given name, the preferred translation may be to delete the corresponding person from the Person relation.

Provenance Data provenance is "the description of the origins of a piece of data..." [10]. The provenance of a suggested update may provide important contextual information to help the user determine whether or not to choose it. For inserted tuples, each t ∈ Q+_C gives the provenance of all tuples in C+ generated by it during the chase step of Section 4.2.1. In turn, the definition of t's why-provenance as it is computed by the query q_B is addressed in [14, 10]. We could also think of each conjunctive term in Φ as having why-provenance in the form of the spurious tuples it generates. This information could easily be generated by Algorithm 4.1 each time a condition c is found to imply spurious tuples. For deletions, each t ∈ Q−_C is the why-provenance of each tuple in all witnesses to t, as computed on line 6 of Algorithm 4.2. Since each deleted tuple may be a witness for multiple members of Q−_C, we would need to store a set for each of them. To go one step further and find the provenance for a given t ∈ Q−_C would require finding its provenance in the old instance of B, and taking the intersection with B−.

5.2 A Data Coordination User Study

We have performed a user study to evaluate the overall benefit of semiautomatic data coordination, and of some of the specific update completion strategies discussed in this chapter.
Our user study aims to answer the following questions:

Q1 Does semiautomatic data coordination provide a significant time savings versus manual coordination?

Q2 For what types of mappings and updates is semiautomatic data coordination most useful?

Q3 Does the addition of ranking deletions and suggesting insertions using table preference ranking, in combination with showing B± and Q±_C, provide a significant time savings versus "raw" semiautomatic data coordination?

Q4 Where does a user focus their effort during the task of coordinating a particular set of updates?

Section 5.2.1 describes our experimental prototype, and Section 5.2.2 describes the data coordination scenario and experimental protocol.

5.2.1 Experimental Prototype

In order to answer these questions, we have created a data coordination prototype which is used on an experimental data coordination instance. Our prototype connects to two databases, B and C, and is configured with a set of mappings in the form of pairs of SQL queries. It has three main screens, only one of which may be showing at any given moment.

• B Tables: Shows the current instance of B. The tables are organized into tabs so that one table is visible at a time. An example is shown in Figure 5.1.

[Figure 5.1: A screen shot of the B Tables screen of the data coordination prototype. Each table in the current instance of B is shown in a separate tab.]

• Mapped Data: The mappings are organized into tabs, with one mapping viewable at a time. Each mapping is shown side-by-side, with q_B (the SQL statement) and Q_B (the resulting data) on the left, and similarly for C on the right. An example is shown in Figure 5.2.

• C Tables: Shows the current instance of C, except with the ability to display more than one table at a time in a side-by-side manner. This feature was added so the tables can be arranged to mimic the ordering of the relations in φ, so that the user can easily see which tuples will join.
Below these tables, we display information on the translated insertions. An example is shown in Figure 5.3. Each time B changes, the B Tables screen is refreshed to show the new contents of B, and the Mapped Data screen is refreshed to show the new contents of Q_B for each mapping. A set C+ of inserted tuples and a disjunction Γ of sets of deleted tuples are generated using the techniques described in Chapter 4, and these are shown in the C Tables screen. The user is then able to enter values for the labeled nulls and select a deletion set to be applied to C.

[Figure 5.2: A screen shot of the Mapped Data screen of the data coordination prototype. Here, a mapping is shown side-by-side, with each side displaying the SQL query and resulting view instance.]

To display the set of possible updates, we inserted C+ into the tables shown in the C Tables screen. Each inserted tuple was highlighted green to make it easy to identify. This is shown in Figure 5.4. Labeled nulls were indicated with "X_n", where n is a number which uniquely identifies a labeled null. In a separate space at the bottom of the screen (I), a two-column list of all labeled nulls is displayed, with the identifiers in the left column and a space to enter a value in the right column. The user is able to double-click on the space next to a labeled null and enter a value. The corresponding tuple in the table above has its value automatically updated (II). The constraint Φ on the labeled nulls is shown in a separate tab at the bottom of the C Tables screen using Java syntax, for example (X_1 != 222020) && (X_1 != 222021). The possible deletion sets in Γ are also shown in a tab at the bottom of the C Tables screen, with each deletion set being shown as a row in a table. Each deletion set is given as a comma-separated list of tuples, where each tuple is identified using its table name and primary key.
Selecting a deletion set causes the corresponding tuples in the tables above to be highlighted orange, as shown in Figure 5.5.

[Figure 5.3: A screen shot of the C Tables screen of the data coordination prototype. In this case, two tables of C are shown side-by-side. The area below is used to display the update translations.]

[Figure 5.4: A screen shot of the C Tables screen of the data coordination prototype. A pair of inserted tuples are shown, highlighted in green. At the bottom of the screen (I), a table shows the set of labeled nulls along with the values currently chosen. As values are entered, they are displayed in the tuples which appear in the tables (II).]

5.2.2 Experimental Instance

We created a limited-scope data coordination scenario based on the real-world data we obtained for the UBCO construction project described in Section 1.2. Our intention in choosing this scenario was to be indicative of realistic complexity, but simplified enough to be comprehensible in a single 90-minute experimental session.

Our building database, B, was created from a subset of the architectural model of the UBCO building. We exported this model into ifcXML [50], and created a relational schema corresponding to doors, walls, and wall layers, which was then populated by extracting from this ifcXML. The schema for B is shown in Figure 5.6. Our cost estimate database, C, was created by choosing items from the UBCO estimate which are relevant to the data extracted for B. We also created a normal form for this estimate, by using two relations to separate the project-specific quantities of each cost estimate item from the item details. The cost estimate schema is shown in Figure 5.7, where C_ITEM contained item details and C_BUILDING_EST contained item quantities for the building in C. We chose as mappings two relationships representative of real cost estimating scenarios.
Mapping 1 expresses the relationship between an aggregation of the surface area of walls having a layer of "Concrete Blocks" and the quantity of a corresponding item in the cost estimate. These queries, which we refer to as q_B,1 and q_C,1, are shown in Figure 5.8. Mapping 2 expresses the fact that the names and counts of each type of door should be the same as the descriptions and quantities of items in the "Doors & Windows" category. The corresponding queries q_B,2 and q_C,2 are shown in Figure 5.9. The data we used is given in Appendix B for reference. The initial views for Mappings 1 and 2 are given in Tables 5.1 and 5.2 respectively.

[Figure 5.5: A screen shot of the C Tables screen of the data coordination prototype, with four deletion sets shown at the bottom of the screen. The second deletion set is selected, causing the corresponding tuples to be highlighted in the tables above.]

[Figure 5.6: The building design B schema.]

[Figure 5.7: The cost estimate C schema.]

(a) q_B,1:
SELECT SUM(AREA)
FROM B_WALL, B_HAS_LAYER, B_LAYER
WHERE B_WALL.WALL_ID = B_HAS_LAYER.WALL_ID
AND B_HAS_LAYER.LAYER_ID = B_LAYER.LAYER_ID
AND B_LAYER.CATEGORY = "Masonry"
AND B_LAYER.SUBCATEGORY = "Concrete Blocks";

(b) q_C,1:
SELECT QTY
FROM C_BUILDING_EST
WHERE ITEM_ID = 42200030;

Figure 5.8: Mapping 1 between B and C.

(a) q_B,2:
SELECT NAME, COUNT(NAME)
FROM B_DOOR
GROUP BY TYPE;

(b) q_C,2:
SELECT DESCRIPTION, QTY
FROM C_ITEM, C_BUILDING_EST
WHERE C_BUILDING_EST.ITEM_ID = C_ITEM.ITEM_ID
AND C_ITEM.CATEGORY = "Doors & Windows";

Figure 5.9: Mapping 2 between B and C.

We chose these two mappings to satisfy conflicting design goals: ease of understanding, and representativeness of realistic complexity. By using a concrete scenario we help make the mappings easy to understand and relate to. The two queries q_C,1 and q_C,2 present varying levels of difficulty for our subjects. Updates propagated through query q_C,1 will adjust the quantity of a particular
item. Since q_C,2 involves a join, it introduces an extra degree of difficulty in choosing whether an insertion or deletion should modify C_ITEM, C_BUILDING_EST, or both. We initially used a join in q_C,1 as well, but found through pilot trials that subjects struggled to understand the mapping. Using a join for the right-hand side of only one mapping increased understandability.

Table 5.1: Initial view corresponding to Mapping 1 from Figure 5.8. It consists of a single aggregate value. We use the column names from query q_B,1.

SUM(AREA)
589.0

Table 5.2: Initial view corresponding to Mapping 2 from Figure 5.9. It consists of a set of door types and the number of doors of each type. We use the column names from query q_B,2.

NAME                                       COUNT(NAME)
Hollow Core Door - Single                  1
Hollow Core Door - Single w/Vision Panel   1
Hollow Steel - Double                      2

5.2.3 Experimental Procedure

Each subject read a preparatory script which gave a high-level introduction to the scenario, specified their role as the cost estimator, and gave a description of each data source and an explanation of the mappings (see Appendix C). Each subject was given eight distinct update cases (Appendix D). In each update case a set of updates was applied to B, and the subject was required to select the corresponding update to C. Subjects were instructed that their goal was to ensure the mappings are satisfied, and that they should assume the mappings are correct and complete (i.e. they need not worry about effects on the cost estimate beyond the scope of the mappings). The first two update cases were warm-up tasks, where the preparatory script explained how to use the prototype to choose the correct translation. The results of these tasks were excluded from our results. The remaining six cases are summarized as follows:

1. The area of a single wall is decreased, causing Mapping 1 to be violated. The correct translation is to update the quantity of an item in C_BUILDING_EST.

2.
A new door is added, causing Mapping 2 to be violated. A door with the corresponding description exists in C_ITEM, so the correct translation is to insert a single tuple into C_BUILDING_EST with the corresponding quantity.

3. The material of a wall is changed, causing Mapping 1 to be violated. The correct translation is similar to the first case.

4. The type of two doors is changed, causing Mapping 2 to be violated. As opposed to case 2, new tuples must be inserted into both C_ITEM and C_BUILDING_EST.

5. The width and height of two doors are changed. In this case, neither mapping is changed and so there should be no effect on the estimate.

6. A total of four different changes to B cause both mappings to be violated.

As mentioned at the beginning of Section 5.2, question Q3 aims to evaluate whether a combination of assistance strategies provides a significant time savings versus raw semiautomatic coordination. For the cost estimating scenario, we have implemented the following assistance strategies: deletion ranking, insertion value suggestion, and provenance information. For changes which affect Mapping 1, there will always be only a single feasible deletion and no labeled nulls in the insertion, so deletion ranking and value suggestion should provide no benefit. For Mapping 2, we rank deletion sets according to the number of tuples deleted from C_BUILDING_EST, due to the inclusion dependency of C_BUILDING_EST in C_ITEM. For the same reason, we suggest values for inserted tuples based on incidence with C_ITEM. We give provenance information by revealing changes to B and to the mappings. After each update case the insertions into B are highlighted green in the B Tables screen, and the deletions are highlighted yellow.

For each update case, a user was presented with one of three experimental conditions, expressed as a configurable Assistance Level within our prototype:

• NONE. This represents the manual coordination scenario.
When the assistance level is set to NONE, the Mapped Data screen of the prototype is not available, and no update translations are given. The subject is allowed to browse and query both B and C, but must analyze the data and use their understanding of the mappings to compose SQL update statements on C.

• LOW. This represents the "raw" semiautomatic coordination case. The prototype operates as described in Section 5.2.1.

• HIGH. This employs the assistance strategies described above: the set of deletions is ranked, suggested values are given for the labeled nulls, and the changes to B and each mapping are explicitly shown.

Our inclusion criterion for subjects was that they have a basic understanding of relational database concepts and SQL. In total 14 subjects were recruited. We used a within-subjects protocol (where each subject was presented varying Assistance Levels throughout the experiments) to eliminate the possible skew of a particular subject being matched to a particular assistance level. Our results are given in the following section.

5.3 Experimental Results

Question Q1 Our first question is whether semiautomatic data coordination provides a time savings benefit over manual coordination. In our experiments, manual coordination is when the Assistance Level is set to NONE. The subject is aware of the mappings, and performs manual coordination by executing both queries for each mapping on B and C. They are instructed to examine the view differences, and shown how to mentally invert the q_C side of each mapping to arrive at a set of updates. This procedure is favourable to manual coordination because mappings have already been established and the user is aware of how to use them for coordination. It is more realistic to believe that the user would not have any formal mappings, and would begin the process by examining B for changes. We measured the amount of time for each subject to complete each update case, and compared the times for subjects with Assistance Level NONE against the times for subjects with Assistance
We measured the amount of time for each subject to complete each update case, and compared the times for subjects with Assistance Level NONE against the times for subjects with Assistance 64 Update Case 1+3 2+4 5 6 t 4.2000 1.7601 2.0698 3.6163 p 0.0005 0.1000 0.0386 0.0056 Table 5.3: Results of an independent samples t-test comparing the mean update completion time of Assistance Levels NONE and LOW for each update case. 1400 NONE LOW HIGH 1200 Time - Seconds 1000 800 600 400 200 0 1+3 2+4 5 6 Update Case Figure 5.10: The time taken for subjects to complete each update case, based on assistance level. Each bar is centered on the mean, and has height of twice the standard deviation. The whiskers indicate the minimum and maximum values. level LOW. We grouped update cases 1 and 3, as well as 2 and 4 due to their similarity. We used an independent samples t-test to estimate the likelihood p that the expected time for subjects with Assistance Level LOW is not different from the expected time for subjects with Assistance Level NONE, given the observed measurements. As can be seen in Table 5.3, we found a significant benefit (90% or greater confidence) of semiautomatic data coordination for all update cases. We note that due to our grouping of update cases (1 with 3 and 2 with 4), some of the observed times for the LOW and NONE cases are from the same research subject. This violates the t-test’s assumption that all samples are independent, resulting in estimates that are more conservative than if the samples were independent (i.e. where we would expect smaller p values). Question Q2 Our second question is to what degree does the benefit of semiautomatic data coordination will depend on the nature of the mappings and updates. In Figure 5.10, we plot the time taken for subjects to complete each update case, for each assistance level. 
Table 5.4: Results of an independent samples t-test comparing the mean update completion time of Assistance Levels LOW and HIGH for each update case.

Update Case   t         p
1+3           Not significant
2+4           2.1020    0.0259
5             -2.2521   0.0295
6             Not significant

For Update Cases 1 and 3, there is a large difference between the manual and semiautomatic approaches. This is not surprising, since both of these cases cause a violation of Mapping 1, for which update translation can be performed unambiguously (i.e. there are no labeled nulls and only one possible deletion translation), and hence data coordination is fully automated. In contrast, there is a much narrower difference between the manual and semiautomated approaches for Update Cases 2 and 4, which violate Mapping 2 and result in ambiguous update translations.

Update Case 5 is the instance where B changes but both mappings remain satisfied. All subjects who had an Assistance Level of LOW for this case were able to complete it fairly quickly, because no suggested updates are given. Interestingly, subjects who had an Assistance Level of HIGH took longer, possibly because they were distracted by explicitly seeing the changes to B (more on this below). Subjects coordinating manually for this case varied: some ran the mapping queries in MySQL Workbench and were able to fairly quickly conclude that there should be no changes, while others spent time trying to determine how B had changed, and whether there should be any effects on C.

The final case is the set of updates which violate both mappings. This set of updates is larger and more complex than all previous updates: three different types of changes to walls are made, and the type of a door is changed. Here we see clearer evidence of the benefit of semiautomatic over manual coordination.
The HIGH assistance level subjects did not outperform the LOW subjects, because the advantage of suggested values and ranked deletions was offset by the burden of reviewing the changes to B.

Question Q3

Similarly to Q1, we performed an independent samples t-test comparing the mean completion time between Assistance Levels LOW and HIGH (Table 5.4).² Surprisingly, there was no significant difference at or below p = 5% to indicate that ranking, labeled null value suggestion, and showing B changes provide a time savings. The strongest support was for Update Cases 2+4, where (correct) suggestions are given for update translations correcting a violation of Mapping 2. As seen in Figure 5.10, there was a disadvantage to Assistance Level HIGH for Update Case 5. We discuss the cause of this while investigating Q4.

² As with Q1, since some of these measurements are for the same subject, our results are slightly conservative.

[Figure 5.11: The average time per update case spent on each screen of the prototype, for each assistance level.]

Question Q4

Question Q4 asks how a user spends their effort while performing a coordination task, under each assistance level. For each update case, we analyzed the amount of time spent on each screen of our prototype: B Tables, Mapped Data, and C Tables. The results are shown in Figure 5.11. The aggregate result is shown for simplicity instead of the result for each update case. There is high variance due to the variance in difficulty of the update cases, but we have confirmed that the same trend exists for each individual update case, where the variance is much lower. The exception is for NONE, where the first update case sees a proportionally large amount of time spent examining B, and all remaining cases see a proportionally smaller amount of time examining B.
This is due to a learning effect; after the first update case, the subject learns to pay less attention to the changes to B. For the LOW assistance level, more time is spent examining the mappings relative to B than for the HIGH assistance level. This is a possible explanation for the lack of a strong benefit for the HIGH Assistance Level: because the changes to B are highlighted, subjects with a HIGH Assistance Level feel compelled to examine and understand them, even though it is not necessary. While we initially thought it would be helpful for users to be able to examine these changes, it became clear that when the only goal is to satisfy the mappings, it does not matter how B changes. In a realistic scenario where the mappings may require maintenance over time, seeing these B changes might be useful for gaining insight into the workings of the system. In informal conversations, architects expressed a general concern with computer automation: that it may make unexpected or wrong changes without the user's awareness. We expect that showing B changes will help reduce such problems.

Other Observations

We also examined the updates that each subject had selected, and compared these to the ideal updates for each case; we found two common errors:

1. For Mapping 2 violations due to inserted tuples (specifically, update case 2), subjects with assistance level LOW were inclined to assign a new ITEM ID to the tuple inserted into the C ITEM table, even though there is an existing item with the same DESCRIPTION. This could be due to an incomplete understanding of the database and mapping semantics. Another explanation is the manner in which the inserted tuple is displayed; it is not obvious that not inserting it (i.e. by assigning its labeled nulls to the values in an existing tuple) is an option, even though this was explicitly stated in the instructions.
Given that no subject who performed update case 2 with an assistance level of NONE made this mistake, we are inclined to adopt the latter explanation.

2. Several subjects with a LOW assistance level forgot to select a deletion. In the HIGH assistance level, the first-ranked deletion is automatically selected, and this choice can be overridden. In the LOW assistance level, the subject must actively select a deletion (even if there is only one possibility). This error could be corrected by requiring that the subject select a deletion before they may submit their update.

These common errors highlight the importance of an understanding of the data: incomplete knowledge of its meaning, combined with a narrow focus on selecting updates, can lead to erroneous results. This can be partially remedied by ensuring that the mappings are satisfied with respect to the updates chosen.

Each subject concluded the trials with a questionnaire. We found a general consensus on the following points:

1. Being able to query B (specifically, with the mappings) is extremely useful when doing manual coordination.

2. Surprisingly, all except for one subject found it very helpful to see the changes to B highlighted with the HIGH assistance level. Since these changes are irrelevant to the task (other than how the mapping is affected), we suspect that seeing the B changes appeared useful because it provides a slightly deeper insight into what has happened, rather than any actionable data.

3. With the NONE assistance level, determining the changes to C (i.e. inverting the queries qC,1 and qC,2) was a significant mental challenge.

4. Several subjects found the use of insert/delete pairs, rather than updates, to be a source of confusion.

5. Several subjects wanted to be able to "preview" the results of a particular choice of updates before confirming it as a final selection. This would allow them to confirm that it has the intended effect and that the mapping is satisfied.
Our experiments have confirmed that semiautomatic data coordination using the techniques developed in this thesis can provide a time savings over manual coordination. In our usability study, we used subjects with a limited understanding of the base and contingent databases and the relationship between them, but with experience in basic relational database concepts (relations, primary and foreign keys, SQL queries). Since the motivation for our work is for semiautomatic data coordination to assist users who are experts with the data (in particular, with C) but not necessarily with relational databases, there are additional steps in applying our generic approach to a given domain. In the following chapter, we address some of these in the AEC domain with a case study.

Chapter 6

A Case Study in AEC

As discussed in Section 1.2, our work is motivated in part by the problem of coordinating a construction cost estimate with a building design in the domain of Architecture, Engineering and Construction (AEC). Recall that an AEC project involves collaboration between a large number of groups with differing goals and specialties. After interviewing a number of AEC practitioners, we learned that it is a major challenge for a construction cost estimator, employed by the project's general contractor, to update their estimate as changes are made to the building design by the project's architects and engineers. The emergence of Building Information Modeling (BIM) has the potential to address the cost estimate coordination problem, but domain heterogeneity and resistance to changing traditional work practices are major barriers. In addition, cost estimating data is often proprietary, and estimating methods and data vary between different cost estimators. Some authors propose standardization to fix this, but we are skeptical of how well a "push" approach to technology will work. We prefer a "pull" approach as advocated in [31].
We have studied data provided to us for two recent construction projects, and interviewed practitioners to gain a deeper understanding of the different ways in which design changes can impact a cost estimate. By using mappings composed of pairs of queries, our approach permits the estimator to encode a broad variety of relationships between the design and estimate. These go beyond most current technology, which largely focuses on calculating material quantities. Our approach also eschews the use of a common standard to which both the designers and estimators must conform, allowing the estimator to add functionality on top of their existing process. In this chapter, we study the application of our general approach to data coordination in the context of two real-world projects. We have examined real project data and interviewed practitioners for the CIRS and UBCO projects, described in Section 1.2. We begin in Section 6.1 by analyzing data from our two cases and characterizing the features of building design data and cost estimate data which make coordination challenging in this domain. Focusing specifically on the UBCO project, we present a summary of the current cost estimating approach (Section 6.2), as well as a state-of-the-art BIM-based cost estimating application (Section 6.3). We then describe how our approach to data coordination can go beyond the state of the art to represent a richer set of relationship types (Section 6.4). Finally, we describe how our approach could be developed into a system for cost estimate coordination, and the remaining challenges, in Section 6.5.

6.1 AEC Data

As mentioned in Section 1.2, our case study is based on two real-world construction projects: the Centre for Interactive Research on Sustainability (CIRS) and the University of British Columbia Okanagan (UBCO) Arts and Sciences Building. The CIRS building is a recently constructed four-story, 63,000 sqft building on the University of British Columbia's Vancouver campus.
It is an interdisciplinary space for research on sustainability, and through cutting-edge technology aims to be one of the first buildings in Canada to be certified under the Living Building Challenge. For CIRS, we obtained an early-stage BIM produced with the popular Autodesk Revit design application, and two cost estimates dated approximately five months apart, between which aggressive cost-saving changes were made to the building design. The UBCO building is a 93,000 sqft, $31M building which includes a lecture theatre, office space, wet and dry laboratories, and animal care facilities. For the UBCO project, we obtained two detailed cost estimates (CP1 and CP2), dated 16 weeks apart, as well as the architectural and structural drawings corresponding to CP2. The drawings were provided as a digital copy of the 2D drawing sheets, which we used to recreate a subset of the building design as a BIM in Autodesk Revit, incorporating all of the details (e.g. material layers) specified in the drawings into our model; the model was then exported as ifcXML [50]. We extracted basic properties of ceilings, columns, doors, walls, slabs, spaces, and wall layers, and transformed them into a simplified XML representation. Figure 6.1 shows a small portion of this XML for illustrative purposes. A larger portion showing an example for each type of component is given in Appendix A. An important feature of the design for cost estimation is the type of each object. In Figure 6.1 we have a column of type "CC3" and a door of type "D5". Other elements such as walls and windows also have a type, and each type is associated with materials, dimensions, etc. These types are not standardized, and each architect/project may use their own set of types. Although design applications such as Autodesk Revit include a library of standard types, we found the design team made extensive use of custom types in the designs we examined, especially where unique architectural features were included.
In a 2D set of architectural drawings, the details for each type are typically found in a table of attributes. Many mappings will require aggregating objects of a given type. These mappings will not be very reusable, since in general the selection predicates will have to be updated for each new project. An approach such as mapping tables, proposed in the Hyperion project [33, 34, 45], may be useful for resolving these issues with value heterogeneity. Another means for increasing the reusability of mappings is to aggregate based on property sets. In ifcXML, each type of object can be associated with multiple property sets containing single-valued attributes for that object. For example, a WindowPanel object may have a property set with child elements for FrameDepth, FrameThickness etc. A challenge with doing this in practice is that the data as exported by applications such as Autodesk Revit may store these properties as key-value pairs, rather than using the elements which are part of the standard.

<COLUMN id="i1909">
  <NAME>M_Rectangular Column:CC3:124568</NAME>
  <OBJECTTYPE>CC3</OBJECTTYPE>
  <PROPERTYSET><NAME>PSet_Revit_Dimensions</NAME>
    <PROPERTY><NAME>Width</NAME><VALUE>1.31</VALUE></PROPERTY>
    ...
  </PROPERTYSET>
</COLUMN>
<DOOR id="i62">
  <NAME>M_Double-Flush:D5:129416</NAME>
  <OBJECTTYPE>D5</OBJECTTYPE>
  <OVERALLHEIGHT>2450</OVERALLHEIGHT>
  <PROPERTYSET>...</PROPERTYSET>
</DOOR>
<WALL id="i738">...</WALL>
<ASSOCIATESMATERIAL>
  <RELATEDOBJECTS><WALL ref="i738"/></RELATEDOBJECTS>
  <RELATINGMATERIAL><LAYERSET ref="i731"/></RELATINGMATERIAL>
</ASSOCIATESMATERIAL>
<LAYERSET id="i731">
  <MATERIALLAYER><MATERIAL ref="i782"/><THICKNESS>16</THICKNESS></MATERIALLAYER>
</LAYERSET>
<MATERIAL id="i782">
  <NAME>Finishes - Interior - Gypsum Wall Board</NAME>
</MATERIAL>
<SPACE id="i726">
  <NAME>196</NAME><LONGNAME>Theatre</LONGNAME>
  <PROPERTYSET><NAME>PSet_Revit_Dimensions</NAME>
    <PROPERTY><NAME>Area</NAME><VALUE>7442.34</VALUE></PROPERTY>
    <PROPERTY><NAME>Perimeter</NAME><VALUE>358.69</VALUE></PROPERTY>
  </PROPERTYSET>
</SPACE>
<BOUNDS>
  <RELATINGSPACE><SPACE ref="i726"/></RELATINGSPACE>
  <RELATEDBUILDINGELEMENT><WALL ref="i738"/></RELATEDBUILDINGELEMENT>
</BOUNDS>

Figure 6.1: A small portion of the transformed XML extracted from an Autodesk Revit model for a subset of the UBCO building design.

Returning to Figure 6.1, the width of a column is stored in a property whose NAME element has value "Width", essentially representing schematic information as data. Material layer sets are another basis for grouping components for aggregation, which may provide greater reuse than relying on stable type names. For example, an estimate may group interior partition walls into two items: one for interior walls belonging to Sound Transmission Class (STC) 40, and another for walls belonging to STC 45. The STC rating of a wall is not explicitly stored as an attribute, but instead depends on the number of layers of gypsum wall board in the wall: walls with two layers belong to class STC 40, while walls with four layers belong to class STC 45. In order to calculate the quantity of each type of wall, the estimator needs to aggregate walls by grouping on their number of drywall layers.
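As a rough illustration of this grouping, the following hypothetical sketch (not our prototype's mapping code) resolves the WALL, ASSOCIATESMATERIAL, LAYERSET, and MATERIAL references in a tiny document styled after Figure 6.1, and counts gypsum-board layers per wall using Python's standard library ElementTree.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny document in the style of Figure 6.1 (contents are illustrative).
doc = """<DESIGN>
  <WALL id="i738"><OBJECTTYPE>W1</OBJECTTYPE></WALL>
  <ASSOCIATESMATERIAL>
    <RELATEDOBJECTS><WALL ref="i738"/></RELATEDOBJECTS>
    <RELATINGMATERIAL><LAYERSET ref="i731"/></RELATINGMATERIAL>
  </ASSOCIATESMATERIAL>
  <LAYERSET id="i731">
    <MATERIALLAYER><MATERIAL ref="i782"/><THICKNESS>16</THICKNESS></MATERIALLAYER>
    <MATERIALLAYER><MATERIAL ref="i782"/><THICKNESS>16</THICKNESS></MATERIALLAYER>
  </LAYERSET>
  <MATERIAL id="i782"><NAME>Finishes - Interior - Gypsum Wall Board</NAME></MATERIAL>
</DESIGN>"""

root = ET.fromstring(doc)
# Index layer sets and materials by their id attributes.
layersets = {ls.get("id"): ls for ls in root.findall("LAYERSET")}
materials = {m.get("id"): m for m in root.findall("MATERIAL")}

def gypsum_layers(wall_ref):
    """Count gypsum wall board layers for one wall by chasing id/ref links."""
    count = 0
    for assoc in root.findall("ASSOCIATESMATERIAL"):
        walls = [w.get("ref") for w in assoc.findall("RELATEDOBJECTS/WALL")]
        if wall_ref not in walls:
            continue
        for ls_ref in assoc.findall("RELATINGMATERIAL/LAYERSET"):
            for layer in layersets[ls_ref.get("ref")].findall("MATERIALLAYER"):
                mat = materials[layer.find("MATERIAL").get("ref")]
                # Material names are not standardized ("GWB" vs. "Gypsum Wall
                # Board"), so a real mapping would need normalization here.
                if "Gypsum" in mat.findtext("NAME", ""):
                    count += 1
    return count

# Group walls by layer count, e.g. 2 layers -> STC 40, 4 layers -> STC 45.
groups = Counter(gypsum_layers(w.get("id")) for w in root.findall("WALL"))
print(groups)  # Counter({2: 1}): one wall with two gypsum layers
```

The indirection through three separate lookups is exactly the complexity the estimator's aggregate subquery must encode.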
A challenge with doing this is that the set of layers for a wall is associated indirectly via id/ref links. In Figure 6.1 there are three such references between four elements: WALL, ASSOCIATESMATERIAL, LAYERSET, and MATERIAL (in the original schema it is five references between six elements, excluding elements which are direct children of related elements). Therefore, in order to group walls based on the number of layers of gypsum board, our estimator needs to use an aggregate subquery which counts these layers for each wall. Adding to the challenge is that material names are not standardized: one architect might use a material named "Gypsum Wall Board" while another might use the abbreviation "GWB". The message we aim to deliver is that the two main challenges to constructing and reusing mappings on ifcXML are: 1) the flexibility within the standard in terms of naming types (i.e. the lack of standard type names), and 2) the complexity of representation (i.e. the shallow element tree and extensive use of id/ref associations). As such, we cannot expect a cost estimator to develop the expertise necessary to construct and maintain useful mappings. They will either need the assistance of someone with expert-level database knowledge, or techniques for assisting in the creation of mappings such as [52].

As mentioned in Section 1.2, a cost estimate typically consists of a single table of items representing work results (materials, labour, equipment or a combination). In North America, these items are organized into a hierarchy of categories according to MasterFormat, a numbering scheme published by the Construction Specifications Institute (CSI). For example, section 3000 is concrete and section 4000 is masonry. Within section 3000 there are subsections: 3100 for concrete formwork, 3200 for concrete reinforcement, etc. There are up to four levels, and each item has a CSI attribute specifying its section, which may be anywhere from the second level down.
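To make the numbering concrete, the sketch below rolls estimate items up to their top-level section. It assumes, as a simplification of the real MasterFormat scheme, that the top-level section of a four-digit CSI code is just its thousands digit; the example items are hypothetical.

```python
# Sketch: roll four-digit CSI codes up to their top-level section.
# Simplifying assumption: the top-level section is the thousands digit,
# consistent with the examples in the text (3000 concrete, 4000 masonry,
# 3100/3200 as subsections of 3000).
SECTION_NAMES = {3000: "Concrete", 4000: "Masonry"}

def top_level_section(csi: int) -> int:
    return (csi // 1000) * 1000

def subtotal_by_section(items):
    """items: iterable of (csi, total_price) pairs -> {section: subtotal}."""
    subtotals = {}
    for csi, total in items:
        section = top_level_section(csi)
        subtotals[section] = subtotals.get(section, 0) + total
    return subtotals

# Hypothetical estimate items: three concrete items and one masonry item.
items = [(3115, 122355), (3214, 150548), (3310, 119228), (4210, 50000)]
print(subtotal_by_section(items))
# -> {3000: 392131, 4000: 50000}
```

A helper like this would let an estimator subtotal by section regardless of how deep in the hierarchy each item's CSI attribute sits.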
In principle, estimators also assign a code that is unique to an item within a given CSI section, so that the pair (CSI, code) forms a composite key for the item. In practice, codes are often not unique, because an estimate is composed by copying, pasting and editing items in a spreadsheet. Each item has (in addition to CSI and code) a description, quantity, unit of measure, per-unit rate, and total price. In the UBCO project, the cost estimator we spoke to described a process for creating an estimate whereby there is a master spreadsheet of standard items. To create an estimate, the estimator copies items from this master spreadsheet, assigning quantities and editing their per-unit rate and description to suit any special conditions. Table 6.1 shows two items from the UBCO estimate corresponding to handrails. Both are edited copies of a common originating item from the master spreadsheet, adapted to suit the conditions of the design. Often, the details needed for exact quantities are not available, and the estimator will create "lump sum" items whose total cost is roughly estimated without using a quantity and per-unit rate. Since our aim is to assist with coordination in a way that is minimally disruptive to existing practices and usage of data, we focus on the scenario where an estimate has already been created. We will discuss ideas for future extensions in helping create an estimate in Section 6.5.

CSI    Item Code   Description                                                   Qty.   Unit   Rate      Total
5520   0010        Handrails and Railings - Metal with Glass                     194    lnft   $230.00   $44,641
5520   0010        Handrails and Railings - Metal with Perforated Metal Panels   442    lnft   $200.00   $88,385

Table 6.1: A snippet of the UBCO project Cost Plan #2 showing items relating to metal railings.

It is worth mentioning that there are companies, such as RS Means and Timberline, which provide yearly updated cost databases based on geography.
None of the projects we surveyed used commercially provided construction cost data, so we do not discuss these any further.

6.2 Current Practice: On-Screen Takeoff

Current BIM technology for automating cost estimating focuses on the problem of quantity takeoff, which is the calculation of materials and their quantities. Since the building design for the UBCO project used the traditional 2D drawing format rather than a BIM, its cost estimators used a program called On-Screen Takeoff to generate material quantities. On-Screen Takeoff allows a cost estimator to easily take off material quantities from a digital set of 2D building drawings. The estimator creates a number of "conditions", each corresponding to a type of component in the design. The estimator then draws areas (lines, polygons or points) on the drawings and associates each with a condition. Figure 6.2 shows a snapshot of a portion of the structural plan for the basement level, where a number of conditions have been created for slab-on-grade concrete (i.e. a horizontal concrete surface that is placed directly on the ground), footings (thick, reinforced sections of a concrete slab that support load-bearing walls and columns) and walls. Creating these conditions allows a detailed takeoff to be performed, as the necessary quantities are calculated automatically from the visual conditions. Figure 6.3 shows the quantities derived for a few of the footing items. Each condition can have multiple measurements associated with it, as indicated in the "Quantity [1..3]" and "UOM [1..3]" columns of Figure 6.3. For footings, these correspond to perimeter, area, and volume. The association between conditions and estimate items is many-to-many. In the cost estimate CP2 for the UBCO project, there are separate cost items for footing formwork (the wooden mold into which concrete is poured), rebar (reinforcing steel bars inside of the concrete) and the actual concrete itself.
These items, reproduced from the cost estimate, are shown in Table 6.2. Each of these items is in turn associated with all of the footing conditions created in On-Screen Takeoff. These associations are not explicitly represented anywhere.

[Figure 6.2: Screen shot of On-Screen Takeoff conditions used to take off footing concrete from a structural drawing.]

[Figure 6.3: Quantities calculated automatically from the visual conditions shown in Figure 6.2.]

In order to specify quantities for these items, the cost estimator must add up a measure of all footings (although we only show a few, there are 20 such items). In the case of footing forms, the cost estimator must manually calculate the sum of all footing areas and convert it to imperial units. In the case of footing concrete, the same must be done for all footing volumes. In the case of rebar, a formula is applied to the volume (the conversion used is 200 lb per cubic yard). This process is time-consuming and laborious. Moreover, except for creating the conditions, these steps must be repeated whenever an updated design is given.

CSI    Item Code   Description                   Qty.      Unit   Rate      Total
3115   0010        Continuous Footing Forms      8,157     sqft   $15.00    $122,355
3214   0030        Footings Rebar #5 Grade 60    158,472   lb     $0.95     $150,548
3310   0020        Footing Concrete - 25Mpa      852       cuya   $140.00   $119,228

Table 6.2: A snippet of the UBCO project Cost Plan #2 showing items relating to concrete footings.

6.3 A BIM Solution: Innovaya

There are several software applications for automatically extracting quantities from a BIM design. Most of these are produced as a pair of compatible applications, one for design and one for estimating, and use a proprietary format for transferring data between the two. The primary drawback of this approach is that it ties the choice of design software to the choice of estimating software (and vice-versa).
While this may be sufficient when a single organization does both the design and construction of a project, it is unsuitable in environments where these tasks are done by separate organizations, which is commonly the case. We reviewed a state-of-the-art application called Innovaya, which is able to read and interpret multiple different design formats (including open BIM standards and proprietary formats from Autodesk Revit), and so is not as tightly coupled to the choice of design software. When a design is imported, a set of managed quantities can easily and automatically be created. Managed quantities are roughly analogous to On-Screen Takeoff's conditions; each managed quantity corresponds to a type of component in the design. These quantities can be exported to an Excel spreadsheet or combined with a cost database to create an estimate. Figure 6.4 shows the result of associating a type of interior partition wall (on the left) to an assembly in a proprietary cost database (Timberline, on the right). In this example, the height and total length of walls of type Interior - 4 7/8" Partition (1-hr) are mapped to the length and height of the Timberline assembly 0932 - Wall - S Studs Interior. The cost database in turn is able to associate this assembly with all of the cost items needed to build it (at the bottom of the figure). Innovaya remembers this mapping, so that when the design is changed, the quantities of the taken-off assemblies and items are automatically updated to reflect it. A colleague with professional cost estimating experience investigated the use of Innovaya for estimating the CIRS project as an example design. In practice, he found Innovaya's method of grouping building components into managed quantities to be unsatisfactory for estimating some assembly types. Specifically, the CIRS building design contains a large glazed curtain wall, which Innovaya grouped into many different groups of identical types.
Many other types of architectural objects that are logically a single unit were represented by a large number of managed quantities in Innovaya. This created difficulty in matching managed quantities to assemblies. Based on our experience, we conclude that the usefulness of Innovaya for BIM-based takeoff depends on choices of design methodology, irrespective of the end product. For example, the grouping of building components into types in the design needs to match the grouping of quantities into estimate items used by the cost estimator. In another example, a case study [55] using Innovaya found that the design software has redundant choices in how to model openings (e.g. for windows), and that only some of these choices allow accounting for openings in an estimate. The representation standard adds yet another layer of redundancy. As pointed out in [40], the lack of industry-wide standards for object definitions in BIM reduces its usefulness for the cost estimator. We illustrated this problem in Section 6.1 using our example XML data for the building design. While standard schemas specify a common format that can be read by many applications, the semantics may vary considerably depending on who produced the model and with what software.

[Figure 6.4: Using Innovaya to map a managed quantity (corresponding to a particular type of interior partition wall) to a Timberline assembly. Here, the length and height attributes are mapped, as indicated by the Mapping column in the top-right section of the screen.]

6.4 Mapping Types

Here we characterize the types of mappings which are needed for estimate coordination in AEC, and how to adjust our techniques to accommodate them. Recall that the algebraic form of our mapping constraints is qB = qC. In estimate coordination, typically qB will be an aggregation over some set of building components which computes a single value, and qC selects the quantity of a single item in the estimate.
For example (in SQL):

SELECT Qty FROM Estimate WHERE CSI=3115 AND Code=0010;

So far we have not discussed the use of arithmetic comparisons in the class of queries for qB. In this case, since the comparisons are for exact equality, it is equivalent to using constants in the query body; i.e.

QC(Qty) :− Estimate(3115, 0010, d, Qty, u, r, t)

These queries can be supported in our approach by transforming the problem so that the attributes corresponding to the constants appear in the query head, and each inserted/deleted tuple uses the constant values for these attributes.³ In some cases, where a component type name in the building design corresponds to item descriptions in the cost estimate, we may have a mapping where qB returns aggregate groups of (typename, qty) pairs, which correspond to the set of (description, qty) pairs within a certain section of the estimate. Consider the section of UBCO CP2 corresponding to wood doors, as shown in Table 6.3. In this case, qB would be an aggregate query, and qC would select the description and quantity of all items with CSI=8200. A challenge in this case is that the architect often uses codes such as D1, D2 etc. for different door types, while the cost estimator uses more descriptive names. As discussed in Section 6.1, techniques such as mapping tables [33] may be useful here.

CSI    Item Code   Description          Qty.   Unit   Rate        Total
8200   0010        Hollow core doors    148    ea     $1,000.00   $148,000
8200   0010        Doors w/sidelights   87     ea     $1,500.00   $130,500
8200   0010        Sliding doors        8      ea     $750.00     $6,000

Table 6.3: A snippet of the UBCO project Cost Plan #2 showing items for wood doors.

Another mapping feature which we found would be useful for cost estimators is the use of a nonspecific relationship instead of equality. Being able to represent the fact that an estimate item is related to some query on the design, without the specifics of this relationship, can provide useful information.
We propose using mappings of the form qB ≈ qC so that when qB changes, our system can notify the cost estimator that the items selected by qC may require updating. Throughout this thesis we have defined qB as a conjunctive query; if we take the symmetric difference approach to view differencing, then we are less constrained in the class of queries to be used for qB. In the case of a mapping which associates a single aggregated value from the building design with a particular estimate item, we may also allow other data models and query languages for B. This will make the application of our approach to the AEC domain easier to implement, since it removes the need to perform any extraction and/or transformation to a relational model. For example, XQuery can be used for a building design in the ifcXML format. Another open schema is IFC [42], which is written using the EXPRESS object-oriented data definition model, for which a query language exists. By using these models and languages, powerful analytical queries can be constructed and associated with estimate items. For example, there is work on extending SQL [9] and XQuery [49] to support spatial operators specific to BIM. After speaking with cost estimators, we have identified three broad classes of mappings, which we describe in more detail below.

³ Using more general arithmetic comparisons is non-trivial and is discussed in [2]. We do not need them in the AEC domain.

Takeoff

Extracting material quantities is the central feature of the most advanced software applications currently in use. Using our UBCO building data, we were able to create a number of queries extracting various quantities. The benefit is that through the flexible use of selection predicates, we are able to match the granularity of the grouping of building components to the estimate items, which was a problem with both methods of quantity takeoff surveyed in the previous section.
For example, the following XQuery statement extracts the total area of all suspended slabs.

SUM( FOR $X IN //SLAB
     WHERE $X/PROPERTYSET/PROPERTY/VALUE = "Suspended Slab"
     RETURN DATA($X/PROPERTYSET/PROPERTY[NAME='Area']/VALUE))

This quantity is relevant to several estimate items, such as suspended slab forms, whose quantity should equal this value, and concrete cleaning, which is a lump sum item whose relationship to the area of suspended slabs is not specified exactly. We call these proxy mappings, which are discussed in the next section.

Proxies

We briefly mentioned "lump sum" items in Section 6.1. During the early design phase of a building project, exact details are often not available, and lump sum estimates are often created as a rough guess of the total cost of an item. For example, cost plan #1 of the UBCO building made lump sum estimates of the cost of millwork (wood features such as desks) where details were lacking. Table 6.4 shows these items.

CSI    Item Code   Description                                         Qty.   Unit   Rate         Total
6220   0010        Millwork - Lecture Theatre (Lecturn Desk)           1      ea     $13,000.00   $13,000
6220   0010        Millwork - Projection Room                          1      ea     $4,000.00    $4,000
6220   0010        Millwork - Classroom (Lecturn Desk)                 1      ea     $13,000.00   $13,000
6220   0010        Millwork - Custom millwork in Board room (RM 470)   1      ea     $20,000.00   $20,000

Table 6.4: A snippet of the UBCO project Cost Plan #1 showing lump sum items for millwork.

The cost estimator has created rough lump sum estimates for millwork in a few of the building's rooms. Exact millwork details are not available at this time, but the estimator may use some related building features which are available as proxies. In this example, the cost of lecture theatre millwork is related to the gross area of the lecture theatre; if the lecture theatre changes in size, the cost of this item is likely to need updating as well. This is captured in the following query, which we would associate in a non-specific way to the item for lecture theatre millwork.
//SPACE[LONGNAME="Theatre"]/PROPERTYSET/PROPERTY[NAME="Area"]/VALUE

Constructibility

Certain overall properties of the building design can impact the construction process, and hence its cost, in a way which is independent of material quantities. “Constructibility” refers to the degree of difficulty of constructing the building’s features. Constructibility queries are queries which check for or measure these conditions. For example, in [49] the authors propose the use of spatial BIM queries to examine the number and location of places where walls are penetrated by mechanical and electrical elements such as plumbing and wire conduits. In another example, a building design has a conceptual grid of lines along the X and Y axes, which is used as a guideline to lay out major structural components such as columns and walls. The number of columns located off of these major grid lines is relevant to the cost of column formwork. Using the ongrid predicate defined in [49], the cost estimator could create a mapping which counts the number of off-grid columns as shown below.

SUM(
  FOR $X IN //COLUMN
  WHERE ongrid($X) = boolean(0)
  RETURN 1)

Constructibility mappings typically associate some aggregate quantity with the per-unit rate of an item. In another example, if more than 50% of interior partition walls exceed a height of 10 feet, then a higher rate per square foot should be used for partition walls in the estimate, due to additional equipment needs (e.g. ladders) and reduced productivity of the installation crew.

6.5 A System for AEC

One of the goals of our approach in the context of AEC is to be complementary to a cost estimator’s existing work processes, rather than to replace them. In this section we take a broader view and describe how a data coordination system would fit into the processes of cost estimating, referencing ideas we have developed in earlier chapters of this thesis and highlighting areas for practically motivated future work.
Figure 6.5 gives an overview of the process for cost estimate coordination using the Integration Definition for Function Modeling (IDEF0) language. Each box is a task. The incoming arrows on its left describe the task’s inputs; on the top, its controls; on the bottom, its mechanisms; the outgoing arrows on its right give the task’s outputs. Tasks A0 through A2 are performed for each new building that needs to be estimated, and tasks B0, C0 and C1 are performed (in that order) each time a building is updated.

Figure 6.5: An IDEF0 diagram illustrating our approach for updating cost estimates with the help of mappings. Activities A0 through A2 are performed once for each new building to be estimated, and activities B0, C0 and C1 are performed each time the building design is updated.

A0 — Create Project Mappings
The first task for each new building project is, given a BIM, to create project mappings (A0). This will inevitably require expertise in the data modeling language used for representing the BIM and its associated query language. We do not expect a cost estimator to have the time to develop this expertise, hence it will be a tandem effort with someone who does (a “data modeler”). Creating mappings is likely to be the most challenging task in implementing our approach. Although we largely leave this problem as future work, we indicate a few promising directions. There are a number of approaches for creating schema mappings (see [53] for a survey). Many of these are likely to be of limited use due to domain heterogeneity. One recent area which may be useful is schema summarization [58]; the ability to summarize large and complex schemas such as IFC would undoubtedly make it easier for the cost estimator and data modeler to create mappings. Another potential means for simplifying this task is to reuse mappings from previous building projects. That is why we have labeled “Existing Mapping Library” as an input to task A0.
In some cases, for example if the BIM is from the same design team as a previous project, a large number of mappings may be reusable. If the BIM is from a different design team but in the same format (e.g. ifcXML), then some of the mappings from previous projects may be reusable, and some may need modifications. Using templates for common mapping types may also speed up this task. For example, the cost estimator can use a template for a query which aggregates all square columns in the design, specifying different selection predicates such as ObjectType=“CC1” vs. ObjectType=“ColA” depending on the design. The set of mappings created in task A0 need not be complete, in the sense that the cost estimator does not need to encode all possible relationships as mappings. Our approach works with existing practices and can be used with as little as one mapping; adding mappings yields greater benefits from the system, following the pay-as-you-go data integration tenet of [30]. The cost estimator may want to reuse a given query qB in multiple mappings. We suggest maintaining a library of aliased design queries, and creating mappings by associating aliases with estimate queries. This will make the task of maintenance easier.

A1 — Generate Views
The views are generated by executing the set of all qB’s from the mappings created in task A0. This is done automatically using a query execution engine for the data model in which the BIM is specified. These views are then stored locally on the cost estimator’s computer, along with the mapping to which each is associated.

A2 — Generate Estimate
For this task, the cost estimator generates an estimate for the project using whatever methods they are familiar with. We give the set of views as a control, since the estimator may wish to examine the views in doing so.

B0 — Update BIM
Upon completing a round of design revisions, the architect sends an updated BIM to the cost estimator, replacing the previous version.
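Tasks A1 and B0 together amount to evaluating the design queries against the current BIM and materializing the results. A minimal Python sketch of this view-generation step is shown below; it assumes the BIM is an ifcXML export queried with simple XPath expressions, and the mapping aliases, paths, and function name `generate_views` are all hypothetical, not part of the thesis prototype:

```python
import xml.etree.ElementTree as ET

# Hypothetical aliased design-query library (task A0 output):
# alias -> XPath over the ifcXML export. Paths are illustrative only.
MAPPINGS = {
    "suspended_slab_area": ".//SLAB/PROPERTYSET/PROPERTY[NAME='Area']/VALUE",
    "theatre_area": ".//SPACE[LONGNAME='Theatre']/PROPERTYSET/PROPERTY[NAME='Area']/VALUE",
}

def generate_views(bim_xml: str) -> dict:
    """Task A1: evaluate every design query against the BIM and
    materialize each result as a view (here, a list of values)."""
    root = ET.fromstring(bim_xml)
    return {alias: [e.text for e in root.findall(path)]
            for alias, path in MAPPINGS.items()}

# Task B0: when the architect sends an updated BIM, the same queries
# are simply re-evaluated on the new file to produce the new views.
```

The stored views are then the input to task C0, which compares the old and new materializations.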
C0 — Compare Views
Tasks C0 and C1 contain our methods for view differencing and update translation, as presented in Chapters 3 and 4 respectively. They are repeated each time the building design is updated (B0) and a new estimate is required. We assume that the BIM is small enough that the estimator can retain a copy of each version, or at least the previous version. Depending on which queries are used for the left hand side of the mappings, incremental view maintenance may be an option for view differencing. Alternatively, since the designs are typically not very large (80MB in the case of CIRS), a new set of views can be generated by reevaluating the queries in the mappings on the new BIM.

C1 — Update Estimate
The estimator automatically receives a set of estimate changes as the output from C0. They are then able to examine each change: the old value, the new value, the estimate item in question, and the query associated with the mapping, and from there decide how the estimate should be updated.

From our interviews with practitioners, we learned that a common concern with computer automation in BIM-based solutions is that the system will take an incorrect or unexpected action without the practitioner’s awareness. Bearing this in mind, the type of provenance discussed in Section 5.1 is likely to be useful, so that the cost estimator can understand why a given update has been performed or is suggested. This allows the cost estimator to verify that the suggested changes are correct and, if they are not, to refine the mappings as necessary. Our user study in Section 5.2 showed that the representation of updated tuples as inserted/deleted pairs is unintuitive and leads to confusion. This should be taken into consideration when presenting updates to the cost estimator. For example, in the UBCO cost plans the triple (CSI, Code, Description) forms a unique key for estimate items.
For any pair of inserted and deleted tuples which have the same values for all three attributes, viewing them as a single update is likely to be more comprehensible. The updated views generated by task C0 are then stored, replacing the old views generated in task A1. These views are used as input the next time the BIM is updated. Similarly, the updated estimate replaces the old estimate and is used as input the next time task C1 is performed. We believe that a general approach following this design, equipped with the ability to represent and execute mappings of the type described in Section 6.4, will:

• Decrease the time needed for an estimator to update their estimate given an updated design.

• Result in a more accurate estimate, by taking into account constructibility conditions that may have otherwise gone unnoticed.

• Result in expanded coverage of the cost estimate, by allowing a broader class of design relationships to be considered.

These claims are based on our experience examining data for the UBCO and CIRS projects, and on our interviews with cost estimators. Although we do not provide an experimental evaluation, we believe this is an important direction for future research. Addressing some of the challenges discussed in Sections 6.1 and 6.5 and progressing towards a working system or set of coordination tools for the AEC domain is likely to result in progress that can be applied to other domains. In the following chapter we also give directions for future research which broadens the core data coordination problem itself, an important step for addressing practical problems such as those discussed here.

Chapter 7

Summary and Future Research

7.1 Summary

In this thesis, we introduced the problem of coordinating a contingent data source C with an autonomous, heterogeneous base source B on which it depends. In Chapter 2, we proposed the use of bidirectional dependencies (bdds) for expressing exact relationships between B and C.
A bdd specifies that the result of a conjunctive query qB on B should be equal to the result of a conjunctive query qC on C, which differs from the commonly used source-to-target tuple-generating dependencies (s-t tgds), which specify a subset relationship. Using a bdd mapping, we outlined a solution which uses a materialized view and formalized the data coordination problem in this context, whereby we have existing instances C and QB (the view), an updated instance B′, and must find the corresponding updated instance C′. We proposed an approach which first finds the set Q±B of view updates (View Differencing), then uses these to find the set of all possible contingent updates C± (Update Translation), and guides the user to select amongst these a single unique update (Update Completion). In Chapter 3 we addressed view differencing by materializing Q′B and taking the symmetric difference with QB. We also adapted a well known incremental view maintenance algorithm, which is more efficient when the number of updated tuples and the number of relations in the view definition are small, but requires access to the old base data B and the updates B± to B. We compared the two approaches in an experimental evaluation. In Chapter 4 we described methods for translating a set Q±B of view updates into the set of all possible updates C± on the contingent database. We used the tgd chase to translate view insertions into a set of incomplete insertions on C. We then derived a constraint Φ on the labeled nulls in these incomplete insertions so that the bdd holds exactly with respect to the view insertions. We gave necessary and sufficient conditions for the existence of a feasible solution. We defined the notion of a possible answers generalization of a conjunctive query, and used it to construct an optimized algorithm for finding Φ by directly building conditions rather than computing joins over incomplete information.
We proposed translating a set of view deletions into a minimized formula in disjunctive normal form, where each conjunctive clause corresponds to a possible deletion translation C−. We described an overall process which interleaves the translation of inserted and deleted tuples to arrive at a combined solution. We demonstrated the effectiveness of our algorithms empirically using a subset of the well known TPC-H benchmark instance.

In Chapter 5 we discussed assistance strategies to help guide a user in selecting a solution from the possibilities generated in Chapter 4. We performed a user study based on the building-design/cost-estimate coordination problem from Section 1.2. For our user study we built a prototype data coordination system, which was used to compare the benefits of semiautomatic vs. manual coordination, and of ranking deletions and suggesting labeled null values, as measured by the amount of time needed to coordinate updates, for several types of mappings and update sets. We found a significant benefit for semiautomatic data coordination, but not for ranking deletions and suggesting labeled null values, and gathered insights on common challenges and useful features for a data coordination implementation. In Chapter 6 we reexamined the Architecture, Engineering and Construction (AEC) domain in a case study based on two real world construction projects. We discussed specific challenges in that domain, relating to a lack of standardization as well as the relatively large leeway within existing data standards. We discussed problems with and limitations of current approaches to BIM-based cost estimating. We characterized the types of mappings which would be used in a cost estimate coordination application, and some of the challenges in implementing them. We concluded with a process-based description of how data coordination would fit within a cost estimator’s workflow.
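The view differencing step summarized above (materialize the new view and take the symmetric difference with the old), together with the pairing of insertions and deletions that share a key as suggested by our user study, can be sketched in a few lines of Python. This is an illustrative sketch under our own naming, not the thesis's implementation:

```python
def diff_views(old_view, new_view):
    """Symmetric-difference view differencing: given the old and new
    materializations as collections of hashable tuples, return the
    insertions (Q+) and deletions (Q-)."""
    old, new = set(old_view), set(new_view)
    return new - old, old - new

def pair_updates(inserted, deleted, key):
    """Pair each insertion with a deletion agreeing on the key attributes
    (e.g. (CSI, Code, Description) in the UBCO cost plans), so the pair
    can be presented to the estimator as a single update."""
    deleted_by_key = {key(t): t for t in deleted}
    updates, pure_inserts = [], []
    for t in inserted:
        k = key(t)
        if k in deleted_by_key:
            updates.append((deleted_by_key.pop(k), t))  # (old, new)
        else:
            pure_inserts.append(t)
    return updates, pure_inserts, list(deleted_by_key.values())
```

For example, a quantity change on one estimate item would surface as one insertion and one deletion sharing the (CSI, Code, Description) key, which `pair_updates` folds into a single update.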
7.2 Future Research

Our work lays the groundwork for coordinating sets of updates between pairs of heterogeneous sources. There are a number of areas for future work. We have already discussed (in Section 5.1) some ideas for the development of assistance strategies to help a user choose a set of update translations from amongst the possible translations generated. Another important direction, discussed in Section 6.5, is methods which assist the user in creating mappings and in reusing mappings (e.g. through templatization). There are other promising directions in generalizing the core data coordination problem and in extending the class of mappings. Here we describe what we believe to be the most promising directions, as well as thoughts on the anticipated challenges.

7.2.1 Arithmetic Constraints

In [2], Afrati et al. address the problem of data exchange where arithmetic comparisons are allowed in the target formula of an s-t tgd, e.g.

S(x) → T1(x, y), T2(y, z), y < z

They describe a chase procedure which forms a tree structure by considering all possible total orderings induced by the partial orders resulting from each application of the tgd chase rule. The result is useful for computing certain answers for conjunctive queries with arithmetic comparisons. Since our goal is to resolve a ground instance by assisting a user in assigning values to the labeled nulls, we do not need to consider all total orderings of these labeled nulls. We do need to worry about the possibility that a given set of insertions Q+C may generate an inconsistent set of arithmetic comparisons (e.g. z < 2 ∧ z > 4), and about what to do if this is the case. The problem of deletion translation changes as well, since the disjunction for each witness to each deleted tuple now contains a complemented inequality, which is itself an inequality.
Following our example above, its complement is

¬S(x) → ∀y, z (¬T1(x, y) ∨ ¬T2(y, z) ∨ y ≥ z)

This means that deletion can be achieved by deleting T1(x, y) or T2(y, z), or by changing the values of y and/or z so that y ≥ z.

7.2.2 Multiple Mappings

In most data coordination scenarios of practical interest there will be more than one mapping. Extending our methods to account for multiple mappings should be a priority. The main challenge in our formulation of data coordination is due to the combination of bidirectional dependencies and the restriction that updates only propagate from the base to the contingent source. As discussed at the beginning of Chapter 4, this second restriction is what prevents us from turning a set of bdds into a set of tgds and applying the tgd chase, as has been done for peer databases in [32]. Given a set Σ = {σ1, . . . , σm} of bdds, we could perform the chase step as discussed in Section 4.2.1 individually for each bdd, and take the union of their results as our incomplete instance. A potential concern is that Theorem 4.1 is no longer necessary for the existence of certain spurious tuples (i.e. tuples violating the deletion tgd), and an important component of research in this direction would be to define necessary and sufficient conditions for a feasible translation to exist. Our suspicion is that the right hand sides of these bdds should not intersect, either because they each involve conjunctions over different relations, or because they select different values for common attributes. This is the reason why we were able to use two mappings in our user study in Section 5.2: the queries on the right hand side of these mappings do not intersect. In terms of translating deletions, we could modify the TranslateDeletions algorithm to consider all deleted tuples from the views defined by all bdds in step 6 of Algorithm 4.2. We expect that the feasibility of translating deletions under multiple bdds will depend on conditions similar to those for insertions.
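Our suspicion that the right hand sides must not intersect can be approximated by a conservative syntactic check. The following Python sketch assumes a hypothetical encoding of a bdd's right-hand side as a dict from relation names to constant selections; both the encoding and the function name are ours, purely for illustration:

```python
def rhs_may_intersect(rhs_a, rhs_b):
    """Conservative check of whether two bdds' right-hand sides could
    select overlapping tuples of C. Each rhs is a dict mapping a relation
    name to a dict of attribute -> constant selections (hypothetical
    encoding). Returns False only when the mappings provably touch
    disjoint data: either no common relation, or contradictory constant
    selections on every shared relation."""
    common = set(rhs_a) & set(rhs_b)
    if not common:
        return False  # conjunctions over entirely different relations
    for rel in common:
        sel_a, sel_b = rhs_a[rel], rhs_b[rel]
        for attr in set(sel_a) & set(sel_b):
            if sel_a[attr] != sel_b[attr]:
                break  # different values for a common attribute: disjoint
        else:
            return True  # no contradiction found on this shared relation
    return False
```

Such a check could flag, before coordination begins, pairs of mappings for which the per-bdd chase-and-union strategy might produce spurious tuples.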
7.2.3 Multiple Base Sources

The dependence of a contingent source on multiple base sources is a natural extension of the data coordination problem as we have defined it. For example, the cost estimator’s data C may depend not only on the building design B, but also on the current market price of materials as dictated by its suppliers S. We expect that dependence on multiple base sources will produce challenges closely related to those which arise when dealing with multiple mappings, as discussed in the previous section. It may be possible to transform the multiple base source problem into a multiple mapping problem by viewing the relations of all base sources as a single collection.

7.2.4 Using Constraints

During the creation of a concrete data coordination instance for our user study in Chapter 5, it became clear to us that information about C’s schema, such as primary key and foreign key constraints, is useful for narrowing the list of update translations. For example, here is the right hand side of a mapping used in that instance:

QC(desc, qty) ↔ C BUILDING EST(id, qty), C ITEM(id, desc, r, u)

In reality, there is a functional dependency from C ITEM.desc to all other attributes of C ITEM, which should “automatically” determine which values to use for id, r, and u for a given inserted tuple in Q+C. The exception is if there is no tuple in C ITEM with the corresponding description, in which case the insertion represents a new item, for which new id, r and u values must be chosen. This can be achieved by representing the functional dependency as an equality generating dependency (egd), and incorporating egd chase rules into the chase procedure for insertion translation. Research in this direction would need to update the conditions under which a feasible solution exists. Similarly, there is an inclusion dependency C BUILDING EST[id] ⊂ C ITEM[id], which means that a deletion in Q−C should always delete from C BUILDING EST.
We had informally considered both of these dependencies via our update completion assistance strategies, as discussed in Section 5.1. In particular, the approach of suggesting labeled null values based on the intersection with the current instance of C produces the same result as using the functional dependency. Similarly, the ranking of deletions which prefers those that delete from C BUILDING EST always places the deletion set which respects the inclusion dependency first. Taking advantage of these dependencies directly would allow for a more general approach than relying on such ranking heuristics.

Bibliography

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

[2] F. Afrati, C. Li, and V. Pavlaki. Data exchange in the presence of arithmetic comparisons. In EDBT, pages 487–498, 2008.

[3] B. Alexe, L. Chiticariu, R. J. Miller, and W. C. Tan. Muse: Mapping understanding and deSign by example. In ICDE, pages 10–19, 2008.

[4] T. Anderson, Y. Breitbart, H. F. Korth, and A. Wool. Replication, consistency, and practicality: are these mutually exclusive? In SIGMOD, pages 484–495, 1998.

[5] M. Arenas, P. Barceló, L. Libkin, and F. Murlak. Relational and XML Data Exchange. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2010.

[6] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The Hyperion project: from data integration to data coordination. SIGMOD Record, 32(3):53–58, 2003.

[7] F. Bancilhon and N. Spyratos. Update semantics of relational views. TODS, 6(4):557–575, 1981.

[8] P. Barceló. Logical foundations of relational data exchange. SIGMOD Record, 38(1):49–58, 2009.

[9] A. Borrmann and E. Rank.
Specification and implementation of directional operators in a 3D spatial query language for building information models. Advanced Engineering Informatics, 23(1):32–44, 2009.

[10] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316–330, 2001.

[11] P. Buneman, S. Khanna, and W. C. Tan. On propagation of deletions and annotations through views. In PODS, pages 150–158, 2002.

[12] S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In VLDB, pages 577–589, 1991.

[13] L. S. Colby, A. Kawaguchi, D. F. Lieuwen, I. S. Mumick, and K. A. Ross. Supporting multiple view maintenance policies. In SIGMOD, pages 405–416, 1997.

[14] Y. Cui and J. Widom. Practical lineage tracing in data warehouses. In ICDE, pages 367–378, 2000.

[15] C. Eastman, P. Teicholz, R. Sacks, and K. Liston. BIM Handbook: A Guide to Building Information Modeling for Owners, Managers, Designers, Engineers, and Contractors. John Wiley & Sons, Inc., 2008.

[16] C. M. Eastman, Y. Jeong, R. Sacks, and I. Kaner. Exchange model and exchange object concepts for implementation of national BIM standards. Computing in Civil Engineering, 24(1):25–34, January 2010.

[17] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. In ICDT, pages 207–224, 2003.

[18] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theoretical Computer Science, 336(1):89–124, 2005.

[19] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: getting to the core. ACM Trans. Database Syst., 30(1):174–210, March 2005.

[20] J. N. Foster, M. B. Greenwald, J. T. Moore, B. C. Pierce, and A. Schmitt. Combinators for bidirectional tree transformations: A linguistic approach to the view-update problem. ACM Trans. Program. Lang. Syst., 29(3), 2007.

[21] M. Friedman, A. Y. Levy, and T. D. Millstein. Navigational plans for data integration. In AAAI/IAAI, pages 67–73.
AAAI Press / The MIT Press, 1999.

[22] Cost estimating and assessment guide. Technical Report GAO-09-3SP, United States Government Accountability Office, March 2009.

[23] G. Grahne. The Problem of Incomplete Information in Relational Databases, volume 554 of LNCS. Springer, 1991.

[24] T. J. Green, G. Karvounarakis, Z. G. Ives, and V. Tannen. Update exchange with mappings and provenance. In VLDB, pages 675–686, 2007.

[25] T. J. Green, G. Karvounarakis, N. E. Taylor, O. Biton, Z. G. Ives, and V. Tannen. ORCHESTRA: facilitating collaborative data sharing. In SIGMOD, pages 1131–1133, 2007.

[26] T. Griffin and L. Libkin. Incremental maintenance of views with duplicates. In SIGMOD, pages 328–339, 1995.

[27] T. Griffin, L. Libkin, and H. Trickey. An improved algorithm for the incremental recomputation of active relational expressions. TKDE, 9(3):508–511, 1997.

[28] A. Gupta and I. S. Mumick, editors. Materialized Views: Techniques, Implementations, and Applications. MIT Press, Cambridge, MA, USA, 1999.

[29] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining views incrementally. In SIGMOD, pages 157–166, 1993.

[30] A. Y. Halevy, M. J. Franklin, and D. Maier. Principles of dataspace systems. In PODS, pages 1–9, 2006.

[31] T. Hartmann, H. van Meerveld, N. Vossebeld, and A. Adriaanse. Aligning building information model tools and construction management methods. Automation in Construction, 22:605–613, 2012.

[32] G. Karvounarakis and Z. G. Ives. Bidirectional mappings for data and update exchange. In WebDB, 2008.

[33] A. Kementsietsidis, M. Arenas, and R. J. Miller. Managing data mappings in the Hyperion project. In ICDE, pages 732–734, 2003.

[34] A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping data in peer-to-peer systems: Semantics and algorithmic issues. In SIGMOD, pages 325–336, 2003.

[35] B. Kimelfeld. A dichotomy in the complexity of deletion propagation with functional dependencies. In PODS, pages 191–202, 2012.

[36] B. Kimelfeld, J.
Vondrák, and R. Williams. Maximizing conjunctive views in deletion propagation. ACM Trans. Database Syst., 37(4):24, 2012.

[37] L. Kot, N. Gupta, S. Roy, J. Gehrke, and C. Koch. Beyond isolation: research opportunities in declarative data-driven coordination. SIGMOD Record, 39:27–32, September 2010.

[38] L. Kot and C. Koch. Cooperative update exchange in the Youtopia system. PVLDB, 2(1):193–204, 2009.

[39] Y. Kotidis, D. Srivastava, and Y. Velegrakis. Updates through views: A new hope. In ICDE, page 2, 2006.

[40] W. E. Kraus, S. Watt, and P. D. Larson. Challenges in estimating costs using building information modeling. AACE International Transactions, pages IT11–IT13, 2007.

[41] M. Lawrence, R. Pottinger, and S. Staub-French. Data coordination: Supporting contingent updates. In PVLDB, volume 4, pages 831–842, 2011.

[42] T. Liebich, Y. Adachi, J. Forester, J. Hyvarinen, K. Karstila, K. Reed, S. Richter, and J. Wix. IFC2x edition 3 technical corrigendum 1. http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/index.htm, 2007. Online, accessed 2012/09/20.

[43] B. Liu, L. Chiticariu, V. Chu, H. V. Jagadish, and F. Reiss. Refining information extraction rules using data provenance. IEEE Data Eng. Bull., 33(3):17–24, 2010.

[44] B. Marnette, G. Mecca, P. Papotti, S. Raunich, and D. Santoro. ++Spicy: an opensource tool for second-generation schema mapping and data exchange. PVLDB, 4(12):1438–1441, 2011.

[45] M. M. Masud, I. Kiringa, and H. Ural. Update processing in instance-mapped P2P data sharing systems. IJICS, 18(3/4):339–379, 2009.

[46] T. L. McCuen. The quantification process and standards for BIM. AACE International Transactions, pages BIM.01.1–BIM.01.11, 2009.

[47] T. L. McCuen. Underdeveloped and underutilized: Cost estimating in BIM. AACE International Transactions, pages BIM.01.1–BIM.01.10, 2010.

[48] R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, pages 77–88, 2000.

[49] M. P. Nepal, S. Staub-French, R.
Pottinger, and A. Webster. Querying a building information model for construction-specific spatial information. Advanced Engineering Informatics, to appear, 2012.

[50] N. Nisbet, T. Liebich, P. Houbaux, G. Skarseth, and F. Grobler. ifcXML implementation guide. Technical report, International Alliance for Interoperability, February 2005.

[51] M. A. Olson, K. Bostic, and M. Seltzer. Berkeley DB. In USENIX, pages 43–43, 1999.

[52] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In SIGMOD Conference, pages 73–84, 2012.

[53] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.

[54] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.

[55] S. Tiwari, J. Odelson, A. Watt, and A. Khanzode. Model based estimating to inform target value design. http://www.aecbytes.com/buildingthefuture/2009/ModelBasedEstimating.html, August 2009. Online, accessed 2012/11/12.

[56] TPC benchmark H, standard specification, revision 2.11.0, April 2010.

[57] M. Venugopal, C. Eastman, R. Sacks, and J. Teizer. Semantics of model views for information exchanges using the industry foundation class schema. Advanced Engineering Informatics, 26:411–428, 2012.

[58] C. Yu and H. V. Jagadish. Schema summarization. In VLDB, pages 319–330, 2006.

[59] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a warehousing environment. In SIGMOD, pages 316–327, 1995.
92 Appendix A Subset of UBCO XML Data for Case Study <COLUMN id="i1909"> <NAME>M_Rectangular Column:CC3:124568</NAME> <OBJECTTYPE>CC3</OBJECTTYPE> <TAG>124568</TAG> <PROPERTYSET><NAME>PSet_Revit_Dimensions</NAME> <PROPERTY><NAME>Depth</NAME><VALUE>1.31</VALUE></PROPERTY> <PROPERTY><NAME>Width</NAME><VALUE>1.31</VALUE></PROPERTY> <PROPERTY><NAME>Offset Base</NAME><VALUE>0</VALUE></PROPERTY> </PROPERTYSET> <PROPERTYSET><NAME>PSet_Revit_Constraints</NAME> <PROPERTY><NAME>Base Offset</NAME><VALUE>0</VALUE></PROPERTY> <PROPERTY><NAME>Top Offset</NAME><VALUE>-0.65</VALUE></PROPERTY> </PROPERTYSET> </COLUMN> <DOOR id="i62"> <NAME>M_Double-Flush:D5:129416</NAME> <OBJECTTYPE>D5</OBJECTTYPE> <TAG>129416</TAG> <OVERALLHEIGHT>2450</OVERALLHEIGHT> <OVERALLWIDTH>2450</OVERALLWIDTH> <PROPERTYSET><NAME>PSet_DoorCommon</NAME> <PROPERTY><NAME>IsExternal</NAME><VALUE>1</VALUE></PROPERTY> </PROPERTYSET> <PROPERTYSET><NAME>PSet_DoorCommon</NAME> <PROPERTY><NAME>Thickness</NAME><VALUE>0.147</VALUE></PROPERTY> <PROPERTY><NAME>Height</NAME><VALUE>8.038</VALUE></PROPERTY> </PROPERTYSET> </DOOR> 93 <WALL id="i738"> <NAME>Basic Wall:127324</NAME> <OBJECTTYPE>Basic Wall:127087</OBJECTTYPE> <TAG>127324</TAG> <PROPERTYSET><NAME>PSet_WallCommon</NAME> <PROPERTY><NAME>IsExternal</NAME><VALUE>1</VALUE></PROPERTY> <PROPERTY><NAME>LoadBearing</NAME><VALUE>0</VALUE></PROPERTY> </PROPERTYSET> <PROPERTYSET><NAME>PSet_Revit_Dimensions</NAME> <PROPERTY><NAME>Length</NAME><VALUE>15.09</VALUE></PROPERTY> <PROPERTY><NAME>Area</NAME><VALUE>136.92</VALUE></PROPERTY> <PROPERTY><NAME>Volume</NAME><VALUE>82.65</VALUE></PROPERTY> </PROPERTYSET> </WALL> <ASSOCIATESMATERIAL> <RELATEDOBJECTS><WALL ref="i738"/></RELATEDOBJECTS> <RELATINGMATERIAL><LAYERSET ref="i731"/></RELATINGMATERIAL> </ASSOCIATESMATERIAL> <LAYERSET id="i731"> <MATERIALLAYER><MATERIAL ref="i782"/><THICKNESS>16</THICKNESS></MATERIALLAYER> <MATERIALLAYER><MATERIAL ref="i783"/><THICKNESS>22</THICKNESS></MATERIALLAYER> <MATERIALLAYER><MATERIAL 
ref="i784"/><THICKNESS>190</THICKNESS></MATERIALLAYER> </LAYERSET> <MATERIAL id="i782"> <NAME>Finishes - Interior - Gypsum Wall Board</NAME> </MATERIAL> <MATERIAL id="i783"> <NAME>Metal - Furring</NAME> </MATERIAL> <MATERIAL id="i784"> <NAME>Masonry - Concrete Blocks</NAME> </MATERIAL> <SPACE id="i726"> <NAME>196</NAME> <LONGNAME>Theatre</LONGNAME> <INTERIOROREXTERIORSPACE>internal</INTERIOROREXTERIORSPACE> <PROPERTYSET><NAME>PSet_Revit_Dimensions</NAME> <PROPERTY><NAME>Area</NAME><VALUE>7442.34</VALUE></PROPERTY> 94 <PROPERTY><NAME>Perimeter</NAME><VALUE>358.69</VALUE></PROPERTY> <PROPERTY><NAME>Unbounded Height</NAME><VALUE>22.12</VALUE></PROPERTY> </PROPERTYSET> </SPACE> <BOUNDS> <RELATINGSPACE><SPACE ref="i726"/></RELATINGSPACE> <RELATEDBUILDINGELEMENT><WALL ref="i726"/></RELATEDBUILDINGELEMENT> </BOUNDS> <BOUNDS> <RELATINGSPACE><SPACE ref="i726"/></RELATINGSPACE> <RELATEDBUILDINGELEMENT><DOOR ref="i62"/></RELATEDBUILDINGELEMENT> </BOUNDS> <FILLS> <RELATEDBUILDINGELEMENT><WALL ref="i..."/></RELATEDBUILDINGELEMENT> <RELATEDOPENINGELEMENT><DOOR ref="i62"/></RELATEDOPENINGELEMENT> </FILLS> 95 Appendix B Datasets for Usability Study B DOOR table DOOR ID 62 63 64 65 NAME Hollow Steel - Double Hollow Steel - Double Hollow Core Door Single Hollow Core Door Single w/Vision Panel TYPE D5 D5 D1 HEIGHT 2450 2450 2450 WIDTH 1800 1800 900 IS EXTERNAL 1 0 0 THICKNESS 0.148 0.148 0.148 D3 2450 762 1 0.148 HEIGHT LENGTH AREA VOLUME 12.467 12.467 12.467 12.467 12.467 12.467 12.467 12.467 12.467 12.467 3.675 30.266 14.154 30.338 20.643 11.110 15.092 9.396 8.904 23.294 42.683 378.288 168.029 385.755 261.124 130.644 136.921 56.711 81.410 270.315 31.928 282.927 104.742 232.870 171.341 78.866 82.656 34.236 49.145 266.058 B WALL table WALL ID TYPE LOAD BEARING 732 733 734 735 736 737 738 739 740 741 126187 126187 126998 127087 128063 127087 127087 127087 127087 125800 0 0 0 0 0 0 0 0 0 1 IS EXTERNAL 0 0 0 0 0 0 1 0 0 1 B CONTAINS table DOOR ID 62 63 64 65 WALL ID 738 739 740 741 B 
B_LAYER table

LAYER_ID  CATEGORY  SUBCATEGORY             TYPE               THICKNESS
1         Concrete  Cast-in-Place Concrete  NULL               300
2         Finishes  Interior                Gypsum Wall Board  16
3         Metal     NULL                    Furring            22
4         Masonry   Concrete Blocks         NULL               190
5         Metal     Stud Layer              NULL               152

B_HAS_LAYER table

WALL_ID  LAYER_ID  LAYER_POSITION
732      2         1
732      3         2
732      4         3
733      2         1
733      3         2
733      4         3
734      4         1
735      2         1
735      2         3
735      5         2
736      2         1
736      2         2
736      2         4
736      5         3
737      2         1
737      2         3
737      5         2
738      2         1
738      2         3
738      5         2
739      2         1
739      2         3
739      5         2
740      2         1
740      2         3
740      5         2
741      1         1

C_ITEM table

ITEM_ID   CATEGORY         DESCRIPTION                               UNIT  UNIT_PRICE
22240040  Sitework         Gravel Fill                               CY    40.00
31100011  Concrete         Round Column Forms                        EA    120.75
31100012  Concrete         Square Column Forms                       SF    2.03
31100030  Concrete         Form Wall                                 SF    2.85
31100040  Concrete         Suspended Slab Forms                      SF    5.75
31100051  Concrete         Ground Slab Edge Forms                    LF    0.84
33100010  Concrete         Pump and Place Concrete - 35MPa           CY    19.00
42100020  Masonry          Brick Veneer                              SF    35.00
42200030  Masonry          Concrete Block 8                          SF    19.00
42200040  Masonry          Concrete Block 12                         SF    25.00
81100010  Doors & Windows  Hollow Steel - Single                     EA    1500.00
81100020  Doors & Windows  Hollow Steel - Double                     EA    1500.00
82500010  Doors & Windows  Hollow Core Door - Single                 EA    1000.00
82500011  Doors & Windows  Hollow Core Door - Single w/Vision Panel  EA    1200.00
82500030  Doors & Windows  Solid Wood Door - Single                  EA    1800.00
82500040  Doors & Windows  Solid Wood Door - Double                  EA    2500.00
84500010  Doors & Windows  Glazed Aluminum Entrance Doors Double     EA    2000.00
84500020  Doors & Windows  Glazed Aluminum Entrance Doors Single     EA    1000.00
84600010  Doors & Windows  Automatic Entrance Doors                  EA    3500.00
92500100  Finishes         Drywall Ceiling/Metal Studs Standard      SF    9.00
92500200  Finishes         Drywall Ceiling - Bulk Heads              SFCA  12.00
92500390  Finishes         Metal Stud Partition Wall - Single Layer  SF    9.00
92500391  Finishes         Metal Stud Partition Wall - Double Layer  SF    9.00
92500400  Finishes         Metal Stud Partition Wall - Furred        SF    9.00
93000070  Finishes         Ceramic Tile                              SF    8.00
93000050  Finishes         Slate Tile                                SF    12.00
96650030  Finishes         Linoleum Flooring                         SF    8.00
96800020  Finishes         Carpeting                                 SF    6.00
99100010  Finishes         Exterior Paints                           SF    0.60
99200220  Finishes         Walls - Dry Wall                          SF    0.75
99200520  Finishes         Paint Doors                               EA    75.00

C_BUILDING_EST table

ITEM_ID   QTY
31100011  1
31100012  258
31100030  540.630
31100040  1374
31100051  149
33100010  3
42200030  589
81100020  2
82500010  1
82500011  1
92500390  791
92500391  261
92500400  421
93000070  199
99200220  2525
99200520  1

Appendix C

Introductory Script for Usability Study

C.1 About this Study

This study evaluates a data coordination tool. Data coordination is the problem of updating a contingent database C in response to changes to an independent but related database B. You will be presented with a hypothetical scenario in which you are the administrator of a cost estimate database C, and must update this database as a result of changes to a building database B. You will use an application named SuperCoordinator to assist you in this task. You will be presented with six independent update cases (to B), and in each case must determine the corresponding update(s) to C.

C.2 Introduction: Cost Estimating at Foo Construction

Imagine that you are a cost estimator working for Foo Construction, a company which builds large commercial and residential buildings. You are currently engaged in a project with an architecture firm, Bar Architects, who have designed a building to be built by Foo Construction. A floor plan for this building is shown in Figure C.1. The building has a few rooms, columns, doors and walls placed on a concrete floor slab, and is enclosed from above by a second ceiling slab. The walls are of various types: concrete, cinderblock (also known as concrete masonry units), and drywall (which is composed of layers of gypsum wall board attached to a central layer of studs). The building also has four square columns, one round column, and a number of single and double doors of different types.
Your responsibility as a cost estimator is to estimate the cost of constructing the building designed by Bar Architects. This is a challenging job, since Bar Architects periodically makes minor changes to the building design, which means that you need to periodically make changes to your estimate. Furthermore, Bar Architects generally aren’t concerned with how difficult it is for you to keep an accurate estimate; they don’t even tell you how the design changes, only when it changes.

Fortunately, Bar Architects gives you read-only access to a relational database which contains the building components and their measurements. You have already created a cost estimate for the existing version of the design. Your estimate is also stored in your own relational database. Your task is: given your understanding of the relationship between the building design and the cost estimate, detect when and how the design changes, and update your cost estimate database accordingly.

Figure C.1: A floor plan of the building designed by Bar Architects

Fortunately, Foo Construction has just acquired a fancy new “Data Coordination” tool, the SuperCoordinator. We will show you how to use it in a minute, but first you need to become acquainted with the two databases and their relationships.

C.3 The Building Database ‘B’

The schema of the building database (hereafter denoted as ‘B’) is shown in Figure C.2. Tables not relevant to the scope of this study are omitted. B has tables for two types of building components: walls and doors. The table B_CONTAINS records which walls contain which doors. Each wall is composed of a sequence of layers, and this structure is captured by the B_HAS_LAYER and B_LAYER relations.

Figure C.2: The schema for Bar Architects’ building database ‘B’

Figure C.3: The schema for Foo Construction’s cost estimate database ‘C’
Most of the attributes are self-explanatory; you may refer to the appendix for detailed documentation of this schema. The best way to get a feel for it is to examine some data.

TO-DO: The “select all B” script in MySQL Workbench selects all tuples in each table; run this script now and browse through the tables in B.

C.4 The Cost Estimate Database ‘C’

Unlike B, your cost estimate describes the different types of equipment, materials and labour necessary for constructing the building (Figure C.3). Foo Construction has a standard list of cost items, which are stored in the C_ITEM table. These are re-used from project to project, and this table seldom changes. Each item represents either a piece of equipment, a building material, or some labour item. Items are organized into an industry-standard set of categories (e.g. “Concrete”, “Masonry”). Your estimate, stored in the C_BUILDING_EST table, consists of a tally of the quantities of items necessary to build the building.

TO-DO: Run the “select all C” script in MySQL Workbench, and examine the contents of these two tables. A detailed description of C is also given in the appendix.

When Bar Architects changes the building design, usually this means changing the quantity (QTY) of one or more items in C_BUILDING_EST. Sometimes it also involves adding items to or deleting items from C_BUILDING_EST. More rarely, a building requires some specialty item that doesn’t exist in the C_ITEM table, and Foo Construction must invent a new item and code for it.

C.5 Familiarization Tasks

To become better acquainted with B and C, we’ll walk you through a few queries. Please run these queries in MySQL Workbench as you read along.

C.5.1 Layer Dependencies

One of the site planners has asked if there is a way of easily grouping or summarizing walls by the types of layers they have.
You have created the following query, which shows each wall ID and wall type, along with the layers for that wall, in order of layer position. Run this query now.

USE building_small;
SELECT B_WALL.WALL_ID, B_WALL.TYPE AS WALL_TYPE,
       B_LAYER.CATEGORY, B_LAYER.SUBCATEGORY, B_LAYER.TYPE
FROM B_WALL, B_HAS_LAYER, B_LAYER
WHERE B_WALL.WALL_ID = B_HAS_LAYER.WALL_ID
  AND B_HAS_LAYER.LAYER_ID = B_LAYER.LAYER_ID
ORDER BY B_WALL.WALL_ID, B_HAS_LAYER.LAYER_POSITION;

You notice that the layers are the same for each wall instance of a given type. You then modify this query (by replacing the ORDER BY clause with a GROUP BY clause) to return only one instance of each wall type, and give the resulting table to the site planner for easily identifying the different types of walls. Run this query now.

USE building_small;
SELECT B_WALL.TYPE AS WALL_TYPE,
       B_LAYER.CATEGORY, B_LAYER.SUBCATEGORY, B_LAYER.TYPE
FROM B_WALL, B_HAS_LAYER, B_LAYER
WHERE B_WALL.WALL_ID = B_HAS_LAYER.WALL_ID
  AND B_HAS_LAYER.LAYER_ID = B_LAYER.LAYER_ID
GROUP BY WALL_TYPE, B_HAS_LAYER.LAYER_POSITION;

C.5.2 Project Costs

We now switch our attention to the cost estimate. Usually, the project’s client (i.e. the institution or individual who owns the land and is purchasing the building) wants to analyze C to gain some insight into the various costs of the building. The following query calculates the total cost of your estimate for the building project by multiplying each item’s quantity (QTY) by its unit price. Run this query now.

USE estimate_small;
SELECT SUM(UNIT_PRICE*QTY) AS TOTAL_COST
FROM C_ITEM, C_BUILDING_EST
WHERE C_ITEM.ITEM_ID = C_BUILDING_EST.ITEM_ID;

Sometimes they instead want to know the cost breakdown across the various item categories, so you produce the following aggregate query. Run this query now.
USE estimate_small;
SELECT C_ITEM.CATEGORY, SUM(UNIT_PRICE*QTY) AS TOTAL_COST
FROM C_ITEM, C_BUILDING_EST
WHERE C_ITEM.ITEM_ID = C_BUILDING_EST.ITEM_ID
GROUP BY C_ITEM.CATEGORY;

C.6 B-C Mappings

Since you have access to both B and C, you have found it convenient to query them in order to ensure your cost estimate is up to date. In fact, you’ve gone one step further by creating mappings that can tell you whether or not C is consistent with the current state of B. Each mapping is a pair of queries, one on B and one on C, that encodes some of your knowledge about how B and C are related. A mapping is satisfied if the result of the query on B is the same as the result of the query on C. You have created two mappings, described in the following subsections. For the purpose of this study, your scope is limited to these two mappings.

C.6.1 Mapping 1: Concrete Blocks

Several of the building’s walls are made of concrete blocks (a.k.a. cinderblocks), which are large bricks made out of concrete. One of your estimate items (with ITEM_ID = 42200030 and DESCRIPTION = “Concrete Block 8”) is the total material needed for these walls, measured in square feet. The QTY of this item in your C_BUILDING_EST table should equal the total surface area of the walls in the building made of concrete blocks. You have created a query on B which returns the total surface area of these walls:

USE building_small;
SELECT SUM(AREA)
FROM B_WALL, B_HAS_LAYER, B_LAYER
WHERE B_WALL.WALL_ID = B_HAS_LAYER.WALL_ID
  AND B_HAS_LAYER.LAYER_ID = B_LAYER.LAYER_ID
  AND B_LAYER.CATEGORY = "Masonry"
  AND B_LAYER.SUBCATEGORY = "Concrete Blocks";

as well as a query on C which returns the QTY of the corresponding item:

USE estimate_small;
SELECT QTY FROM C_BUILDING_EST WHERE ITEM_ID = 42200030;

Run the queries “mapping1 b.sql” and “mapping1 c.sql” in MySQL Workbench. These should both yield the same result: a quantity of 589. Excellent, your concrete block quantities are coordinated!
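The satisfaction check itself is mechanical: run both queries and compare results. The following is a minimal sketch (not the study software) of that idea using Python's sqlite3 module; the schema fragments mirror mapping 1, but the areas are small stand-in values rather than the real B_WALL data.

```python
import sqlite3

# Build a tiny in-memory stand-in for the B and C databases.
db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE B_WALL (WALL_ID INTEGER, AREA REAL);
  CREATE TABLE B_HAS_LAYER (WALL_ID INTEGER, LAYER_ID INTEGER, LAYER_POSITION INTEGER);
  CREATE TABLE B_LAYER (LAYER_ID INTEGER, CATEGORY TEXT, SUBCATEGORY TEXT);
  CREATE TABLE C_BUILDING_EST (ITEM_ID INTEGER, QTY REAL);
  INSERT INTO B_WALL VALUES (732, 42.5), (733, 378.5), (734, 168.0);
  INSERT INTO B_HAS_LAYER VALUES (732, 4, 3), (733, 4, 3), (734, 4, 1);
  INSERT INTO B_LAYER VALUES (4, 'Masonry', 'Concrete Blocks');
  INSERT INTO C_BUILDING_EST VALUES (42200030, 589.0);
""")

# The two sides of the mapping: total concrete-block wall area in B,
# and the QTY of the corresponding item in C.
Q_B = """SELECT SUM(AREA) FROM B_WALL, B_HAS_LAYER, B_LAYER
         WHERE B_WALL.WALL_ID = B_HAS_LAYER.WALL_ID
           AND B_HAS_LAYER.LAYER_ID = B_LAYER.LAYER_ID
           AND B_LAYER.CATEGORY = 'Masonry'
           AND B_LAYER.SUBCATEGORY = 'Concrete Blocks'"""
Q_C = "SELECT QTY FROM C_BUILDING_EST WHERE ITEM_ID = 42200030"

def mapping_satisfied(conn, q_b, q_c):
    # Tuple order is irrelevant, so compare sorted result lists.
    return sorted(conn.execute(q_b).fetchall()) == sorted(conn.execute(q_c).fetchall())

print(mapping_satisfied(db, Q_B, Q_C))  # True: both sides yield 589.0
```

If a wall's area now changes in B while C stays fixed, the same check reports the mapping as violated, which is exactly the condition SuperCoordinator watches for.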
C.6.2 Mapping 2: Doors

The final mapping for which you are responsible concerns the quantities of the various types of doors. This mapping differs from the previous one in that it relates sets of values. In particular, names of doors in B correspond exactly to names of doors in C (e.g. “Hollow Steel - Double”). However, B contains a description of each and every door, while your cost estimate only counts the number of doors of each distinct type. Your mapping for doors consists of an aggregate query on B:

USE building_small;
SELECT NAME, COUNT(NAME)
FROM B_DOOR
GROUP BY TYPE;

and a query on C which selects items whose category is “Doors & Windows”:

USE estimate_small;
SELECT DESCRIPTION, QTY
FROM C_ITEM, C_BUILDING_EST
WHERE C_BUILDING_EST.ITEM_ID = C_ITEM.ITEM_ID
  AND C_ITEM.CATEGORY = "Doors & Windows";

Run the queries “mapping2 b.sql” and “mapping2 c.sql” in MySQL Workbench. They should yield the result

Hollow Core Door - Single                 1
Hollow Core Door - Single w/Vision Panel  1
Hollow Steel - Double                     2

The ordering of tuples doesn’t matter in this case; only that the corresponding door names have identical counts.

C.7 SuperCoordinator: A Data Coordination System

As described in the introduction, your job is to keep the estimate you have created up to date with the latest changes to the building design: i.e. to coordinate your estimate with the building design. This means ensuring that the results are the same for each pair of queries in each of the two mappings. Fortunately for you, Foo Construction licensed some nifty new software (SuperCoordinator) designed and written by some brilliant researchers at UBC, and you get to use this software to help you coordinate your database. SuperCoordinator is programmed with your two mappings, and it monitors the building database for changes. Each time B changes, it takes these changes and uses the mappings to calculate how C should be updated.
The problem is that mapping 2 does not give SuperCoordinator enough information to update the estimate all on its own; SuperCoordinator needs you to use your knowledge of what the mappings mean in order to decide exactly how the estimate should be updated. First we give a brief introduction, then we’ll walk you through a few example coordination tasks.

C.7.1 Overview

SuperCoordinator has three main screens: one showing the building tables, one showing the estimate tables, and one showing the current state of the mappings. Take a look at SuperCoordinator now. You should see something like the following.

The SuperCoordinator building tables screen

The building tables screen is shown by default after launching SuperCoordinator. It simply shows the current state of the building database, with each table displayed in a separate tab. Move to the mapped data screen by selecting “View → Show Mapped Data”. There is a tab for each mapping, showing the B and C queries and their results.

The SuperCoordinator mapped data screen

Now, select “View → Show Estimate Tables” to show the estimate tables screen. In this case, the C_BUILDING_EST and C_ITEM tables are shown side by side. Below the two tables are three tabs, which is where SuperCoordinator shows you its guesses as to how the estimate should be updated. More on this in a moment.

The SuperCoordinator estimate tables screen

C.7.2 Example Coordination Tasks

In this study you need to complete six coordination tasks. Your goal in each case is to find the correct set of updates (insertions and deletions) to/from C, so that each pair of queries in a mapping returns the same result. First, we will demonstrate this on two example tasks.

Task 1: Wall Areas

First, navigate back to the building tables screen by selecting “View → Show Building Tables”. To proceed to the next task, select “Coordinate → Check for building updates” (click yes to confirm).
When there are updates, SuperCoordinator does the following:

• It shows how the building database has changed.
• It shows how the mappings have changed.
• It suggests updates to the cost estimate tables.

In the building tables screen, the table which has been updated is signaled. You should now see that the B_WALL table has changed. Click on the B_WALL table to reveal the following update.

Deleted tuples are highlighted in orange; inserted tuples are highlighted in green. Unfortunately, SuperCoordinator does not know about updated tuples, so these are represented as an insertion and a deletion. In this example, we can see that there is an inserted wall with WALL_ID=734, and a deleted wall with the same WALL_ID. If we look carefully at the attributes, we can see that this update represents a change in this wall’s LENGTH and AREA, from 14.154 to 10.154 and from 168.029 to 115.62 respectively.

The mappings are updated, too. Select “View → Show mapped data”, and select the tab for mapping 1. You should see the following:

showing that the single value calculated by mapping 1 has changed from 589 to 536.591. Your job as the cost estimator is to update your estimate so that the query on the right-hand side also returns 536.591. Looking at the SQL query on the right-hand side of the mapping, we are reminded that this mapping gets the QTY of the item in C_BUILDING_EST whose ITEM_ID is 42200030.

Select “View → Show Estimate Tables”, and you should see that SuperCoordinator has already made suggestions on how you should update your estimate, shown below. What has happened is that SuperCoordinator has determined some changes to the C_BUILDING_EST table, and is showing its suggested changes to you. It is showing you that there should be a tuple (ITEM_ID=42200030, QTY=536.591) inserted into the C_BUILDING_EST table, and that the tuple (ITEM_ID=42200030, QTY=589.0) should be deleted.
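The insert/delete representation of updates falls directly out of comparing table snapshots as sets. The following is a hypothetical sketch of that differencing step (illustrative only; the tuple shown is wall 734 before and after the change):

```python
# Compare a stored snapshot of B_WALL against its current state.
# Tuples present only in the new state are insertions (highlighted in
# green); tuples present only in the old state are deletions (orange).
# An in-place update therefore appears as one insertion plus one deletion.
old_b_wall = {
    (734, 126998, 0, 0, 12.467, 14.154, 168.029, 104.742),  # before the change
}
new_b_wall = {
    (734, 126998, 0, 0, 12.467, 10.154, 115.620, 65.500),   # after the change
}

inserted = new_b_wall - old_b_wall
deleted = old_b_wall - new_b_wall
print(len(inserted), len(deleted))  # 1 1 -- one update shows as insert + delete
```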
SuperCoordinator has inferred these updates based on the changes to mapping 1 and the structure of the query “mapping1 c.sql”. Now select the “Deletions” tab. This shows the possible deletions which should be performed so that the value 589.0 no longer appears in the result of the query on the right-hand side of mapping 1. You should see the following screen with one option, which is selected.

In this example, there is only one possible deletion from C. In general, the “Deletions” tab will show you all possible sets of deleted tuples. Each deletion set is described as a comma-separated list in the form TABLE_NAME(ITEM_ID). Each deletion set is shown on a different row, and your job is to select one of them. They are ranked, so that the deletion set which SuperCoordinator thinks is the correct solution is shown first. The first deletion set is automatically selected, but you may choose another set if you think it is appropriate.

Task 2: New Door

Now you are ready to move on to the next update task. Select “Coordinate → Check for building updates”, which will cause your selected updates to be logged and the next update case to be triggered. Before applying each update case, both B and C are reset to their original state, so that the updates from previous cases do not persist.

Select “View → Show building tables”. You should see that two tuples have been inserted, one into B_DOOR and one into B_CONTAINS, as shown below. We see that a new door with the NAME “Rolling Garage Door” has been inserted, and that this door is contained inside the wall with WALL_ID=741.

Now select “View → Show mapped data”, and select mapping 2. You should see the following: a new tuple “Rolling Garage Door” with count 1.
Now select “View → Show estimate tables”; you should see an inserted tuple in each table of C, as shown below. Because the query on C for mapping 2 joins the C_ITEM and C_BUILDING_EST tables, the insertion into mapping 2 is translated by SuperCoordinator into insertions into both of these tables. SuperCoordinator uses variables X_1, X_2, and X_3 in place of unknown values for the attributes ITEM_ID, UNIT, and UNIT_PRICE. It is your job to provide values for each of these variables.

If there had been an existing tuple with the description “Rolling Garage Door” in C_ITEM, SuperCoordinator would have suggested its ITEM_ID, UNIT, and UNIT_PRICE values for X_1, X_2, and X_3. In this case, there would be no actual insertion into C_ITEM, since the “inserted” tuple would already exist.

Select the “Variables” tab; you should see the following:

Double-click the cell next to X_1, and enter a new ITEM_ID, for example 85000010. You will also need to give values for this item’s UNIT and UNIT_PRICE. Since other doors are measured by the each (“EA”), this is probably the correct value for X_2 (otherwise, what would the quantity of 1.0 mean?). Since rolling garage doors are more expensive than automatic entrance doors, you can use the value 4500.00 for X_3. The exact value doesn’t really matter for the purpose of this study; you just need to use some value.

There is one caveat, though: the values you enter must satisfy the constraint shown in the “Constraint” tab, which is a logical expression using Java/C syntax. For example, the constraint may contain

(X_1 != 22240040) && ...

along with a number of other clauses. The reason these are here is that if the value you choose for X_1 violates any of these clauses, then the mappings will still not be satisfied, because there would be extra tuples in the results of the queries on the estimate.
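The constraint amounts to requiring that the chosen ITEM_ID not collide with any item already present. A small illustrative sketch (the ID list below is abbreviated, not the full C_ITEM contents):

```python
# The "Constraint" tab holds a conjunction of inequality clauses over the
# unknown variable: X_1 must differ from every existing ITEM_ID, so that
# no unintended tuples appear in the estimate-side query results.
existing_item_ids = {22240040, 31100011, 42200030, 81100020, 84600010}  # abbreviated

def satisfies_constraint(x_1):
    # Equivalent to (X_1 != id_1) && (X_1 != id_2) && ... in Java/C syntax.
    return all(x_1 != item_id for item_id in existing_item_ids)

print(satisfies_constraint(85000010))  # True: a fresh ITEM_ID is admissible
print(satisfies_constraint(22240040))  # False: collides with an existing item
```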
So for example, if you tried to use the value X_1=22240040, that would imply your estimate contains 2.0 cubic yards of Gravel Fill, which is totally wrong. The values suggested by SuperCoordinator always satisfy the constraint, so you don’t need to check them. Since there are no deletions for this update case, you’re all finished!

C.8 Completing This Study

You must complete an additional six update cases on your own. After each case, both B and C are reset to their initial state. Your goal is to complete each update case as quickly as possible, while keeping your estimate as accurate as possible.

But there is one more problem... SuperCoordinator is kind of buggy. Perhaps the brilliant researchers who wrote it wasted too much time reading web-comics and watching funny cat videos on YouTube, and finished the program in a last-minute rush. As a result, SuperCoordinator only sometimes works to its full capacity. At the top of the window, you will see an “Assistance Level” indicator, which will read NONE, LOW, or HIGH.

When the assistance level is HIGH, SuperCoordinator works as described above, showing you the possible updates, suggesting values for the unknown variables (if possible), ranking the deletions, and selecting the first deletion automatically.

When the assistance level is LOW, SuperCoordinator will fail to highlight the changes to the building database and to the mapped data. It will still show you the current building database and the current mapped data, but you will be on your own in figuring out what exactly has caused the changes. SuperCoordinator will also not display any suggested variable values, nor will it rank the suggested deletions (they will be in some arbitrary order), nor will it choose the first deletion for you.

When the assistance level is NONE, SuperCoordinator will not show you any possible insertions or deletions. In fact, it won’t even show you the mapped data!
As with an assistance level of LOW, it will still show you the current building database, but you will be completely on your own in updating the cost estimate. Since you cannot select insertions or deletions, you will need to manually write SQL update statements. There will be a menu item “Coordinate → Edit Cost Estimate Updates”, as shown below.

Clicking on this item will pop open a text editor, where you can manually enter the updates you think should be applied to the cost estimate. Below is a screenshot showing the result of manually entering some updates for Example Task 1 (Wall Areas).

If you close this text-editor window, the updates you have entered are saved but not applied. You may re-open the window at any time by selecting “Coordinate → Edit Cost Estimate Updates”, and may do this as many times as you like. Your updates are not made final until you select “Coordinate → Check for building updates” and move on to the next task. Don’t worry if your SQL syntax is not perfect; these updates won’t actually be applied to the cost estimate.

In each case, you are still free to use MySQL Workbench to look at the data and run any queries which help you in your tasks. MySQL Workbench will also show the most up-to-date building database after each update. In summary, the following workflow diagram shows the process you should use to determine how to update C in each case.

Now click “Coordinate → Check for building updates” to get started, and good luck!

C.9 Appendix

C.9.1 Building database ‘B’ documentation

• B_DOOR — Doors.
  – DOOR_ID — A unique ID for each door.
  – NAME — A description of the type of door.
  – TYPE — A code for the type of door.
  – HEIGHT — The door height, in mm.
  – WIDTH — The door width, in mm.
  – IS_EXTERNAL — A boolean value indicating whether or not the door is external.
  – THICKNESS — The door’s thickness, in feet.

• B_CONTAINS — Describes which doors are in which walls.
Each door is contained by exactly one wall, but a wall may contain any number of doors.
  – DOOR_ID — ID of the contained door.
  – WALL_ID — ID of the wall containing the door.

• B_WALL — Walls.
  – WALL_ID — A unique ID for each wall.
  – TYPE — An integer identifying the type of wall.
  – LOAD_BEARING — Boolean value indicating whether a wall supports something above it.
  – IS_EXTERNAL — Boolean value indicating whether a wall is on the exterior surface of the building.
  – HEIGHT — The height of the wall, in feet.
  – LENGTH — The wall’s length, in feet.
  – AREA — The wall’s area, in square feet, excluding any openings in the wall for windows, doors, etc.
  – VOLUME — The wall’s volume, in cubic feet, excluding any openings in the wall for windows, doors, etc.

• B_HAS_LAYER — Each wall is composed of one or more layers. This table stores the wall’s layers, in order.
  – WALL_ID — The wall containing the layer.
  – LAYER_ID — Layer contained in the wall.
  – LAYER_POSITION — Integer indicating which position this layer occupies relative to the other layers in the same wall.

• B_LAYER — Describes layers (i.e. building materials).
  – LAYER_ID — A unique ID for the layer.
  – CATEGORY, SUBCATEGORY, TYPE — Define a three-level hierarchy for the layer. CATEGORY is the name of the type at the highest level, and TYPE is the name of the type at the lowest level. SUBCATEGORY may be NULL, indicating that the TYPE is a direct child of the CATEGORY. TYPE may be NULL, indicating that CATEGORY (or SUBCATEGORY) is the most detailed information available.
  – THICKNESS — The layer thickness, in mm.

C.9.2 Cost estimate database ‘C’ documentation

• C_ITEM — Foo Construction’s standard costs for various labour, equipment, and material items. Its items are re-used from project to project.
  – CSI — A level-two or level-three CSI code corresponding to the best classification of this item.
  – CODE — A two-digit item code used to identify different items within the same CSI section.
These codes are created by Foo Construction and do not correspond to any specific industry standard.
  – DESCRIPTION — A description of the item (labour, equipment or materials).
  – UNIT — The unit of measure for the item. An upper-case acronym which is one of: SF (square feet), CY (cubic yards), EA (each), GAL (gallons), GSF (ground square feet), LD (loads), LF (lineal feet), SFCA (square feet of contact area).
  – UNIT_PRICE — The cost per unit.

• C_BUILDING_EST — The table which contains the specific items in your estimate for the building designed by Bar Architects.
  – CSI — The CSI code of the item.
  – CODE — The code of the item. Together, CSI and CODE form a foreign key reference into the C_ITEM_RATES table.
  – QTY — The quantity of this item required for the building project.

Appendix D

Usability Study Update Cases

The following update cases were presented in order to each subject in the usability study.

Warmup Case 1: Wall Area Change

Changes to B
  B_WALL+ (734, 126998, 0, 0, 12.467, 10.154, 115.620, 65.500)
  B_WALL− (734, 126998, 0, 0, 12.467, 14.154, 168.029, 104.742)

View differences
  Q+1,C (536.591)
  Q−1,C (589.000)

Results of update translation
  C_BUILDING_EST+ (42200030, 536.591)
  C_BUILDING_EST− (42200030, 589.000)

Warmup Case 2: “Rolling Garage Door” Added

Changes to B
  B_DOOR+ (75, “Rolling Garage Door”, “D10”, 2700, 2400, 1, 0.225)

View differences
  Q+2,C (“Rolling Garage Door”, 1)

Results of update translation
  C_BUILDING_EST+ (X1, 1.0)
  C_ITEM+ (X1, “Doors & Windows”, “Rolling Garage Door”, X2, X3)

Constraint
  (X_1 != 22240040) && (X_1 != 31100011) && (X_1 != 31100012) && (X_1 != 31100030) &&
  (X_1 != 31100040) && (X_1 != 31100051) && (X_1 != 33100010) && (X_1 != 42100020) &&
  (X_1 != 42200030) && (X_1 != 42200040) && (X_1 != 81100010) && (X_1 != 81100020) &&
  (X_1 != 82500010) && (X_1 != 82500011) && (X_1 != 82500030) && (X_1 != 82500040) &&
  (X_1 != 84500010) && (X_1 != 84500020) && (X_1 != 84600010) && (X_1 != 92500100) &&
  (X_1 != 92500200) && (X_1 != 92500390) && (X_1 != 92500391) && (X_1 != 92500400) &&
  (X_1 != 93000070) && (X_1 != 93000050) && (X_1 != 96650030) && (X_1 != 96800020) &&
  (X_1 != 99100010) && (X_1 != 99200220) && (X_1 != 99200520)

No suggested values are given, since there is no existing tuple in C_ITEM with the description “Rolling Garage Door”.

Case 1: Wall Area Change

Changes to B
  B_WALL+ (733, 126187, 0, 0, 10.6, 30.266, 310.225, 235.422)
  B_WALL− (733, 126187, 0, 0, 12.467, 30.266, 378.288, 282.927)

View differences
  Q+1,C (520.937)
  Q−1,C (589.000)

Results of update translation
  C_BUILDING_EST+ (42200030, 520.937)
  C_BUILDING_EST− (42200030, 589.000)

Case 2: “Solid Wood Door” Added

Changes to B
  B_DOOR+ (100, “Solid Wood Door - Single”, “D7”, 2450, 900, 0, 0.148)

View differences
  Q+2,C (“Solid Wood Door - Single”, 1)

Results of update translation
  C_BUILDING_EST+ (X1, 1.0)
  C_ITEM+ (X1, “Doors & Windows”, “Solid Wood Door - Single”, X2, X3)

Suggested values for HIGH assistance level
  X1 = 82500030
  X2 = EA
  X3 = 1800.0

Constraint
  (X_1 != 22240040) && (X_1 != 31100011) && (X_1 != 31100012) && (X_1 != 31100030) &&
  (X_1 != 31100040) && (X_1 != 31100051) && (X_1 != 33100010) && (X_1 != 42100020) &&
  (X_1 != 42200030) && (X_1 != 42200040) && (X_1 != 81100010) && (X_1 != 81100020) &&
  (X_1 != 82500010) && (X_1 != 82500011) && (X_1 != 82500040) && (X_1 != 84500010) &&
  (X_1 != 84500020) && (X_1 != 84600010) && (X_1 != 92500100) && (X_1 != 92500200) &&
  (X_1 != 92500390) && (X_1 != 92500391) && (X_1 != 92500400) && (X_1 != 93000070) &&
  (X_1 != 93000050) && (X_1 != 96650030) && (X_1 != 96800020) && (X_1 != 99100010) &&
  (X_1 != 99200220) && (X_1 != 99200520)

Case 3: Wall Material Changed From Cast Concrete to Concrete Blocks

Changes to B
  B_HAS_LAYER+ (741, 4, 1)
  B_HAS_LAYER− (741, 1, 1)

View differences
  Q+1,C (859.315)
  Q−1,C (589.000)

Results of update translation
  C_BUILDING_EST+ (42200030, 859.315)
  C_BUILDING_EST− (42200030, 589.000)

Case 4: Type of Two Doors Changed
Changes to B
  B_DOOR+ (62, “Solid Oak (Double)”, “D8”, 2450, 1800, 1, 0.148)
  B_DOOR+ (63, “Solid Oak (Double)”, “D8”, 2450, 1800, 0, 0.148)
  B_DOOR− (62, “Hollow Steel - Double”, “D5”, 2450, 1800, 1, 0.148)
  B_DOOR− (63, “Hollow Steel - Double”, “D5”, 2450, 1800, 0, 0.148)

View differences
  Q+2,C (“Solid Oak (Double)”, 2)
  Q−2,C (“Hollow Steel - Double”, 2)

Results of update translation
  C_BUILDING_EST+ (X1, 2.0)
  C_ITEM+ (X1, “Doors & Windows”, “Solid Oak (Double)”, X2, X3)
  C_BUILDING_EST− (81100020, 2.0) ∨
  C_ITEM− (81100020, “Doors & Windows”, “Hollow Steel - Double”, EA, 1500.00)

No suggested values, due to the lack of an existing tuple in C_ITEM with the description “Solid Oak (Double)”. For the HIGH assistance level, the deletion from C_BUILDING_EST is ranked before the deletion from C_ITEM.

Constraint
  (X_1 != 22240040) && (X_1 != 31100011) && (X_1 != 31100012) && (X_1 != 31100030) &&
  (X_1 != 31100040) && (X_1 != 31100051) && (X_1 != 33100010) && (X_1 != 42100020) &&
  (X_1 != 42200030) && (X_1 != 42200040) && (X_1 != 81100010) && (X_1 != 81100020) &&
  (X_1 != 82500010) && (X_1 != 82500011) && (X_1 != 82500030) && (X_1 != 82500040) &&
  (X_1 != 84500010) && (X_1 != 84500020) && (X_1 != 84600010) && (X_1 != 92500100) &&
  (X_1 != 92500200) && (X_1 != 92500390) && (X_1 != 92500391) && (X_1 != 92500400) &&
  (X_1 != 93000070) && (X_1 != 93000050) && (X_1 != 96650030) && (X_1 != 96800020) &&
  (X_1 != 99100010) && (X_1 != 99200220) && (X_1 != 99200520)

Case 5: Width and Height of Two Doors Changed

Changes to B
  B_DOOR+ (62, “Hollow Steel - Double”, “D5”, 2550, 1700, 1, 0.148)
  B_DOOR+ (63, “Hollow Steel - Double”, “D5”, 2550, 1700, 0, 0.148)
  B_DOOR− (62, “Hollow Steel - Double”, “D5”, 2450, 1800, 1, 0.148)
  B_DOOR− (63, “Hollow Steel - Double”, “D5”, 2450, 1800, 0, 0.148)

There are no changes to either view, and thus no suggested updates.
Case 6: A Wall’s Dimensions Are Changed, a New Wall Is Added, and a Door Type Is Changed

Changes to B
B DOOR+ (65, “Hollow Core Door - Single”, “D1”, 2450, 900, 1, 0.148)
B DOOR− (65, “Hollow Core Door - Single w/Vision Panel”, “D3”, 2450, 762, 1, 0.148)
B WALL+ (732, 126187, 0, 0, 12.467, 6.224, 48.971, 35.298)
B WALL+ (742, 125800, 1, 0, 12.467, 10.15, 120.054, 116.231)
B WALL+ (738, 125800, 0, 1, 12.467, 15.092, 136.921, 82.656)
B WALL− (732, 126187, 0, 0, 12.467, 3.675, 42.683, 31.928)
B WALL− (738, 127087, 0, 1, 12.467, 15.092, 136.921, 82.656)
B HAS LAYER+ (738, 1, 1)
B HAS LAYER+ (742, 1, 1)
B HAS LAYER− (738, 2, 1)
B HAS LAYER− (738, 2, 3)
B HAS LAYER− (738, 5, 2)

View differences
Q+_{1,C} (595.288)
Q−_{1,C} (589.000)
Q+_{2,C} (“Hollow Core Door - Single”, 2)
Q−_{2,C} (“Hollow Core Door - Single”, 1)
Q−_{2,C} (“Hollow Core Door - Single w/Vision Panel”, 1)

Results of update translation
C BUILDING EST+ (42200030, 595.288)
C BUILDING EST+ (X1, 2)
C ITEM+ (X1, “Doors & Windows”, “Hollow Core Door - Single”, X2, X3)
C BUILDING EST− (42200030, 589) ∧
( (C BUILDING EST− (82500011, 1) ∧ C BUILDING EST− (82500010, 1)) ∨
  (C ITEM− (82500011, “Doors & Windows”, “Hollow Core Door - Single w/Vision Panel”, EA, 1200.00) ∧ C BUILDING EST− (82500010, 1)) ∨
  (C BUILDING EST− (82500011, 1) ∧ C ITEM− (82500010, “Doors & Windows”, “Hollow Core Door - Single”, EA, 1000.00)) ∨
  (C ITEM− (82500011, “Doors & Windows”, “Hollow Core Door - Single w/Vision Panel”, EA, 1200.00) ∧ C ITEM− (82500010, “Doors & Windows”, “Hollow Core Door - Single”, EA, 1000.00)) )

Suggested values for HIGH Assistance Level
X1 = 82500010
X2 = EA
X3 = 1000.0

Constraint
(X_1 != 22240040) && (X_1 != 31100011) && (X_1 != 31100012) && (X_1 != 31100040) &&
(X_1 != 31100051) && (X_1 != 33100010) && (X_1 != 42100020) && (X_1 != 42200030) &&
(X_1 != 42200040) && (X_1 != 81100010) && (X_1 != 81100020) && (X_1 != 82500010) &&
(X_1 != 82500011) && (X_1 != 82500030) && (X_1 != 82500040) && (X_1 != 84500010) &&
(X_1 != 84500020) && (X_1 != 84600010) && (X_1 != 92500100) && (X_1 != 92500200) &&
(X_1 != 92500390) && (X_1 != 92500391) && (X_1 != 92500400) && (X_1 != 93000070) &&
(X_1 != 93000050) && (X_1 != 96650030) && (X_1 != 96800020) && (X_1 != 99100010) &&
(X_1 != 99200220) && (X_1 != 99200520)
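The inequality constraints above restrict which item codes may be substituted for the labelled null X_1: every code already used as a C ITEM key is excluded, so a binding is admissible only if it differs from each listed value. A small Python sketch of this check, using the Case 6 constraint as the excluded set (the helper `satisfies` is hypothetical, not part of the thesis implementation):

```python
# Excluded values taken from the Case 6 constraint: the existing
# C ITEM keys that any binding for the labelled null X_1 must avoid.
EXCLUDED = {
    22240040, 31100011, 31100012, 31100040, 31100051, 33100010,
    42100020, 42200030, 42200040, 81100010, 81100020, 82500010,
    82500011, 82500030, 82500040, 84500010, 84500020, 84600010,
    92500100, 92500200, 92500390, 92500391, 92500400, 93000070,
    93000050, 96650030, 96800020, 99100010, 99200220, 99200520,
}

def satisfies(value, excluded=EXCLUDED):
    """True iff binding X_1 = value satisfies the conjunction of
    inequality constraints (X_1 != c) for every excluded code c."""
    return all(value != c for c in excluded)

satisfies(82500010)  # False: 82500010 is an existing C ITEM key
satisfies(82500050)  # True: a code not in the set is admissible
```

Note that the suggested value X1 = 82500010 above deliberately reuses an existing key (it overwrites the deleted item rather than inserting a fresh one), which is why the HIGH Assistance Level can rank it even though a fresh insertion would require a code outside the excluded set.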
Item Metadata
Title: Data coordination
Creator: Lawrence, Michael
Publisher: University of British Columbia
Date Issued: 2013
Genre: Thesis/Dissertation
Type: Text
Language: eng
Date Available: 2013-03-22
Provider: Vancouver : University of British Columbia Library
Rights: Attribution-ShareAlike 3.0 Unported
DOI: 10.14288/1.0052209
URI: http://hdl.handle.net/2429/44046
Degree: Doctor of Philosophy - PhD
Program: Computer Science
Affiliation: Science, Faculty of; Computer Science, Department of
Degree Grantor: University of British Columbia
Graduation Date: 2013-05
Campus: UBCV
Scholarly Level: Graduate
Rights URI: http://creativecommons.org/licenses/by-sa/3.0/
Aggregated Source Repository: DSpace