UBC Theses and Dissertations
Managing updates and transformations in data sharing systems (Thrastarson, Arni Mar, 2014)
Managing Updates and Transformations in Data Sharing Systems

by

Arni Mar Thrastarson

B.Sc. Computer Science, The University of Iceland, 2012
B.Sc. Software Engineering, The University of Iceland, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2014

© Arni Mar Thrastarson 2014

Abstract

Dealing with dirty data is an expensive and time-consuming task. Estimates suggest that up to 80% of the total cost of large data projects is spent on data cleaning alone. This work is often done manually by domain experts in data applications, working with data copies and limited database access. We propose a new system of update propagation to manage data cleaning transformations in such data sharing scenarios. By spreading the changes made by one user to all users working with the same data, we hope to reduce repeated manual labour and improve overall data quality. We describe a modular system design, drawing from different research areas of data management, and highlight system requirements and challenges for implementation. Our goal is not to achieve full synchronization, but to propagate updates that individual users consider valuable to their operation.

Preface

This dissertation is original, unpublished, independent work by the author, Arni Mar Thrastarson.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
2 Case Studies
  2.1 Scenario 1: Changes in Application Data
  2.2 Scenario 2: Changes in Source Data
  2.3 Scenario 3: Data Integration with Limited Storage
  2.4 Scenario 4: Unknown Transformations
3 System Overview
  3.1 General Overview
  3.2 System Components
    3.2.1 Data Sources
    3.2.2 Data Update Store
    3.2.3 User Data Definitions
    3.2.4 Data Files
    3.2.5 Data Integration
    3.2.6 Update Consolidation
    3.2.7 User Mappings
    3.2.8 Standard ETL Transformations
    3.2.9 Data Cleaning Changes
    3.2.10 Transformation Phase
    3.2.11 User Application
    3.2.12 Transformation Scripts
    3.2.13 Transformation Inference Engine
    3.2.14 Update Propagation to Sources
4 Analyzing Case Studies
  4.1 Scenario 1
    4.1.1 Retrieving and Transforming the Data
    4.1.2 Propagating Updates to the Source
  4.2 Scenario 2
  4.3 Scenario 3
  4.4 Scenario 4
5 Related Work
  5.1 Collaborative Data Sharing Systems
  5.2 Data Transformation Systems
  5.3 Data Provenance
  5.4 Data Exchange
  5.5 Probabilistic Databases
  5.6 Data Cleaning
  5.7 Extraction, Transformation, and Loading
  5.8 View Maintenance
6 Conclusions and Future Work
Bibliography

List of Tables

4.1 Sample data source.
4.2 Sample unprocessed application data.
4.3 Sample processed application data.
4.4 Sample data source after making updates.
4.5 Sample update buffer.
4.6 Updates presented to application user.

List of Figures

3.1 The general system architecture.
4.1 Scenario 1: Retrieving and transforming data from a source.
4.2 Scenario 1: Propagating updates to the source.
4.3 Scenario 2: Propagating new data and changes from the data source.
4.4 The data integration scenario.
Acknowledgements

Let no man glory in the greatness of his mind,
but rather keep watch o'er his wits.
Cautious and silent let him enter a dwelling;
to the heedful comes seldom harm,
for none can find a more faithful friend
than the wealth of mother wit.

He knows alone who has wandered wide,
and far has fared on the way,
what manner of mind a man doth own
who is wise of head and heart.

Wise in measure let each man be;
but let him not wax too wise;
for never the happiest of men is he
who knows much of many things.

(Snorri Sturluson, 13th-century Icelandic poet)

I am continually grateful for the kind guidance provided by my excellent supervisor, Dr. Rachel Pottinger. I thank her for making all the necessary pushes and shoves. My gratitude extends to Dr. Ed Knorr for taking the time to read this document. I continue to be amazed by the loving and unwavering support of my parents, both of whom still have little idea of what I actually do. A last shoutout goes out to my good friends Hafsteinn, Gisli Baldur and Siggi Logi, for their encouragement and fun times throughout this process.

Chapter 1
Introduction

As reliance on data increases in modern businesses, the need for appropriate and efficient data management solutions keeps growing. The ability to collect data at large scales, and a lowering cost of storage, are coupled with a growing awareness of the need to incorporate data as an essential part in management decisions.

"Garbage in, garbage out" is a well-known mantra about computer systems that applies in equal measure to data-dependent decision support systems, whose relevance will in large part be dictated by the quality of the input data.
When our business decisions rely on data in such a multi-dimensional way (past, present, and future), distorted analysis of dirty data can even eliminate the potential benefits of a data-driven approach [30].

Errors in data should be considered an inherent part of data management systems, and they can be introduced at various different phases of data collection efforts. Measurement errors due to miscalibrated sensors in scientific settings are a common source of errors, as well as human input errors when data has to be manually entered. Other common sources include distillation errors, where an analyst transforms or aggregates data in some incorrect way, and data integration errors when data from more than one source is merged, causing inconsistencies in the final outcome. Hellerstein [30] has described the "lifetime" of data as an iterative process, potentially lasting over large spans of time and space, that includes such tasks as collecting, transforming, storing, cleaning, and analyzing data. The complexity of scenarios where data quality can be compromised, often involving multiple organizations and human actors, foreshadows a difficult problem in practice.

Advances in data integration alone are deserving of a special mention in this regard. Many applications are now reliant on their ability to query data that resides in different heterogeneous sources. In addition, common scenarios in industry also require data from different sources to be merged and consolidated in company databases. Expecting data collection and storage to adhere to a designed initial schema, with little evolution over time, is unrealistic in modern enterprises, and this is liable to cause thorny data quality problems in applications that rely on derived or integrated versions of data. Often, the end result is the introduction of more noise to data than in traditional single-source systems [28].

We have seen how dirty data is a prevalent issue in most data management applications.
Dealing with these concerns, by first discovering and then correcting various quality issues, is an incredibly costly measure. Estimates suggest that up to 80% of the total development time and cost of large data projects is spent solely on data cleaning [21]. That is a huge proportion and one that can't be ignored, especially considering the growing dependence on data solutions across the spectrum. Data cleaning is now the biggest bottleneck in the data analysis pipeline, and due to the cost sensitivity of many projects, it is obvious that even slight improvements on state-of-the-art methods can be of significant financial benefit. Because of this, research interest in data cleaning has proliferated in recent years, followed by an increased offering of dedicated enterprise software.

Although methods for automatic anomaly detection exist, and are continually being improved, human judgment is still required to assess the result. The manual labour currently needed for data cleaning is therefore quite extensive. Not only does the effort require numerous man-hours, but in practice the people who are best suited to make judgment calls about data quality are in many cases either experts in the domain of the data, or highly trained analysts. The time and skill level needed to clean data manually will quickly incur large costs on a data project.

A motivating assumption we make is that some of the most valuable data cleaning is performed relatively late in the data analysis pipeline. This is when a domain expert starts working with data, already collected and stored in a data source, and must first examine it to correct errors. Note that in this case, the expert is not the same as the database administrator, who might not be as familiar with the domain, and whose main task is to maintain the data rather than use it. The users that actually do use the data are likely to only work with particular subsets of it at a time.
Moreover, in a common scenario, this subset is a data copy that the analyst receives and manipulates locally on his machine. In that case, all data cleaning efforts would be lost to the data source if the local data is simply discarded or ignored after use.

Given the cost and value of data cleaning operations, there is a large motivation to lessen the need for manual inspection of large data sets. To meet some of these challenges, the business intelligence community has recently called for new systems, with a specific control loop that supports write-back to data sources. Such a system could then be used to both correct and annotate the data at its source [48]. An important step in this direction would be an effort to limit wasted data cleaning efforts in data sharing systems where more than one user works on subsets or copies of the same data. This is the ultimate goal of the system we describe in this thesis. By constructing a framework to facilitate the sharing of data cleaning efforts between different users, we foresee a system of reusable updates that reduces repeated manual labour and promotes up-to-date data of higher overall quality.

A flurry of new ideas has been introduced in data management research in recent years. Since data management is a big business, these ideas are often implemented in prototypes, followed by applicable enterprise solutions. The operation of a typical relational database management system has already been extensively studied, so the focus in data management research has recently shifted to larger systems of multiple data sources, issues with data integration, systems for specialized applications, and data analytics. A common theme has also been the effort to combine important ideas into composite systems that serve a broader purpose.
An example would be the Trio system [56], which integrates probabilistic databases and provenance ideas. We think this trend will continue and eventually converge in a new general-purpose data management system. Exploring new systems that integrate different ideas and try to solve specific problems is an important step in gauging their usefulness and feasibility as parts of a larger general-purpose system.

Motivated by the growth in data-driven business solutions, the importance of data quality for their effectiveness, and the large cost and time-consuming work of ensuring that quality, we propose a new system of update propagation for common data sharing scenarios. In doing so, our goal is to utilize recent advances in important new research topics, in the spirit of the current drive to integrate them into new practical solutions.

In this thesis, we describe a new framework for update propagation in collaborative data sharing scenarios. These are scenarios where analysts and applications use data from any number of sources, either working with a locally stored copy, or by fetching data directly at run-time, in a typical data integration fashion. Updates are taken to be all changes made to the data by any user, and by their propagation the intention is to notify all relevant users of their existence, so they can conveniently choose which changes to apply to their own data copies. The goal is not total synchronization between users, but a heightened awareness of system-wide actions, to reduce manual labour and qualitatively improve overall data quality.

The key contributions of this thesis are the modular system design, and the identification of necessary system requirements and functions. We present a distributed system architecture, composed of numerous modules that draw from different topics in data management.
Among the functions performed by the different modules are the storage of all schema and data mappings that describe how different users' data is related, the logging of various actions and transformations, and the identification and propagation of relevant updates from one user to another. Together, the modules ensure that both application users and data sources can seamlessly use shared data, while progressively supporting individual data cleaning transformations and update propagations to improve overall quality in the system.

The remainder of the thesis is organized as follows. In Chapter 2, we discuss four motivating case studies, each highlighting specific system requirements. Chapter 3 gives an overview of the system architecture, before describing each module separately. Chapter 4 shows how our system operates in practice when faced with each of our previous use cases. An overview of related work is found in Chapter 5, and we conclude this thesis in Chapter 6, with a discussion on future work and important challenges.

Chapter 2
Case Studies

The ultimate goal of our system is for it to be a framework for update propagation in applications with shared data. These applications assume any number of original data sources and an arbitrary number of users, whose applications take data copies from at least one of the sources as input. The system we seek to describe must be robust enough to solve the known problems accurately, yet general enough to cover a broad field of common use cases.

In order to discover the system requirements, and to motivate various design choices, we investigate four key scenarios. The scenarios vary in their specific goals and assumptions, but the main motivation of the system is to ensure update propagation, while minimizing manual repetition and human errors in data manipulation.
An adequate solution is one where all relevant system users are made aware of changes to global data corresponding to their own, in some timely manner, and are given a choice to either accept or deny said changes. That is not to say that we want a system of full synchronization among users. By propagating the transformations made by other users of the same data, we want the decision to lie with each user to accept only updates they deem to be relevant and valuable to their data copy.

2.1 Scenario 1: Changes in Application Data

A common use case is one where a user checks out data from a database to use in his own application. It is a fair assumption in many instances that a third-party application owner does not have write privileges to the data sources in question. Furthermore, he should not be required to even have the technical know-how to query a database correctly. Instead, we will assume that the application owner will request a particular set of data through some interface, or be directly provided one by an administrator of the data source.

An example of this scenario is that of a business analyst who is tasked with writing a report on a company's recent performance. We can assume that our analyst is an expert in the domain of the company's business, and by extension the domain of its data. She is not necessarily a database administrator, and is not tasked with the primary storage of the company's data. To perform her analysis, our user needs some specific data, for example sales figures for the last quarter, and in this case is given a spreadsheet containing a copy of that data. The particular method of data exchange could be by means of any standard file format for structured data. The important thing is that the user receives her own copy of the requested data for local storage.

As with all data, it is safe to assume that it may contain errors.
The first task of our analyst is therefore to carefully examine the data for any inconsistencies that need fixing. We contend that few are better suited for this arduous and time-consuming task than the domain experts tasked with extracting value from the data. This process of data cleaning is interleaved with one of data wrangling, that is, systematically changing the structure of data to fit some purpose or input format for a later application.

We believe that any number of changes made by a user in this data transformation stage could be of great value to the administrator of the original data source, as well as to future users of the data. A system that facilitates the exchange of changes made to data between interested parties could potentially do a lot of good. The act of reporting news about updates from other parts of the system back to the source would save future users from having to spend time on repeating the work.

The first fundamental task of our system is to establish and maintain a history of transformations made by the user. The most intuitive way to do this would be to maintain a log of all actions made, in a transformation script. The design choice for how such a log should be transcribed is important. A log of tuple-level updates, similar to a database log, is not suitable since it does not capture the transformations at a high enough level. Since the user is expected to be working in a spreadsheet-like environment, it would also be less intuitive, not to mention verbose, for the log to be at the transaction level of a relational database.

Prior research on data transformation systems [38] has defined a declarative transformation language, complete with eight classes of transforms, that is specifically designed to deal with common data cleaning tasks and data wrangling operations.
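As a purely illustrative aside, the sketch below shows one possible shape for such a high-level transformation log. The operation names and row format are invented here, not taken from the eight transform classes of the cited language; the point is only that each log entry records a whole-row or whole-column operation rather than a tuple-level write.

```python
# Illustrative sketch of a high-level transformation log (invented
# operation names; not the transform classes of the cited language).

def delete_rows(rows, predicate):
    """Drop every row matching the predicate."""
    return [r for r in rows if not predicate(r)]

def map_column(rows, column, fn):
    """Apply a value-level fix to one column of every row."""
    return [{**r, column: fn(r[column])} for r in rows]

# The log is an ordered list of (operation, arguments) pairs that can be
# replayed on a fresh copy of the data, or shipped to another user.
log = [
    (delete_rows, {"predicate": lambda r: r["sales"] is None}),
    (map_column, {"column": "region", "fn": str.strip}),
]

def replay(rows, log):
    for op, kwargs in log:
        rows = op(rows, **kwargs)
    return rows

data = [{"region": " West ", "sales": 120},
        {"region": "East", "sales": None}]
print(replay(data, log))  # [{'region': 'West', 'sales': 120}]
```

Because the log stores operations rather than resulting values, another user holding a different copy of the same data can replay or selectively apply the same cleaning steps.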
This set of operations is provably sufficient to cover all one-to-one and one-to-many transformations that structured data could possibly undergo [44, 52]. A log generated in this language would solve the task of keeping track of user transformation history.

A major hurdle on the way to effective transformation logging is to force the user to stick to using only the meaningful transformations defined by our transformation language. As it is, most common spreadsheet software allows users to perform ad hoc changes to data that might be difficult to interpret as transformations on rows and columns. This problem will be further discussed below.

Since both source and application owners are allowed to arbitrarily make changes not only to the data, but to schemas as well, our system must be capable of mapping data between the application and source schemas before propagating it back. This requires a schema mapping module, common in previous work on data exchange and materialized views. Such a module would have to be incrementally updated whenever either side makes changes to its schema.

2.2 Scenario 2: Changes in Source Data

Aspiring to improve data quality at the sources by propagating application changes back to them is a worthy objective, but on its own probably not worth the extra effort from the application user's perspective. His data is now clean and transformed to his liking, so the added overhead of logging actions and sending back updates could be perceived as excessive work. To add to his benefit, we introduce the source-centric scenario.

Data is not only cleaned and changed at the application end of the system. Both operations will also be performed every now and then by the administrator of the original data source. This would result in value updates and the occasional schema update. Insertions of new data, or deletions of old data, would be even more common at the source, especially if the source is a transactional database.
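Source-side changes of this kind (value updates, insertions, deletions) could be captured in a timestamped update buffer, so that a user who checked out a copy at some earlier time can later be offered exactly the changes he has not yet seen. The sketch below is an assumed design for illustration, not a component specified in this thesis; all names are invented.

```python
# Hypothetical source-side update buffer: the source logs each change
# with a logical timestamp, and a user who checked out data at time T
# is offered every later change to accept or deny.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Update:
    ts: int                        # logical timestamp of the change
    key: str                       # primary key of the affected row
    kind: str                      # "insert", "update", or "delete"
    payload: Optional[dict] = None # new values, if any

@dataclass
class UpdateBuffer:
    updates: list = field(default_factory=list)

    def record(self, u: Update) -> None:
        self.updates.append(u)

    def pending_for(self, checkout_ts: int) -> list:
        """Changes the user has not yet seen; each may be accepted or denied."""
        return [u for u in self.updates if u.ts > checkout_ts]

buf = UpdateBuffer()
buf.record(Update(ts=3, key="r1", kind="update", payload={"gdp": 4.1}))
buf.record(Update(ts=7, key="r9", kind="insert", payload={"gdp": 2.5}))
print([u.key for u in buf.pending_for(checkout_ts=5)])  # ['r9']
```

Filtering on the checkout timestamp is what lets the system propose only new changes, instead of forcing a full refresh of the user's copy.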
Whatever changes are made to the source data, they are likely to affect in some way any application working with an older copy of the now-changed data.

In this scenario, we assume that a user is already working with his own copy of the source data before the original data is updated. To illustrate, we can imagine an academic working on a model that takes as input development indicators about the state of affairs in African countries. The indicators are maintained by the World Bank and made available as raw data files on the institution's public website. As new data becomes available, on a semi-regular basis, the indicators are updated and new files are published. We must now solve the problem of how our system will propagate these changes to the academic, essentially offering him to refresh his current copy.

To meet the challenges of this use case, our system must be capable of:

• Recording changes made at the source.
• Remembering the query that originally generated the user's data.
• Knowing when the user checked out his copy, or when he last received updates from the source.
• Mapping data between the source and user schemas.
• Updating schema mappings in a timely fashion if either source or user schemas change.
• Applying applicable mappings the user has defined on new data.

In many ways, we can model the relationship of the original source and the user's data copy after the relationship between a database schema and a materialized view defined on the schema, the difference being that we do not require the user's data to be a queryable database table. The task of propagating updates from the source to the user is now akin to the problem of view maintenance in data warehouses [51].
A key difference between our scenario and materialized views is the fact that we do not require the user's data to be fully synchronized with the source after a data refresh.

The main reason view maintenance algorithms come into play here is that a user's data copy from a source is essentially the result set of some query the user evaluates at a given time. If the query is complex, for example involving aggregation or particular kinds of projections, this requires more sophisticated techniques for synchronizing the materialized view [26]. For example, a simple projection can produce duplicate rows that have to be accounted for when making insertions and deletions in the view. To support our desired functionality for this scenario, our system will likely maintain a store of metadata on the user side to keep track of row-level statistics. It is possible that data provenance techniques could be enhanced to contribute to this module. Further research is needed on whether we must restrict users to some subset of queries for our system to work as proposed.

Like before, we assume that a user would like to perform data cleaning operations and transformations on the new data coming from the source. This is mostly handled under Scenario 1 in the discussion above. In addition, we would like our system to remember previous mappings that the user has made and deemed feasible, so that they can be applied on the fly to newly inserted data through an update propagation from the source. This would potentially save the user the time otherwise spent manually re-applying past changes. Not every transformation is suitable as a stored mapping. For example, errors that are corrected during data cleaning are often specific to a single value, and do not generally apply to other data rows. Our system will likely require user input on what mappings should be applied to all incoming data.
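The duplicate-row problem mentioned above is classically handled by tracking a multiplicity for each view row, as in count-based view maintenance. A minimal sketch of the standard idea (not code from this thesis):

```python
# Count-based maintenance of a projection view: since several base rows
# can project onto one view row, the view stores a multiplicity per row
# and only deletes a row when its count reaches zero.
from collections import Counter

def project(rows, cols):
    """Materialize a projection as a multiset: view row -> multiplicity."""
    return Counter(tuple(r[c] for c in cols) for r in rows)

base = [{"id": 1, "city": "Vancouver"}, {"id": 2, "city": "Vancouver"}]
view = project(base, cols=["city"])
assert view[("Vancouver",)] == 2

# Deleting base row id=2 only decrements the count; the view row
# survives because another base row still projects onto it.
view[("Vancouver",)] -= 1
view = +view  # unary plus drops entries whose count fell to zero
print(view)   # Counter({('Vancouver',): 1})
```

This is the kind of row-level statistic the user-side metadata store would have to track so that source deletions can be applied to the copy correctly.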
2.3 Scenario 3: Data Integration with Limited Storage

Many data applications are designed not to store data locally, but rather to retrieve it over a network only when needed. In one classical data integration situation, data is stored remotely in any number of sources, and queries made over a universal schema return data to the user [28, 29]. Another situation that is becoming popular is when web applications make their data available to other developers through an application programming interface.

The assumption in this scenario is that after the data has been acquired from its remote origin, it gets fed directly to an application, and is never stored on the user side. This can be for a lack of local storage, but in most cases the reason is the user's need for the data to be as fresh as possible. By never storing data locally, the user is also free of all the view maintenance concerns that were discussed in Scenario 2 above.

To motivate this scenario, we begin with an example where a group of researchers has been looking at data about UBC's campus. Among other things, they have looked at an application by a local company (Cloverpoint) that is an interactive visualization of the UBC campus. This application is enhanced with data from many different sources, one of which is the open data portal of the City of Vancouver. One of multiple public datasets made available by the city authorities is called "city trees", with information about every single tree within the city limits, including its species, age, and geographical location.¹ When loaded into the UBC campus visualization, errors became evident in the data. For example, duplicates (trees on top of trees) were discovered in some locations, due to rough estimations when the data was recorded.

¹ See data.vancouver.ca/datacatalogue/streetTrees.htm.
The key problem here is that since the data is not kept in local storage, even after errors have been identified, there is no easy way to fix them.

To meet the challenges this scenario has outlined, our system has to:

• Define and store transformations locally for individual rows of data.
• Apply stored transformations to the correct data at runtime.
• React correctly to new updates made at the source with respect to application transformations.
• Propagate suggested transformations back to the source.

We can assume that the application user's objectives here, namely storing transformations and adding them to new data where they apply, are identical to the ones outlined in Scenario 1 above. The only difference is the lack of local storage. The user is still interested in cleaning and transforming his data, but since he simply can't manipulate it in place, we need to adjust our system architecture to ensure persistent user transformations. Like before, we are interested in capturing transformation logs when our user manipulates the data. The act of logging should now result in an executable script that gets applied to the data every time it is loaded into our application.

A not so subtle difference between the saved transformations here and the case where local storage is assumed is that we are now interested in applying all changes the user has made, and not just some general transformations applicable to new data. Because of that, the added complexity of including and mapping all changes in data values to the correct rows must be carefully considered in the system design. However, this does not eliminate the need to keep track of general transformation schemes so they can be applied to new incoming data.

Besides a different setup and the application user's new interface with his data, the goal of our system ultimately remains the same.
We want it to propagate updates between collaborating agents (the user application and its data source) while at the same time maintaining all changes between application sessions. We still expect our system to notify the data source of changes made by the user, and vice versa. The operation of those components should be fairly similar to how they are described above, with the possible addition of a layer to check if original data anomalies still persist. Such a layer could help prune the transformation script before data is loaded into the application.

2.4 Scenario 4: Unknown Transformations

A crucial component of our proposed system is the transformation log of all changes made by the user. Without knowing exactly what those changes are, any intentions of propagating updates back to the source are moot. The task of forcing the user to make meaningful transformations and capturing them for later use in a transformation script has been closely studied [38, 39, 52]. For such solutions to work, new tools must be developed, and people who work with that data must use those tools. The industry is already pushing in the right direction², but until new solutions gain widespread traction, we must still assume that most data wrangling will be done using common spreadsheet software or ad hoc scripts.

² See www.trifacta.com/product/technology/ for one example.

Restricting our work to the wonderful world where everybody uses the best tools to make life easier for our system would be great, but since that world is not here yet, we are forced to eventually address the situation on the ground. Perfect transformation logs are unlikely to be captured in practice. For the sake of generality, we must therefore consider our options for the scenario where all information on user changes is absent.
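In that situation, the only raw material is a "before" copy and an "after" copy of the data. A naive first step, computing a tuple-level diff keyed on a primary key, can be sketched as follows; this is illustrative code for the comparison step only, and inferring an actual transformation script from such a diff is the hard, open part.

```python
# Naive before/after comparison: with no transformation log available,
# key both copies on an assumed primary key ("id" here) and emit
# candidate tuple-level edits. Columns dropped between the copies are
# ignored in this sketch.

def diff(before, after, key="id"):
    b = {r[key]: r for r in before}
    a = {r[key]: r for r in after}
    edits = []
    for k in b.keys() - a.keys():
        edits.append(("delete", k))
    for k in a.keys() - b.keys():
        edits.append(("insert", k))
    for k in b.keys() & a.keys():
        changed = {c: a[k][c] for c in a[k] if a[k][c] != b[k].get(c)}
        if changed:
            edits.append(("update", k, changed))
    return sorted(edits, key=str)  # deterministic order for inspection

before = [{"id": 1, "name": "ARNI"}, {"id": 2, "name": "x"}]
after = [{"id": 1, "name": "Arni"}]
print(diff(before, after))  # [('delete', 2), ('update', 1, {'name': 'Arni'})]
```

Even this simple step presumes a stable key shared by both copies; generalizing past that assumption is part of why the inference problem is open.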
In this case all we have to work with are copies of the data before and after possible changes were made. The question at hand is this: Given relational data from a source database and a modified cleaned copy of the same data, is it possible to automatically infer the script of transformations that modified the source copy with no additional information available? The answer is we don't know. To present a definitive solution to this problem would require substantial extra work that is out of the scope of this study. It is, however, a problem worth mentioning. It serves as an important direction for future research, and should our system be implemented, its solution would be greatly beneficial.

There are at least two possible ways to tackle this problem: the smart way and the dirty way. The former has a focus on data mining, and the latter requires more draconian comparisons of the data sets. Recently there has been a lot of interest in employing data mining techniques to aid data cleaning systems [27, 35, 38, 39, 52]. Some techniques proposed include various similarity metrics or clustering methods. We have reason to believe that these can be reverse engineered or used to focus the efforts of automatic change detection. Another approach would be to simply compute the difference between the two data instances and try to infer a transformation script that way. Parts of this method would likely be somewhat similar to the Diff operator from the Model Management literature [7].

Chapter 3

System Overview

Guided by the case studies presented in the previous chapter, we aspire to lay out in broad strokes the key components of a general update propagation system. We start with a discussion, in Section 3.1, on the general overview of the system and highlight its two types of users.
We continue in Section 3.2, and begin describing the various system modules, their different functions, and the challenges that might arise in implementation.

3.1 General Overview

We propose a system to manage updates and transformations in data sharing scenarios, the outline of which can be seen in Figure 3.1. A more detailed discussion on each module and its purpose follows below; but for now, let's direct our focus to the big picture operation of the system.

We assume that the operation of the system is split between two distinct entities: data sources and data applications. We further assume that an instance of our system contains any number of data sources and any number of applications (one of each in the simplest case). The examples we will consider are all limited to a single application, but increasing the kinds of applications should follow trivially from our discussion. We expect the input data to the applications to come either from conventional databases or a data integration system, as discussed in Scenarios 1 and 3 respectively in Chapter 2. In the same vein, a data application is taken to be anything that takes structured data as input. This would also cover a simple data analysis, where the application could consist of various statistical functions and visualizations. In Figure 3.1, the split between the two interacting system entities is marked by a dotted vertical line.

We expect two types of primary users to interact with our system: a database administrator for each data source and an application user. The difference is important because we assume a different background and skill set for each type. The database administrator is assumed to be an expert in data management and storage, but not necessarily familiar with the semantics of the data. The application user, however, is assumed to be a domain expert for the particular data his application is using.
We argue that this expertise makes the application user better suited to judge many data anomalies that are expected to exist in the data. The person we refer to as the application user is an individual who is responsible for managing all data after it is received from the data sources and before it enters the application. For our purposes, we won't concern ourselves with any secondary consumers of the data application.

3.2 System Components

3.2.1 Data Sources

Any number of data sources should be able to tie to the system and serve as the main repository of its data. Such a repository should store data in a structured relational format for two main reasons. We need the data to be in a rigid format so that transformations down the line are meaningful and well behaved. We also require the data sources to be queryable, since defining and keeping track of data copies will be far easier that way. Most of our use cases expect data sources to be traditional transactional databases, requiring our system to expect and handle regular insertions and deletions of data rows.

[Figure 3.1: The general architecture design of the proposed system. Modules to the left of the dotted vertical line represent modules that interact with the input data sources, while modules to the right of the line interact with the data application.]

3.2.2 Data Update Store

Since update propagation between sources and applications is not expected to be continuous, we must include a module to act as a buffer for updates, made at the source, that have not yet been applied to a particular user's data copy.
We call this component the Data Update Store and maintain it at every data source, one section for each checked out data copy. This component should be similar to the transformation log on the user side, but smart enough to only store transformations that affect the specific data that resides with the user. If the user does not store his data locally, the data update store might not be necessary. Even in that case, we would still argue for its importance, because notifying a user of changes made at the source so he can evaluate them for his purpose would likely be desired.

3.2.3 User Data Definitions

Given a relational database D as a data source, a user data definition is a query over D that defines the specific data copy checked out by a user application. A user data definition is similar to the query definitions of logical views, and at the time of checkout the result of it, sometimes called a report, defines the exact instance of data the corresponding user application receives. In order to know what data updates apply each time, our system must maintain a table of definitions at every data source for any user applications that it serves. Data files produced by the source for distribution should also be tied to their general query definition.

3.2.4 Data Files

It is unlikely that data applications will in all scenarios have direct access to their working data at the sources. In a similar vein, users can't be expected to always have the capabilities or the schema understanding to query a database directly for what they want. In those cases, and most cases where certain data is being made publicly available, a database administrator will compile a data file in a standard format.
The file could be a spreadsheet, CSV file, or any popular format known to ease data exchange between parties. Our system design must gracefully handle the intermediate step of data files in its use cases.

3.2.5 Data Integration

Querying multiple heterogeneous databases for data is an increasingly common task, even in otherwise simple applications. If the user is only presented with the query result, and the data still resides in its original places, this is called data integration. This problem has been heavily researched in the last decades [28, 46] and remains of some interest. Since we want to allow data input to our system to come from an integration of such databases, we must consider the standard interfaces available to external systems in those scenarios.

A common approach is to describe data sources as view expressions under a mediated universal schema. The user queries the universal view as a single logical data source, and the rest is taken care of. For this to work, schema mappings, sometimes called source descriptors, must be generated and maintained by some means. The design of our system should support a common module for data integration.

Schema Mappings

The relationship between data sources and their universal view in data integration systems is described with a mechanism called schema mappings [47]. They are a collection of formulas in some logic, usually a convenient subset of first-order logic, that express all constraints between a source schema and a related target schema. If a user application needs to change the structure of its data copy, such schema level differences must be recorded with schema mappings and stored on the user side.

In the case where a user application stores the full data copy locally, schema mappings can be used to map directly between a source schema, possibly a logical schema representing many sources, and the application schema.
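As a concrete illustration (the relation names are hypothetical), a constraint stating that every player row in a source relation must have a counterpart in an application's restructured table could be written as the first-order formula

```latex
\forall i, n, t, s, c \;\bigl(\mathit{Players}(i, n, t, s, c) \rightarrow \exists b \; \mathit{AppPlayers}(n, t, s, b)\bigr)
```

Read left to right, the formula says that every Players tuple must be represented in AppPlayers, with the new attribute b left unconstrained.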
For this purpose, two distinct classes of mappings are common: the so-called tuple-generating dependencies on one hand, and equality-generating dependencies on the other [24].

Data Exchange

Data exchange is one of the oldest fundamental problems in database research. Given an instance of source data and a schema mapping between source and target schemas, data exchange is the problem of generating a target instance that adheres to all constraints.

When data copies in our system are stored locally by applications, all new data and updates coming in from data sources must go through data exchange in some consistent manner before they can be considered by users. Unless we can agree on what constitutes a best solution, the solution space in data exchange is not deterministic. Therefore, a data exchange module in our system must be careful to generate predictable solutions. The fact that the target schema has in our case directly evolved from the source schema could possibly make this problem better behaved than the most general case.

3.2.6 Update Consolidation

Every data application in our system should have an update consolidation module. This part of the system acts as a funnel for incoming data updates, and its inputs come from various sources after schema mappings have been applied. The update consolidation phase transforms them into the application schema, detects conflicting updates from different sources, and finally presents a set of possible updates in human-readable form, so that the application user can choose which updates to apply to his data. If possible, updates should not be presented at the row level but at some appropriate aggregation that groups together related updates if they apply to many different rows.

3.2.7 User Mappings

The data integration scenario where a user application fetches all data at runtime and never stores the bulk data locally causes an interesting conundrum.
We want and expect the user to be able to make changes to and clean his data as soon as issues are discovered. The problem is that without local storage, changes need to be applied anew every time the data is loaded. We propose that our system solve this issue by storing so-called user mappings in a local update table. Now when data is loaded, stored transformations from the local update table are applied to the data before the application itself sees it, thereby simulating persistent storage. User mappings have two flavors that we discuss separately. Their key difference is how generally they can be applied to previously unseen incoming new data.

3.2.8 Standard ETL Transformations

Extraction, transformation and loading (ETL) is the name given in the literature to the process used to populate data warehouses. Data is typically sourced from many transactional databases, transformed to fit a new schema, and cleaned before it is loaded into the data warehouse and its materialized views. What we refer to as standard ETL transforms are operations that research papers [17, 55] describe as happening in the Data Staging Area. These include schema level transformations, various application specific mappings, and cleaning operations. Any changes defined by a user application that are general enough to be applied to new incoming data we consider standard ETL transformations.

3.2.9 Data Cleaning Changes

We mention data cleaning specifically as a special case of user mappings because these changes will generally not apply to new data as it arrives from the sources. These changes will generally be at the row-value level, such as deletion of duplicates or corrections of wrong input, rather than general data cleaning operations like changing the format of date fields. A core reason for separating the two is the fact that before applying these transformations we need to correctly identify their corresponding rows as they are loaded.
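One way to re-identify rows on load, sketched below, is to remember a fingerprint of each corrected row's identifying values at the time the correction was defined; the function names and the fingerprint scheme are assumptions for illustration, and a real implementation would prefer a stable key when one exists.

```python
import hashlib

# Sketch: re-identifying rows so that stored value-level corrections can be
# re-applied on load. Falls back to hashing identifying columns when no
# stable key is available. Illustrative only, not a fixed design.

def fingerprint(row, columns):
    """Hash a fixed subset of a row's values to identify the row later."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha1(payload.encode()).hexdigest()

# A stored correction remembers the fingerprint taken when it was defined.
corrections = {}  # fingerprint -> {column: corrected_value}

def remember_correction(original_row, key_columns, column, value):
    fp = fingerprint(original_row, key_columns)
    corrections.setdefault(fp, {})[column] = value

def apply_corrections(rows, key_columns):
    """Patch freshly loaded rows whose fingerprints match stored corrections."""
    out = []
    for row in rows:
        patch = corrections.get(fingerprint(row, key_columns), {})
        out.append({**row, **patch})
    return out
```

Note that a corrected column must not itself be part of the fingerprint, or the match would break after the first correction; this is one face of the row-mapping complication discussed next.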
This complication might cause problems, and must be kept in mind and solved in a system implementation.

3.2.10 Transformation Phase

This last phase before data is loaded into a user application will often prove to be the most time consuming. This is where incoming data is carefully audited and all transformations specific to one application are designed and implemented. Transformations can be divided into two classes that roughly follow along the lines of the user mapping groups above. The first class includes all data restructuring: operations such as dropping or combining columns, pivoting rows with columns, and other schema-altering changes.

In most cases the application owner will have to perform a near-manual inspection of the data to identify data quality issues. Such issues can be anything from erroneous and missing values to duplicate records and type inconsistencies stemming from integrated data. Not only is the detection of dirty data issues difficult, but the design and evaluation of transformations to successfully rectify them can be very hard. In fact, data cleaning is so tedious in practice that studies have estimated that up to 80% of development time (and cost) in data warehousing projects is spent on just discovering and correcting these issues [21].

For an implementation of the transformation phase we recommend an interactive approach, similar to the one introduced by the Wrangler project [38]. A design like that aims to make it simple for users to find and express meaningful transformations while limiting the need for manual repetition.

3.2.11 User Application

A design goal of the system architecture is sufficient modularity, so that individual parts can be interchanged or replaced as needed. Different system modules generally focus on discrete problems, many of which represent whole bodies of research.
An irreplaceable heart of the system remains the data application, without which the feedback loop we are looking for would be impossible.

In a traditional data project the time consuming work of the transformation phase is paid back, hopefully, by value extracted from the data in some application. A user application could be any piece of software that takes data as an input, for example a clever visualization or manual spreadsheet computations made by a business analyst. Sometimes data quality issues are not discovered before the runtime of the application. Our system would have to support on-the-fly data updates on those occasions.

3.2.12 Transformation Scripts

Logging user changes during the transformation phase or application runtime serves two main purposes. First, it is the genesis of all user mappings to be applied at a later execution time. Second, it serves as input for all modules that propagate these updates back to the original data sources.

Transformations in our system should be recorded as an actionable script in some declarative data transformation language. Again the Wrangler project [38] provides the direction by introducing a language design, based on earlier languages [44, 52], that seems well suited for our purposes. The language operates on tabular data and is composed of nine classes of transformation operators:

• Map transforms a data row into zero or any number of data rows.
Map operators include row deletion, value updates, arithmetic, and splitting into multiple columns or rows.

• Reshape operators perform schema-level changes by either folding multiple columns to key-value sets or unfolding by creating new column headers from data values.

• Positional transforms update table values by using information in neighboring rows, generating new values or shifting values in place.

• Schema transforms change column names, data types and semantics.

• Lookups, joins, sorting, aggregation and key generation are further classes, mostly self-explanatory, that are not of key interest in our discussion.

The set of transformations from the Wrangler project is large enough to cover most common data wrangling and cleaning tasks. In fact, it is provably sufficient to handle all one-to-one and one-to-many data transformations [44, 52]. Despite this coverage, our system must take special care when it comes to schema changes. They must be treated so as to facilitate incremental updates of schema mappings, possibly to more than one data source.

3.2.13 Transformation Inference Engine

As discussed at some length in Section 2.4 above, transformation logs may not always exist for outstanding data copies. In those cases it is worth considering whether one can be inferred by comparing the original data copy to the modified current instance. The desired output should be similar to the transformation script described above. The feasibility of this module is not investigated in this study, but it would likely draw from a wide variety of prior work on automatic data cleaning and integration [23, 30, 32].
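Under the simplifying assumptions that rows carry a stable identifier and that the schema is unchanged, a first-cut "dirty" comparison of the two instances might look like the following sketch; the record format and function name are assumptions for illustration.

```python
# Sketch: inferring value-level transformations by diffing the original data
# copy against the cleaned instance. Assumes a shared stable key and an
# unchanged schema, precisely the conditions that are hard to guarantee
# in practice, which is why this remains an open problem.

def infer_value_updates(original, cleaned, key):
    """Return inferred (key_value, column, old, new) update records;
    a row missing from the cleaned copy yields a deletion record."""
    cleaned_by_key = {row[key]: row for row in cleaned}
    updates = []
    for row in original:
        after = cleaned_by_key.get(row[key])
        if after is None:
            updates.append((row[key], None, row, None))  # inferred deletion
            continue
        for column, old in row.items():
            if column != key and after.get(column) != old:
                updates.append((row[key], column, old, after[column]))
    return updates
```

Reversing the direction of the comparison would similarly surface inserted rows; schema changes, however, break the cell-by-cell alignment this sketch relies on.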
This problem is even further complicated if changes to the data have been made at the source, or schema changes are lost.

3.2.14 Update Propagation to Sources

Since the most valuable data cleaning revisions happen at the application level, with domain experts applying their knowledge and adding constraints, the propagation of those updates back to corresponding data sources becomes a critical part of our system design. This module is also particularly challenging because its implementation and technology would likely rely on uncharted territory in active research fields, and a carefully constructed scripting language [38]. We note two different approaches for this module below, each with nice features but distinct challenges.

Not all updates are created equal. A data source administrator is most likely not interested in knowing about every single change made to copies of their data by application users. For example, if a user changes the attribute name of a column or splits it in two, this change is probably not of value to the original data source. Therefore it shouldn't get propagated back. We are only interested in sending along updates that increase the overall value of the data. Such changes are those that increase data quality for all future users of the data and lessen the load on future data cleaning operations. Few schema level changes are likely to apply, so an implementation of our system would do well to focus on only propagating Map transformations, as defined above, from applications to data sources.

After updates have been propagated to the sources, our system must keep track of that fact and not attempt to resend updates later. One way to implement this would be to organize updates with a stack, so that they are not revisited after they have been popped off and propagated once.
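The stack-based bookkeeping just described can be sketched as follows; the class and method names are illustrative assumptions rather than a fixed interface.

```python
# Sketch: organizing outgoing updates as a stack so that each one is
# handed to the source at most once. Names are illustrative only.

class OutgoingUpdates:
    def __init__(self):
        self._pending = []   # updates not yet sent to the source
        self._sent = []      # popped updates, kept only for bookkeeping

    def log(self, update):
        self._pending.append(update)

    def propagate(self, send):
        """Pop every pending update and hand it to the source exactly once."""
        while self._pending:
            update = self._pending.pop()
            send(update)
            self._sent.append(update)
```

Because propagated updates move to a separate list, a second propagation round sends nothing new.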
Even better, if a data source rejects an update, this should be remembered so the system doesn't attempt to propagate rejected updates back and forth the next time the application fetches or loads the same data copy. This would be another bit of metadata that could be stored alongside the user mappings, since they should reflect all past application changes.

SQL Update Generation

The first approach to an update propagation module takes a transformation script from the data application as input and generates a list of corresponding SQL statements, executable on the data sources. This goal raises a number of questions that need to be addressed in a successful implementation.

For one, although it has been established that a declarative scripting language is capable of expressing all data transformations, it is not certain whether a clear and unambiguous correspondence exists between transforms and SQL statements. A further complication is caused by the fact that a transformation script would likely operate on numbered lines in a data table, possibly making it harder to keep track of individual tuples as time goes on. Even if such a one-to-one mapping could be established, a translation between the two would remain a difficult task.

Another fundamental question is where updates need to be propagated. If the application sources data from more than one origin, some mechanism is needed to pinpoint the genesis of the particular set of data affected by an operation. Integrating data from different sources might also have altered data or combined it in a way that adds to the complexity of knowing where it came from. A possible solution for this problem may be found in the concept of data lineage. A hot topic in recent literature, data lineage [11, 13] is essentially any piece of metadata that captures the history of data, where it came from and how it has been changed.
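To make the idea concrete, one logged value update plus tuple-level lineage might be translated into a source-executable statement roughly as sketched below; the lineage and mapping structures are hypothetical simplifications, and proper SQL quoting and escaping are omitted.

```python
# Sketch: translating one logged value update into a source-executable SQL
# statement, using tuple-level lineage to locate the original row.
# All structures are hypothetical simplifications; no escaping is done.

def to_sql(update, lineage, schema_mapping):
    """update: (app_row, app_column, new_value)
    lineage: app_row -> (source_table, key_column, key_value)
    schema_mapping: app_column -> source_column (identity mappings elided)."""
    app_row, app_column, new_value = update
    table, key_column, key_value = lineage[app_row]
    source_column = schema_mapping.get(app_column, app_column)
    return (f"UPDATE {table} SET {source_column} = '{new_value}' "
            f"WHERE {key_column} = {key_value};")
```

For example, the salary correction from the running NBA example, tied by lineage to source key 343, would come out as a single UPDATE on the Players table.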
For our purposes we would likely require a low level approach, one that captures data lineage at the tuple level.

As when new data is loaded from sources to applications, we assume that table schemas will generally diverge to some extent. If possible, those changes are kept up-to-date in the form of schema mappings. After deciding which source is to receive a specific update, the SQL generator must interface with the data exchange module, retrieve relevant schema mappings, and produce source-specific SQL statements.

Error Explanation

A slightly different approach to update propagation is the one of error explanation and action prescription introduced in a data cleaning system by Chalamalla et al. [14]. Instead of translating the transformation script to SQL statements and giving only a data source administrator the power to either accept or reject individual updates, the methodology is now a twofold process.

First, when violations are discovered during data cleaning on the application side, an attempt is made to summarize them as accurately as possible using predicates on the data copy. This is called the error explanation phase. When errors in violating data tuples have been repaired, their lineage is traced back to the sources, where all tuples that made contributions to them are identified. They can now serve as input to a mechanism that would propose a course of action for the administrator to fix the perceived problem, the so-called action prescription. This idea is revisited in more detail in Chapter 5.

Chapter 4

Analyzing Case Studies

In an effort to further flesh out system requirements, and to show how the modular design discussed in the last chapter ties in with the standard operation of our system, we will now revisit the common use cases from Chapter 2.
By showing how user action moves data through the system, and how modules interact in different scenarios, we try to motivate the current design while shedding light on problematic parts that need to be solved before an implementation can be achieved.

4.1 Scenario 1

4.1.1 Retrieving and Transforming the Data

The setup for our first main use case is as follows: A single application user checks out a data copy from one data source and stores it locally on his own server. One might think that this use case could be trivially extended to include multiple users and many sources, but that leads to complications to be discussed in a separate section.

As a running example, we imagine the case of a beat reporter working for the National Basketball Association (NBA) who wants to put together an analysis on some current NBA players. To achieve this she needs to use data that is hosted in one of the NBA's databases, a sample of which can be seen in Table 4.1.

  player id  last name  first name  team         salary  college
  231        james      lebron      cleveland    19.07   null
  321        love       kevin       minnesota    13.67   ucla
  343        duncan     tim         san antonio  103.6   wake forest
  576        parker     tony        san antonio  12.50   null

Table 4.1: Sample data from a relational data source containing information about current NBA players.

The table contains information on player names, their team affiliation, salary, and what college they played for before joining the NBA. The missing data in the college column is caused by the corresponding players not having had a college career. The exact nature of the reporter's application is not important, but for our purposes we can imagine some standard statistical analysis or visualization. Furthermore, we make no assumption on the reporter's technical aptitude when it comes to querying databases and assume that she must submit a request for the data, either through some interface or by contacting an administrator directly.
We will, however, assume that our reporter is very familiar with the data itself, and has enough knowledge to accurately judge the quality of the data.

The reporter is interested in checking out all the data contained in Table 4.1, except the player IDs. This subset of data could be defined by the simple query

  SELECT last_name, first_name, team, salary, college
  FROM Players;

assuming the name of the table. We interpret the reporter's request as being submitted by her application, and the process is followed as step 1 in Figure 4.1. In response, the database administrator will now create a structured data copy containing the result of the query.

     last name  first name  team         salary  college
  1  james      lebron      cleveland    19.07   null
  2  love       kevin       minnesota    13.67   ucla
  3  duncan     tim         san antonio  103.6   wake forest
  4  parker     tony        san antonio  12.50   null

Table 4.2: A user defined subset of the source data. No transformations have been made.

This process would be very similar to the creation of a materialized view, with the exception that the end result is not necessarily a relational table to be stored in a database, but more likely a simple spreadsheet-like dataset. In our example, the data copy resulting from this process will now be stored locally with the reporter and can be seen in Table 4.2. Note that, strictly speaking, the relational nature of the table has been broken and the numbered rows are only printed for reference.

For the system to work correctly, the data source must keep a record of all outgoing data copies. The semantics of the reporter's original query are given by what we have dubbed a user data definition: a tuple of metadata with information on the query that generated the data as well as all necessary contact details about the reporter.
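Such a metadata tuple could be represented as sketched below; every field name here is an illustrative assumption, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Sketch: the metadata tuple a source might keep for each checked-out copy.
# Field names and values are illustrative assumptions only.

@dataclass
class UserDataDefinition:
    definition_id: int
    query: str                # the query that produced the data copy
    user: str                 # who checked the copy out
    contact: str              # where update notifications should go
    checked_out_at: datetime

reporter_checkout = UserDataDefinition(
    definition_id=1,
    query="SELECT last_name, first_name, team, salary, college FROM Players;",
    user="nba_reporter",
    contact="reporter@example.com",
    checked_out_at=datetime(2014, 10, 1),
)
```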
The user data definition is now stored with the data source, possibly in a relational table contained within its database management system.

When data is checked out of a source for use in a particular application for the first time, a one-to-one mapping will automatically exist between the schema of the data table that exists at the application side and the user data definition that is stored at the source. No changes have yet been made by either side, so the schemas remain the same. This information must still be encoded, and therefore the data exchange module on the application side is initialized with one-to-one schema mappings for every attribute. The recording of the user data definition and the initial schema mappings can be seen as steps 2 and 3, respectively, in Figure 4.1.

[Figure 4.1: Scenario 1. The application makes a request to the data source. The data source reacts, supplies a data copy that is transformed and cleaned on the application side. All transformations are logged in a script, some are generalized as user mappings, and the data is now ready for the application.]

Having received the data copy of her choice, the reporter in our example can now start wrangling her data to fit the structure she needs for her application. At this point she will also start examining the data for inconsistencies or errors that need to be corrected. Our system is now in the transformation phase. The first thing that catches our analyst's eye is the missing values in the college column. Being an expert in the data, she quickly realizes the reasoning behind the missing values. The reporter decides that this is insufficient for her analysis and makes a twofold change. In place of the missing college values, she inputs the name of the last team the
corresponding player last played for before joining the NBA. Since none of the new values are names of legitimate college teams, the reporter renames the column before_nba and introduces a whole new column, before_type, that explains at what level the team in the previous column plays. The data copy has now undergone a series of value updates and two schema changes, as can be seen in the last two columns of Table 4.3.

     last name  first name  team         salary  before nba   before type
  1  james      lebron      cleveland    19.07   akron        high school
  2  love       kevin       cleveland    13.67   ucla         college
  3  duncan     tim         san antonio  10.36   wake forest  college
  4  parker     tony        san antonio  12.50   paris        pro

Table 4.3: The user defined data after the transformation phase. The data can now serve as input for the user application.

With the data now in the correct structural format for her analysis, the NBA reporter looks it over one last time and discovers a couple of anomalies. The first one is the exceedingly high salary of $103.6M listed for San Antonio's veteran star Tim Duncan. Due to restrictions on yearly salaries in the NBA, this high amount is impossible. Because those restrictions are easily quantifiable and can be stated as a business rule to be applied to the data instance, they also fall into a category of data errors that are discoverable by semi-automatic data cleaning software. The reporter concludes that the decimal point is wrongly placed, likely due to a human input error. The other error is harder to discover automatically because that one is caused by an outdated value. The prolific power forward Kevin Love joined the Cleveland Cavaliers in a trade in August 2014 and this is not reflected in the data. Again the value of our analyst's domain knowledge is demonstrated for data cleaning. In Figure 4.1, the data cleaning and transformations made by the
reporter can be followed as step 4.

The nature in which an application user specifies his changes is important, because ideally we only want to allow robust transformations from a finite set of machine readable and reproducible operations. Since we also assume that the user interacts with the data through a spreadsheet-like interface, rather than programmatically, the goals of our transformation phase would be best met by a controlled interface where only robust changes are allowed. An example of this would be the Wrangler system interface [38], that looks like a common spreadsheet. The resulting changes are now logged in a declarative transformation language and stored on the application side. The semantics of a suitable transformation language need to be worked out; but to offer an idea, the resulting script from our example could look something like the following:

  columnName('college').to('before_nba')
  row(1).column('before_nba').setValue('akron')
  row(4).column('before_nba').setValue('paris basket')
  createColumn('before_type')
  row(1).column('before_type').setValue('high school')
  row(2).column('before_type').setValue('college')
  row(3).column('before_type').setValue('college')
  row(4).column('before_type').setValue('professional')
  row(3).column('salary').setValue('10.36')
  row(2).column('team').setValue('cleveland')

Logging all changes in this manner is almost sufficient, but two steps remain necessary for further housekeeping. The one-to-one mapping between the user data definition and the current table is now broken. To fix this, our system must be able to translate all schema changes in the transformation script, and update the schema mappings stored at the application side. The second necessary step is to classify value updates into general transformations and specific transformations, the difference being that the former will apply to all new incoming data from the source in the future. General transformations get added to a special user mapping store.
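This classification step can be sketched as below; the dictionary encoding of operations is a hypothetical simplification of the script format above, not its actual syntax.

```python
# Sketch: splitting a logged transformation script into general operations
# (re-applied to all future incoming data) and row-specific value updates.
# The operation encoding is a hypothetical simplification.

GENERAL_OPS = {"rename_column", "create_column"}  # row-independent changes

def classify(script):
    """Partition a script into (general, row-specific) operation lists."""
    general, specific = [], []
    for op in script:
        (general if op["op"] in GENERAL_OPS and op.get("row") is None
         else specific).append(op)
    return general, specific

script = [
    {"op": "rename_column", "old": "college", "new": "before_nba"},
    {"op": "set_value", "row": 3, "column": "salary", "value": "10.36"},
    {"op": "create_column", "name": "before_type"},
    {"op": "set_value", "row": 2, "column": "team", "value": "cleveland"},
]
general, specific = classify(script)
```

On this example the two schema changes land in the general list and feed the user mapping store, while the value corrections stay row-specific.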
The logging of transformations corresponds to step 5 in Figure 4.1.

Propagating Updates to the Source

Now that all data pre-processing work is done, our reporter is ready to make her analysis, and our focus switches to the update propagation protocol. The task now is to salvage all updates that might be of value to other users of the data, and send them back to the sources. A walkthrough of this process can be seen in Figure 4.2. Again, some classification of the application changes is in order. We perform a rough pruning of the transformation script to leave behind changes that are unlikely to be relevant outside the scope of the specific application. Transformations that would be of global interest are mostly the ones resulting from data cleaning, so the division here is likely to be the same as in the earlier generation of user mappings. Changes that simply massage the structure of the data to fit the application needs, or involve some general calculations and functions, would probably be left behind. However, this might not always be the case and needs a closer look in the future.

Once the system has picked the transformations that are to be propagated, the lineage of the tuples involved, stored through some mechanism on the application side, is examined to see what source it is affiliated with.

Figure 4.2: Scenario 1. Propagating updates to the source. First, the transformation script is pruned. Next, the remaining transformations are translated to SQL using the existing schema mappings and data lineage. The result is presented to the source. An alternative to SQL generation would be to attempt error explanation, as discussed in Section 3.2.

Lineage, also called provenance, is any metadata that captures the origin and history of a particular piece of data.
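As a rough illustration of the SQL generation step in Figure 4.2, a single cell-level edit could be translated into a parameterized UPDATE against the source once its lineage has resolved the application row to a source key. The table and column names below come from our running example; the function itself, and the shape of the lineage store, are hypothetical.

```python
def edit_to_sql(edit, lineage, table="players", key_column="player_id"):
    # edit: one logged setValue operation from the transformation script.
    # lineage: maps application-side row numbers to source key values.
    source_key = lineage[edit["row"]]
    sql = f"UPDATE {table} SET {edit['column']} = %s WHERE {key_column} = %s;"
    return sql, (edit["value"], source_key)

# Tim Duncan's salary correction: application row 3 -> source player_id 343.
stmt, params = edit_to_sql(
    {"row": 3, "column": "salary", "value": "10.36"}, lineage={3: 343}
)
```

Passing the values as query parameters rather than splicing them into the string keeps the generated script safe to hand to a source administrator for selective execution.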
We defer discussion on provenance to the next chapter, but stress the importance of keeping track of it for exactly this purpose. The NBA data in our example originates in one source, so in our case this objective is fairly straightforward. Even so, complications can quickly arise if the application data is the result of an aggregate query, and contributing tuples on the source side need to be pinpointed.

Chapter 3 introduced two fundamentally different approaches to how the system should proceed with update propagation. Now, knowing where particular transformations need to be sent back to, the first approach involves a module that would take as input the transformations in question as well as the necessary schema mappings to perform a data exchange operation with the source relations as targets. This module would generate an executable SQL script for each source and notify its administrator, so that he can choose which parts of it to apply to the source. The goal of the second method we mentioned would be closer to one of error explanation and action prescription. Instead of giving the source administrator a binary choice of updates to accept, we will now try to highlight and explain problematic tuples in the source, so that a more comprehensive course of action can be taken. The implementation of the second module needs some extra thought, but it would most certainly require even more detailed records of data provenance.

4.2 Scenario 2

We have dealt with the simple scenario of one source, one application, and the backward propagation of valuable updates. Now we consider a similar scenario from the perspective of new updates being made at the data source. This function of the system could be described as forward propagation. The motivation for this scenario was laid out in Chapter 2, and now we shall see how different modules of the system might interact to make it happen.
We continue with our example from the previous section, and imagine that the NBA data source makes updates to its data at some point in time after all the work already described has been done.

Specifically, we will consider the following changes, as seen in Table 4.4. For illustration, we only imagine two basic changes at this point, but any number of them could have been made. We notice that the input error in the salary column for Tim Duncan has been corrected, presumably resulting from our earlier backwards propagation. The second change is the addition of a tuple representing the NBA's top draft pick in 2014, new player Andrew Wiggins.

    player id  last name  first name  team         salary  college
    231        james      lebron      cleveland    19.07   null
    321        love       kevin       minnesota    13.67   ucla
    343        duncan     tim         san antonio  10.36   wake forest
    576        parker     tony        san antonio  12.50   null
    772        wiggins    andrew      cleveland    4.59    kansas

Table 4.4: The NBA sample data source after applying the highlighted updates. An earlier data error has been corrected, and a brand new tuple added representing Andrew Wiggins.

The astute reader might also notice the fact that the NBA analyst's update of Kevin Love's current team has been rejected by the data source administrator, possibly because the transfer has not been finalized, thus demonstrating the factor of choice in the application of presented updates.

Our task is the same as before, notifying all interested parties of changes made to data in the system, but now in the opposite direction. Simply reversing the roles of the source and application users is not sufficient because the heterogeneity of each party's technology causes issues that need to be addressed separately. An obvious first solution to this problem would be to regenerate every application user's data copy in its entirety, replace his current copy and reapply all his changes. This could potentially be a cumbersome operation if sources are updated frequently.
It would also not function smoothly in the data integration case we will consider later, so we decide to choose a different path in search of a solution.

Since the overall purpose of our system is not to maintain full continuous synchronization between users, but rather to make meaningful updates globally for everyone to apply at their own discretion, we will not attempt a design to incrementally propagate each update. Instead, we propose that every data source maintain a separate update buffer, a queue of available updates for consideration, for each active user application. Once fresh transactions start to accumulate in the update buffer, a flag of some sort is given to the corresponding application's manager so that he can at any point choose to refresh his data accordingly.

Figure 4.3: Scenario 2. Propagating new data and updates from a data source to the application. The update manager monitors the source for changes and loads the update buffers. The changes are then loaded, transformed, and examined on the application side.

The protocol for loading updates into the update buffer requires some consideration. Again, a possibility presents itself to process each data source transaction on the fly, but the expected performance drop in throughput would be less than ideal. If possible, a cleaner solution could involve scanning the transaction logs of the database at regular intervals and processing them in bulk. However, deferring the work in that way does not come without its own problems. Changes to user data definitions in the period between database transactions and when updates are logged would affect the accuracy of the update buffer logs. For example, if an application user checks out the most recent copy of data before updates have been logged
    last name  first name  team         salary  college
 1  duncan     tim         san antonio  10.36   wake forest
 2  wiggins    andrew      cleveland    4.59    kansas

Table 4.5: The table of updated tuples in the buffer after the Update Manager has processed the fresh transactions on the NBA database. The table schema fits the original user data definition.

the update buffer would become inconsistent. One possible solution to this would be to associate timestamps with all user application actions and the update buffer.

The role of the update manager, as seen in Figure 4.3, is to process all the database transactions, and manage the update buffers on a per tuple level. Its input is received either after every transaction is committed or in bulk from scanning the transaction logs. For every affected tuple, the update manager must now consult the user data definition store to determine all the user application update buffers it belongs to. For each user data definition that a tuple is included in, it is now added to the corresponding update buffer in the same format as the resulting table of the original query. In our example, the changes made to the NBA database are processed by the update manager and a table of modified tuples, akin to Table 4.5, is added to the update buffer.

Every application user is free to fetch available updates for his data copy at his own discretion. Once a data update is initialized, the contents of the update buffer are transferred to the application user and the buffer is cleared. The new data is at first processed by the update consolidation module, on the application side, whose job it is to conform it, applying any up-to-date schema mappings necessary for data exchange, as well as resolving any conflicting updates. Consulting Figure 4.3, we see how the updates are
    name            team         salary  before nba   before type
 1  tim duncan      san antonio  10.36   wake forest  null
 2  andrew wiggins  cleveland    4.59    kansas       null

Table 4.6: The available data updates as presented to the application user, after user mappings have been applied. Note how data in the last column has been lost.

now subjected to all general user mappings that have been prescribed to incoming data, resulting in a data instance conforming fully to the current state of the original data. These tuples are now presented to the application user to decide whether they should be merged with his current data copy. Table 4.6 demonstrates how the updates are represented at this time in the case of our running example.

To help the application user determine which of the available updates to keep and which to discard, it would be helpful if all tuples clearly indicated whether they constitute new data, deleted data or altered data. Once the changes to be applied have been chosen, the remaining data essentially enters another round in the transformation phase as described in the previous section. Whether manual or with the help of some data cleaning software, new tuples especially must undergo an inspection to correct for possible data quality issues. For our purposes, the only change made during this phase would be to replace the null in Andrew Wiggins' tuple with the class college. Generally, if any issues are discovered, the application user makes the appropriate transformations and they get logged as before, prompting a new round in our system's propagation carousel.

At this point, the data updates need to be merged with the current data instance. This is easy in the case of new data tuples; they are simply appended to the table. A harder problem comes with deleted tuples, since the matching tuples have to be found and removed. An even harder problem still is the case of updated tuples.
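For the updated-tuple case, the essential requirement is that source values replace their stale local counterparts while application-side additions survive. A minimal sketch, with a hypothetical helper and example rows based on the running Tim Duncan example:

```python
def merge_update(local_row, source_row):
    # Take the source's values for the columns it carries, but keep
    # application-side columns (e.g. before_type) the source has never seen.
    merged = dict(local_row)
    merged.update(source_row)
    return merged

local = {"name": "tim duncan", "salary": "1036", "before_type": "college"}
incoming = {"name": "tim duncan", "salary": "10.36"}
merged = merge_update(local, incoming)
```

Real conflict resolution would of course need more than a blind dictionary merge; deciding whether a locally edited cell should yield to the source value is exactly the policy question the update consolidation module must answer.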
The matching tuples must not only be found but merged and consolidated with the updates arriving from the source. Care must be taken to interchange the data values that constitute the update, while at the same time retaining earlier transformations. In some cases, data might even have been lost, like we see in the Tim Duncan tuple of our example. Because the value of the before type column was added on the application side and not propagated back, the source has never come in touch with that value before. For our system to be effective, this merging of updates with existing data in the tabular world of the application side must be solved.

The motivating case study for this scenario in Chapter 2 touched on a number of problems and loose ends that our simple example here doesn't encounter specifically. We mentioned the analogy that exists for this task with the maintenance of materialized views in data management systems. The important difference of not insisting on full synchronization, in conjunction with the loss of a strict relational schema on the application side, makes our problem a different animal. Similar issues still have to be addressed. For one, we have not formulated any update strategy for either party in our system, and leave that for future work. An even harder problem not addressed by our example is when user data definitions contain any form of aggregation. This introduces a flurry of issues that need to be resolved in an implementation of a system inspired by our design.

4.3 Scenario 3

The basic operation of our system can mostly be demonstrated by following the example of one data application working with a data copy from one source. Some important parts of the system and their associated challenges only come to light when a more complicated scenario is considered. Increasing the number of data applications in the system doesn't change much, and has mostly been addressed in describing the system architecture at the source.
However, when the number of data sources is increased from one, complications stemming from data integration must be solved. We will now describe the functional operation of that scenario. Just like the corresponding scenario in Chapter 2, we will also drop the assumption of full local storage on the application side, now only leaving space for necessary metadata.

For this data integration scenario, we assume that all data sources involved are related through some already established universal schema. If this is not the case, with disparate sources that only make sense in the context of the application, the work of constructing the universal schema is laid on the application user's shoulders. For our purposes, we shall only consider the case where such a schema exists and represents the only interface the user has with the sources as a whole. The initial request for data is now made by the user through this interface, and a user data definition is stored like before, only now it is made to the universal schema and is maintained on behalf of all data sources at once.

Like in any data integration scenario, the relation between the multiple sources and the universal schema is stored as a set of schema mappings for each source. This approach is included in our system, but its data integration role should be contrasted with the schema mappings maintained for data exchange between the source and application in our earlier example. Since none of the actual data is now stored locally on the application side, structural changes by the user can now be included as general user mappings whenever data is loaded from the sources. If schema mappings at this second stage become necessary, they should be defined so as to express the relationship between the application data and the universal schema.

Limiting local storage means that all of the actual data copy is fetched from its sources every time an application is run.
The only things to reside on the application side are definitions of user mappings and schema mappings, data provenance, transformation scripts, and other necessary metadata to ensure seamless operation. This storage design has both advantages and new challenges that must be faced. On the one hand, updates and new data coming from the sources need not be maintained at the source and propagated in a special process as in our earlier scenario. Fresh data and updates in the data integration scenario now arrive naturally with already seen data. However, we contend that our system should still keep track of updates at the sources for each user application. This is so that every time the data is fetched, an application owner can be presented with a segmented section of unseen updates and new data like before. Given that view of his data, the user can now inspect it for data quality issues more easily and prescribe appropriate transformations.

Figure 4.4: The data integration scenario. The lack of local storage on the application side complicates schema mappings and update consolidation. The forward propagation from sources is relatively easier, but user mappings must be applied at runtime. Step 2 represents the data integration system, replacing the single data source of Scenarios 1 and 2.

The tradeoff for this new convenience is that a user application can in no way manipulate data in a persistent manner. This is especially problematic since this scenario is fairly common in practice and, without a system like we propose, the options to make corrections are slim to none. We get around this caveat by keeping transformation scripts and user mappings as before, and applying them to the whole bulk of data every time the application needs it.
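Operationally, the load path in this scenario might look like the following sketch, where the stored user mappings and transformation script are simply replayed, in order, over the freshly fetched data on every run. All names here are hypothetical; each step is modeled as a pure function over the table.

```python
def rename(table, old, new):
    # A stored user mapping: rename a column across the whole table.
    return [{(new if k == old else k): v for k, v in row.items()}
            for row in table]

def load_application_data(fetch, steps):
    # fetch(): pulls the full data copy from the integrated sources.
    # steps: stored user mappings and transformation-script operations,
    # reapplied on every load since nothing persists locally.
    table = fetch()
    for step in steps:
        table = step(table)
    return table

rows = load_application_data(
    fetch=lambda: [{"name": "andrew wiggins", "college": "kansas"}],
    steps=[
        lambda t: rename(t, "college", "before_nba"),
        lambda t: [dict(r, before_type=None) for r in t],
    ],
)
```

Modeling every stored mapping as a replayable function is what makes persistence unnecessary: the "state" of the application's data is just the source data plus the ordered script.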
Before, user mappings were thought to include mostly generic transformations that apply equally to all new incoming data. Now we need user mappings to be more specific and able to apply all previous user defined changes.

With the problem of persistent transformations taken care of, we have addressed the main issue with the lack of local storage. Still left for discussion are complications on the application side introduced by the addition of multiple data sources. This is where the update consolidation module shoulders much of the work. Its main task is to resolve conflicting updates that might be coming from different sources, and in doing so, making sure that any updates that involve the same row of the application data play nicely together. This would be even more problematic if integrated data were kept in local storage, but in our scenario care must still be taken so that user defined transformations and mappings are not negatively affected by changes made at some data sources.

The slightly more involved system schema needed to handle this scenario can be seen in Figure 4.4. Since our interest here is still confined to only a single data application, the backward propagation of updates to the sources has not changed. Hence, this will largely follow the pipeline from Figure 4.2, with the only change being the addition of a data integration system, consisting of one or more data sources.

4.4 Scenario 4

The careful reader will remember that our case studies in Chapter 2 included a fourth scenario. In that scenario, the task was to infer unknown changes, made to a data copy, when no transformation script had been recorded. As we discussed in Chapter 2, such a task is quite intricate and out of the scope of our preliminary system design.
Since our design does not have a provision to solve this problem, we will not include a section on it here, and leave it as an important topic for future research.

Chapter 5

Related Work

5.1 Collaborative Data Sharing Systems

Sharing data across interconnected communities is a difficult problem and increasingly visible in data integration research. Building on earlier work on peer data management systems [29] (PDMS), the paradigm of collaborative data sharing systems [25, 36, 37, 40] addresses the current challenges of scientific data sharing. As part of the CDSS project, its authors have developed the Orchestra system as a prototype.

As described by Green et al. [25], a CDSS is a data integration system that consists of a set of peer databases, each with its own independent schema and local data instance. Peers are connected to each other by compositional schema mappings that explain how their instances are derived from each other and differ. This architecture, originally devised for a PDMS, alleviates the need for a universal schema common in data integration and thus facilitates schema evolution. The CDSS paradigm differs from a PDMS by materializing all data locally at the peers. Users of a database only make local queries, and only a database administrator, the curator, is concerned with integrating data from other sources. Local materialization allows the peer administrator to cherry-pick data and updates from other parts of the system based on trust conditions.

The CDSS supports local edits by peers to facilitate data disagreements within the community. To implement this, while still ensuring schema mapping consistency, deleted tuples are stored in a rejection table and newly inserted tuples in local contribution tables. If a local curator discovers mistakes in the data, corrections can be propagated back to original data sources through bidirectional mappings.
When a peer administrator requests an update exchange from the wider system, the peer's local edit log is published globally. An incremental update translation is then performed using all other globally available edit logs.

The premise of sharing data between peers in a CDSS is quite analogous to the interaction model between users that we propose with our system. A key difference is that a CDSS assumes that all data is stored locally. Our approach is more lightweight, with data generally stored at designated sources and logical copies of it used by data applications. Other differences are that in a CDSS the peers are homogeneous in their functionality, and they are all relational databases and potential sources for other peers. Our system has a collection of data sources on one side and a set of data applications on the other. These two types of “peers” in our system are technically different and must be approached as such. In addition, a data application is likely solving a different problem than a data source, thus having a contrasting perspective on the data while still having the goal of increasing its overall quality.

Two ideas are directly applicable from the CDSS design to our system. One is the way peer changes are stored in local edit tables and only shared globally when requested by other peers. Our transformation script store could follow this design. The other key contribution is the mechanism for incremental update exchange, including schema mapping handling, that addresses a number of problems in upstream and downstream propagation, and can be efficiently implemented with the use of provenance information.

5.2 Data Transformation Systems

The value in the system we envision is mostly created in the data transformation phase, after an application user has successfully checked out or loaded a copy of his data from the sources.
This is where the important data cleaning happens, as well as all structural manipulations that serve the application. After the transformation phase, our whole system is only in the business of making sure these changes are maintained and made available to other users.

The need to transform and clean data before using it for analysis or as application input is ubiquitous in an increasingly data centric industry, and a real effort has been made to meet the needs for effective solutions to the problem. From a large body of research on transformation languages [1, 18, 38, 44, 52], we will highlight two publications that set the foundation for a project we see as the most likely solution so far.

To improve the interoperability of heterogeneous relational databases with regards to querying and data exchange, Lakshmanan et al. [44] constructed the SchemaSQL transformation language. A natural extension to SQL, SchemaSQL supports the restructuring of databases to fit different schemas, partly by treating schema information and other metadata in the same way as the actual data.

In an effort to integrate exploratory data cleaning efforts and transformation scripting into a graphical interface, Raman and Hellerstein introduced Potter's Wheel [52]. They implement a set of common transformations for both values and rows, and prove that this collection can cover all possible one-to-many row mappings.

A recent project, directly related to Potter's Wheel, vastly improves on its predecessor. The first result of the project was Wrangler [38], a system where users can build transformation scripts in a spreadsheet-like interface to manipulate data, or “wrangle” it as the authors call it.
Intended to reduce the need for manual editing and individual scripts, Wrangler lets users specify robust transformations that result in an editable and reusable script. The authors define a declarative transformation language that extends the language from Potter's Wheel, and make the claim that it allows for near complete coverage of all practical data wrangling transformations. In a continuation of the project, the team introduced Profiler [39], a data cleaning system that combines automatic anomaly detection with a visual summarization feature. The Wrangler project and subsequent work constitutes the most complete solution we know of for an implementation of the data transformation phase of our system. The Wrangler system might therefore even be made to serve as a module in our system, that is, the transformation module that lies between data sources and applications. As an alternative to the Wrangler transformation system, we can briefly mention Google Refine [33]. Like Wrangler, Google Refine takes tabular data as input, but its graphical command capabilities are more limited. Google Refine does however include some useful data cleaning features, such as entity resolution and discrepancy detection.

A slightly different approach is taken by Stonebraker et al., in an architecture design for what they call a data curation system [53]. Their main goal is data integration, but in doing so they address issues with data cleaning and transformations. Although sharing the emphasis on non-programmatic interfaces with the Wrangler project, the authors argue that scalability can only be reached through automation with the help of machine learning. This line of reasoning is promising but still at a very early stage in research.
Our system assumes mostly manual data edits and does not yet include a focus on automation.

5.3 Data Provenance

For the past two decades, the topic of data provenance and lineage has garnered growing attention from groups in the data management research community. Simply put, provenance is any information about where data originated, how it has been transformed since then, and why it appears in its current context and place [4]. A good survey of early provenance research is given by Buneman et al. [12]. Buneman has continued where he left off and summarized more recent efforts [10], stressing the relevance of provenance and calling for increased attention to the issue. Despite being increasingly important for scientific research and other practical implications, provenance ideas are still at a very early implementation stage in enterprise solutions [16].

This work of Buneman is rehashed at length by Cheney et al. in a recent survey [19]. It expands on the previous survey by covering important new topics, most notably the notion of how-provenance. Originating with the Orchestra project [36], the idea of how-provenance as put forth by Green et al. [25] aims to explain what transformations a piece of data has been subjected to. The authors do this by proposing a new general model for provenance, one based on a framework of semiring-annotated relations, and demonstrating how this model encompasses previously proposed models as special cases while encoding more information. The work by Green et al. [25] also demonstrates the flexibility of provenance information by using it as a basis for a system of trust policy enforcement between data sources, as well as a means of efficiently implementing update propagation. Both ideas could prove fruitful in our system, with the latter demanding serious consideration in any implementation.
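To give a flavour of the semiring idea, here is a toy sketch, not Green et al.'s actual formalism, in which each tuple carries a symbolic annotation: union combines alternative derivations with "+", while join combines the annotations of contributing tuples with "*".

```python
# Relations as dicts: tuple -> provenance annotation (a plain string here).

def union(r1, r2):
    # A tuple derivable from either input gets an additive annotation.
    out = dict(r1)
    for t, why in r2.items():
        out[t] = f"({out[t]} + {why})" if t in out else why
    return out

def join(r1, r2):
    # Join on the first attribute; contributing annotations multiply.
    out = {}
    for t1, w1 in r1.items():
        for t2, w2 in r2.items():
            if t1[0] == t2[0]:
                out[t1 + t2[1:]] = f"({w1} * {w2})"
    return out

players = {("duncan", "san antonio"): "p1"}
salaries = {("duncan", "10.36"): "s1"}
joined = join(players, salaries)
both = union({("duncan",): "a"}, {("duncan",): "b"})
```

Reading the resulting annotations back tells us exactly how a tuple was derived, which is the property that makes how-provenance usable for selective update propagation and trust decisions.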
Being a system of update propagation, it is clear that how-provenance should be a first class citizen in our system's metadata management.

One application that is sometimes highlighted in the literature is the capability of querying provenance [16, 25]. This notion is implemented in the Orchestra system to help end-users and data curators explore the often complicated provenance graphs. Karvounarakis et al. [41] developed the ProQL query language for provenance to support this implementation. The ability to query provenance information directly should be readily applicable in our system, notably for decision support in components like the update consolidation store.

By definition, as soon as any provenance information is maintained, data quality has been improved. Aside from that, provenance as a tool can help solve a flurry of problems ranging from data integration and query languages to debugging schema designs and software. In most cases, the task at hand will dictate the need for one of many different provenance models. They differ fundamentally in the granularity of provenance chosen, with the spectrum covering task-based workflow at one end down to a single value level at the other. Buneman and Davidson [11] give an overview of provenance models for different tasks while arguing the need for model unification in future research.

Early implementation attempts stored provenance information as metadata, quickly requiring a vast amount of storage for fine-grained models with many transformations. Provenance storage has been shown in practice to grow larger than the base data, sometimes by many factors [16]. Woodruff and Stonebraker suggested an early remedy [57] to this problem, by computing provenance lazily instead of using complete storage and only having knowledge of minor details of transformations undergone by the data. A more recent effort by Chapman et al.
[16] to solve this problem has resulted in a number of algorithms, based on ideas of factorization and inheritance, that managed to reduce provenance storage by a factor of 20 in experiments. Bose and Frew have written a survey [9] on provenance in scientific workflow systems where many lineage system implementations are chronicled.

In the context of our system design, it is clear that provenance information will play a key role. It should be incorporated from the ground up and serve as the main artifact of what data sources a data tuple belongs to and who maintains a copy of it. To serve this purpose, we contend that a model must be chosen that works at least at the tuple level of granularity. In addition, it is also feasible that how-provenance be maintained that encodes past transformations at the same level. We recognize that finding an efficient provenance model and implementation is a major challenge that remains unsolved in the wider propagation system.

5.4 Data Exchange

Data exchange is a problem fundamental to many data management tasks, most notably data integration. The problem is defined as taking data structured under a source schema, and translating it to accurately fit some other pre-existing target schema. The restructured data should be the best possible representation of the data under the original schema. An overview of the theory underlying data exchange is given by Fagin et al. [24]. The translation from one schema to the other is achieved by following so-called source-to-target dependencies, expressed in some logical formalism. The source-to-target dependencies encode how the schemas relate and what constraints the structure of the data instances must follow. They are the rules to the game and must be maintained as the schemas change.

The foundation of data exchange systems draws on elements from research on data integration [28, 45, 46].
Data integration relates data sources and their schemas to a global schema that queries can be posed against. Data exchange can be modeled as data integration from one source to the global target schema, or as integration in both directions. A difference between the two is that a target schema typically exists beforehand in data exchange, whereas it is specially constructed for data integration. The target schema in data integration is also most often virtual, but the data instances are physically materialized in data exchange.

Data integration and data exchange are the two main tasks in what has been dubbed data interoperability [27]. Underlying both is the need to express relationships between different schemas. This can be done at two contextually different levels. In schema matching [6, 49], syntactic correspondences are generated that establish relationships between elements of one schema and elements of another schema. A more operational level is the one of schema mappings [27, 31, 47], where the focus is on moving instances of data from a source to a target. The source-to-target dependencies used in data exchange are a form of schema mappings. Fagin et al. [24] designate the semantics of data exchange on one hand and query answering on the other as the two main challenges. For our purposes, we are almost solely interested in the former, namely moving data from source to target.

The first system to generate schema mappings in a semi-automatic manner between two relational databases was the Clio system [24, 27, 47]. Clio takes schema matching correspondences as input, and generates source-to-target dependencies as their interpretation. The schema mappings are then fed into a query generation tool that produces an executable transformation script in some language, SQL for example.

In our system, the data application users would either load data from sources in a data integration scenario, or check out their own copy for local storage in a data exchange setting.
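Operationally, a single source-to-target dependency can be read as a rule that populates a target relation from source tuples. A toy sketch in that spirit, with hypothetical relation and attribute names and nothing like the generality of what Clio produces:

```python
def apply_mapping(source_rows):
    # One source-to-target dependency, read as: for every source tuple
    # nba(last, first, team, salary), there is a target tuple
    # reporter_view(name, salary) with name = first + " " + last.
    return [
        {"name": f"{r['first']} {r['last']}", "salary": r["salary"]}
        for r in source_rows
    ]

target = apply_mapping(
    [{"last": "duncan", "first": "tim",
      "team": "san antonio", "salary": "10.36"}]
)
```

A mapping tool's job is essentially to derive such rules from schema matching correspondences and then compile them into an executable script, which is the black-box behaviour we would want from a Clio-like component.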
An implementation to solve either scenario would require schema mappings to be generated and maintained. For data exchange we would need mappings established between a source schema and a target schema. In the data integration scenario, where one data application loads data from multiple sources, we would require mappings from every source to a single virtual universal schema. If, however, an application only loads data from one source without local storage, it is possible that schema changes can be inferred from user mappings, without knowing exact schema mappings. A system like Clio seems to be a good fit as a tool to generate schema mappings in our system. Its functionality as a black box is what we are looking for, but an added complexity in our case is that schemas on the application side are not necessarily relational. However, Clio was designed as a practical tool for industry from the start, and besides relational schemas it can now handle XML schemas [27]. It remains to be examined what constraints must be enforced on data application schemas in our system so that a schema mapping tool like Clio can be used in the context we need.

Manipulating schemas of data sources and the mappings that relate them are common operations in data integration systems. To define and investigate a general set of operators for these tasks, in different contexts and data models, is the goal of Model Management [5, 7]. Aside from all operators that help with the data exchange and integration parts of our system, special attention should be paid to operators devised for schema differencing. They could bring possible value to the table in situations where schema mappings are incomplete or have been lost. If a transformation script is not in place, schema differencing could help shed light on structural changes that a data application user has made.

5.5 Probabilistic Databases

Interest in probabilistic databases resurged with a publication by Dalvi and Suciu [20].
Drawing their idea from the way queries are defined in information retrieval systems [3], they wanted to address the problem of answering structurally rich SQL queries with possibly uncertain predicates, and ranking the results. Uncertain predicates lead to uncertain matches, so a probability value must be assigned to every tuple in the database. The problem at hand is now one of evaluating queries over database tables with uncertainty, and the authors devise efficient algorithms to do just so.

Research on probabilistic databases is expanded on, and integrated with ideas about provenance, in the Trio project [4, 56]. They present a new data model, dubbed ULDB, that treats uncertainty and lineage as first-class citizens in a database management system. Uncertainty is encoded in the system at both the attribute and the tuple level, representing confidence levels of all possible data instances. The relationship between different tuple-alternatives and how they are derived is then stored as lineage. The end product is the Trio prototype system that sits as a layer on top of a traditional, relational DBMS. Queries in the system are posed with a strict extension of SQL that the authors describe in the same work.

The idea of combining two different hot topics of research into one system, where they can complement each other in an effort to solve some classic problems of data management, is the key reason for our interest in the Trio system. Like Trio, which combines provenance and uncertainty, our goal is to design a composite system that combines data exchange and integration with transformations and lineage. Differences between the two systems are that ours does not yet involve ideas on uncertain data, and is at a very early stage in its conception.
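The kind of computation such systems perform can be illustrated with a toy example. Under the tuple-independence assumption, a query answer derived by joining two tuples holds with the product of their probabilities, and independent alternative derivations combine by inclusion-exclusion. The relations and the query below are our own invented illustration, not an example from the cited works:

```python
# Tuple-independent probabilistic tables: each tuple carries a probability.
person = [("ada", "paris", 0.9), ("bob", "oslo", 0.5)]
city   = [("paris", "france", 0.8), ("oslo", "norway", 1.0)]

def query_prob(name):
    """P(some country exists for `name`) under tuple independence:
    each join result holds with p1 * p2, and independent alternative
    derivations combine as 1 - prod(1 - p)."""
    p_none = 1.0
    for n, c, p1 in person:
        if n != name:
            continue
        for c2, _, p2 in city:
            if c2 == c:
                p_none *= 1.0 - p1 * p2
    return 1.0 - p_none

print(round(query_prob("ada"), 2))  # 0.72, i.e. 0.9 * 0.8
```

The hard part, which Dalvi and Suciu address, is doing this efficiently inside the query plan rather than by enumerating derivations as this sketch does.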
The takeaway is the direction of combining interesting topics into composite systems that can be used for solving practical problems.

5.6 Data Cleaning

Data sets are very often plagued by various issues, anomalies, and errors of all different kinds. These errors often originate at the source, whether that source is some scientific sensor equipment or simply an input error made by a human. Common sources of anomalies found in practice are the inconsistencies that can be introduced by data integration. Efforts to enumerate common data errors have been widely published [21, 23, 30, 42, 50]. Among many problems that have been addressed, techniques have been introduced to detect outliers [15, 32], duplicate records [23], and key violations [30]. Reducing redundancy in databases by entity resolution [8] is another important topic.

The onus to create better data cleaning systems is great. Studies have estimated the cost of correcting data cleaning issues in large data projects to hover around 80% of total project cost [21], and the cost of flawed analysis due to errors in data runs in billions of dollars per year [22]. Although current solutions have made huge strides, most of them are still semi-automatic in the sense that they are designed to flag potential issues. These issues must then be checked by a human and corrected by manipulating the appropriate data, possibly covering some unflagged tuples when dealing with derived data.

Many systems have been implemented to solve different tasks in data cleaning [27, 35, 38, 39, 52]. The need to solve different tasks is dictated in part by the flavour of data management project the data cleaning effort is a part of. Many systems, often focusing on duplicate detection, are purposely built to help with data integration. They will certainly be needed in our context for just that task, but of even greater interest is the direction taken by the Profiler system [39].
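Semi-automatic tools of this kind flag suspect values rather than fix them. A minimal sketch of such single-column flagging, using the robust median/MAD rule for extreme values, can look as follows; the column data and the rule's cutoff are illustrative assumptions, not any particular tool's behaviour:

```python
import statistics

def flag_column(values, cutoff=3.5):
    """Flag missing entries and extreme values in a single column using
    the robust median/MAD rule. Returns (row, issue) pairs for a human
    reviewer -- the tool flags issues, it does not fix them."""
    issues = [(i, "missing") for i, v in enumerate(values) if v is None]
    nums = [v for v in values if v is not None]
    med = statistics.median(nums)
    mad = statistics.median(abs(v - med) for v in nums)
    if mad > 0:
        issues += [(i, "extreme value") for i, v in enumerate(values)
                   if v is not None and 0.6745 * abs(v - med) / mad > cutoff]
    return issues

ages = [34, 29, None, 31, 4200, 27]      # 4200 is a likely input error
print(flag_column(ages))                 # [(2, 'missing'), (4, 'extreme value')]
```

The median/MAD rule is used here instead of a plain z-score because a single gross error inflates the standard deviation enough to mask itself in small samples.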
With a main emphasis on issues occurring within a single relational table, Profiler flags anomalies that include missing and erroneous data, inconsistent or extreme values, and key violations. These are the same issues we expect our application user to be dealing with in the data cleaning phase of our system. The seamless integration of Profiler with the Wrangler transformation system [38]—the two having been developed together—makes the two an interesting possibility as modules in the wider system we imagine.

5.7 Extraction, Transformation, and Loading

As the role of data management systems increasingly shifted to meet the need for decision support, data warehouses were invented. Data warehouses [17] and on-line analytical processing (OLAP) gained popularity in the 90s and have only grown in importance since. A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.” [34] As such, it emphasizes summarized historical data over the transactional variation of regular databases, therefore facilitating temporal or categorical analysis.

Data warehouses are typically populated by data from one or more on-line transactional databases, the operational databases that serve on the frontlines of data retrieval and storage. The process of moving data from operational data sources, changing it appropriately and using it to populate a data warehouse is called extraction, transformation, and loading (ETL) [55]. The two biggest challenges faced in the ETL process are the integration of data from multiple sources on one hand and performing data cleaning and all necessary transformations on the other.

Extraction is the task of defining the subset of data in a source that is to be imported to the warehouse and querying for it in bulk.
This must be done with the least amount of interference in the standard operation of the source, and is therefore often performed at night. Comparing snapshots of data to pinpoint incremental updates is a way to lessen the load of the process [43]. The transformation step is where the data can undergo the range of changes needed for the particular warehouse operation. These include data cleaning, solving data integration issues, and other transformations at schema-, instance-, and value-levels that we have described in detail above. The changes also include ones that are uniquely common in data warehouses, such as replacing keys with surrogate keys to improve performance and adding redundancy by pivoting tables and attributes. The last step of the ETL process is loading the data into an existing warehouse. This usually involves a bulk operation and must consider both the merging of existing data with updates as well as maintaining all materialized views and indexes of the warehouse.

The ETL pipeline has many parallels with moving data back and forth in our system. When a user application imports data from one or more sources to be kept in local storage, that is a complete ETL process. Important differences still remain. The first one is our data integration scenario where the user application never keeps a copy of data in its storage. Another difference is that the ETL process only works in one direction. Data warehouses and their views are designed to be end products; their data is rarely updated and doesn't send feedback to data sources. Furthermore, data warehouses are generally large and expensive enterprises. They are costly to design and maintain and are usually deployed for analysis in larger data projects.
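The snapshot-comparison approach to incremental extraction mentioned above [43] can be sketched in a few lines. Representing each snapshot as a dictionary keyed by the source's key attribute is our own simplification; the algorithms in [43] work on bulk files without assuming such an index:

```python
def snapshot_diff(old, new):
    """Compare two keyed snapshots of a source table and return the
    incremental changes needed to move from old to new."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = sorted(old.keys() - new.keys())
    updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates

old = {1: ("ada", 100), 2: ("bob", 90)}
new = {1: ("ada", 110), 3: ("eve", 70)}
print(snapshot_diff(old, new))
# ({3: ('eve', 70)}, [2], {1: ('ada', 110)})
```

Only the returned delta, rather than the full snapshot, then needs to be shipped through the transformation and loading steps.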
Our system is designed to handle smaller, lighter data application projects and should be general enough that a difficult setup process is not needed.

5.8 View Maintenance

A materialized view is the stored result of a query on some database that is either kept remotely or locally in a redundant fashion. This is different from logical views, whose virtual tables are never actually stored. The purpose of materializing views is usually to increase query performance. When large tables need to be queried at high cost, it can be useful to instead have access to a pre-computed table requiring less work. This is particularly useful in speeding up analytical queries on data warehouses [54].

In addition to needing additional storage, the largest overhead of materialized views comes with maintaining them. Whenever the underlying source data changes, the materialized view must be updated accordingly to perfectly reflect its original query. This is called view maintenance. One approach would be to recompute the whole view, but that wouldn't scale too well. A better solution would be to opt for incremental view maintenance. Surveys can be found on both materialized views [26] and view maintenance algorithms [2].

In Chapter 2, we mentioned how the relationship between a data source and an application's local data copy was akin to the one between a database and a materialized view. Propagating updates from the source would then be analogous to maintaining a materialized view. A key difference is that in our case, we do not strive for full synchronization. Nevertheless, the incremental update techniques of view maintenance are likely to be of value to our system.

Chapter 6

Conclusions and Future Work

In this thesis we proposed a novel framework for update propagation in collaborative data sharing systems, and suggested a high level system architecture as a first step in implementing that framework.
As data gathering efforts continue to expand and data applications become ever more important across society, we are motivated by a strong need for greater sophistication in data management systems.

We classified three use cases that generalize to countless situations that should ultimately be covered by a system following our design. Composed of a number of separate modules, each carrying the load of an important function, the overall system draws from ideas and prior work that span a significant part of current data management research topics. We explained the role and importance of each system function in the context of its relevant prior work, and mapped out requirement guidelines and possible issues for future implementation efforts.

The high level system overview we described here only represents the early stages of our group's focus on update propagation, and we can identify a number of challenges that must be overcome for the system actually to be realized in some form. A particularly sensitive assumption in our work is the need to somehow force application users to only perform meaningful transformations on the data. This could be enforced through a specialized version of spreadsheet software, and has been done by others as we have seen, but a much preferred option would still be an integration with standard spreadsheet software while somehow limiting the user's freedom to make random changes that are difficult to parse programmatically.

Another major challenge that we have not discussed at length is the added complexity of propagating updates between players when an application user has checked out a data copy through an aggregate query. These by definition make a single row of application data dependent on any number of data rows at the source. This can lead to difficult problems in update propagation on both sides of the table.
When propagating to the source, a syntactic meaning of changing the aggregate value must be interpreted on the original rows, and on the flip side, value changes or additions/deletions of data rows at the source must be reflected in recomputed aggregate values at all relevant applications.

The fact that we assume all data stored locally at the application side is only in a structured relational format, and not in a full-fledged database table, needs to be considered closely before implementation and could cause some unforeseen difficulties. As an example, we have discussed the need for incoming data updates from the source to identify with the correct rows in the application data copy for consolidation. We mentioned this in passing, but in situations where these rows are now only recognized by a row number this matching could be quite difficult. It is also easy to see how careless and inconsistent management of such row numbers can lead to numerous issues in the system operation over time.

A somewhat related issue ties in with how the system should segment between general ETL transformations and more tuple-specific data cleaning transformations. We defined the former as any changes that are generally applicable to all incoming new or reloaded data on the application side. We suspect that the boundary between the two could at times become somewhat fuzzy and this would have to be addressed in the system. In the data integration scenario, when data is fetched from the sources at application runtime, the correct matching between all prior user mappings and the incoming data needs to be established. Implementing the functionality is likely to suffer from this concern.
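The consolidation behaviour discussed above can be sketched as follows. The (row_id, column, value) update format and the per-source batches are assumptions of ours for illustration; agreeing updates merge silently, while disagreeing writes to the same cell are surfaced as conflicts for a policy or a human to resolve:

```python
def consolidate(update_batches):
    """Merge per-source update batches of (row_id, column, value) tuples.
    Agreeing updates merge; conflicting writes to the same cell are
    collected separately instead of being silently overwritten."""
    merged, conflicts = {}, []
    for source, batch in update_batches.items():
        for row_id, col, val in batch:
            cell = (row_id, col)
            if cell in merged and merged[cell][0] != val:
                conflicts.append((cell, merged[cell], (val, source)))
            else:
                merged[cell] = (val, source)
    return merged, conflicts

batches = {
    "source_a": [(7, "price", 9.99), (8, "name", "widget")],
    "source_b": [(7, "price", 8.99)],       # disagrees with source_a
}
merged, conflicts = consolidate(batches)
print(conflicts)
# [((7, 'price'), (9.99, 'source_a'), (8.99, 'source_b'))]
```

A real module would additionally translate each incoming update through the relevant schema mapping before keying it against the application copy.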
Although schema mapping storage between two related schemas is common in data integration, arbitrary changes on either side in our system would call for incremental updates to schema mappings. Since such changes could potentially be triggered by different modules of the system, and affect mappings between more than one schema, this could prove to be a tough task. The second module whose operation needs more consideration is the one handling update consolidation. This module should adjust all incoming data from sources to the correct schema and form of the application data. In addition, we want this phase to consolidate any conflicting updates—updates that might arrive from different sources in a data integration scenario and affect the same data row. Presumably, merging updates should not be a big problem but design choices need to be made about how the system should handle conflicting updates to the same data values in a particular row.

The last leap of faith we mention that will prove challenging forms one of the key steps in update generation. Given a transaction log of all user defined changes, one suggested way of propagating updates involved generating an executable SQL script, in the source schema, and presenting it to its administrator. First of all, interpreting the transformation script into SQL is no easy task, but great care must also be taken that errors are not introduced in this phase since the purpose of its output is to alter the original source data. Although it is a logical and necessary step in one of the update propagation schemes, it remains to be seen whether this module will be a feasible option in implementation.

As our first effort in this direction we feel we have laid out a vision and necessary groundwork for the continuation of this project. Much work is still needed before a working prototype of our system can be achieved.
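The log-to-SQL step just described can be illustrated with a toy generator. The log format here, (table, key column, key, column, new value) per cell edit, is a deliberately simplified assumption; real transformation logs would record far richer operations, and a production generator would use parameterized statements rather than string literals:

```python
def log_to_sql(log):
    """Translate a log of user cell edits into an SQL script that the
    source administrator can inspect before applying."""
    def lit(v):
        # Quote strings (doubling embedded quotes); pass numbers through.
        return "'" + str(v).replace("'", "''") + "'" if isinstance(v, str) else str(v)
    stmts = []
    for table, key_col, key, col, new_val in log:
        stmts.append(f"UPDATE {table} SET {col} = {lit(new_val)} "
                     f"WHERE {key_col} = {lit(key)};")
    return "\n".join(stmts)

log = [("employees", "id", 7, "city", "Vancouver"),
       ("employees", "id", 9, "salary", 72000)]
print(log_to_sql(log))
# UPDATE employees SET city = 'Vancouver' WHERE id = 7;
# UPDATE employees SET salary = 72000 WHERE id = 9;
```

Even in this toy form the safety concern from the text is visible: the script mutates source data directly, so presenting it for review before execution is essential.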
Obviously, meeting the challenges drafted above will be a big factor in getting to that goal, but we can identify further steps that need to be taken and will now conclude this report by listing some possible milestones of future work.

A big step necessary for the realization of such a system is to develop a suitable provenance model. We have described how in other work provenance research has been molded in recent years to fit particular objectives, and how formalisms at different granularities have been used in applications. Many of the challenges we have described could be met by a well designed provenance model, as well as other questions that might arise. One would be deciding how the system should deal with rejected updates once they have been suggested. Should they be kept track of for a later recommendation, or should they be discarded? What about updates made later to values affected by earlier rejected updates? What if an update affecting more than one user is rejected by one and accepted by the other? Those are examples of questions that could affect choices about how data history should be kept in provenance records.

Another task at hand is to develop update strategies for different actors in our system. The means and ways of choosing when data is loaded, or when new updates are either fetched or pushed, can have large effects on the system operation and design. It is necessary to map out the subtle effect different update strategies might have. In doing so, it is also important to determine exactly what connection an implementation of our system would have with the notion of materialized views in database systems. We have discussed some preliminary similarities but just as obvious are important differences. Nonetheless, it is likely that ideas and algorithms from view maintenance research would be applicable to similar problems in our framework.

In Chapter 3, we described two different formalisms on propagating updates.
One involved generating an executable SQL script from the transformation logs and presenting it to the data source, while the other is an attempt to identify erroneous tuples in the data source and describe the issues it might have. Both methods should be examined in future work to determine which is more suitable to the objectives of our system.

In addition, we floated a couple of new features that would improve the usability of the end product, if implemented. The first would be functionality to rank the results of update propagation suggestions. This might be possible by imposing an order of importance on different types of updates or implementing varying trust conditions on different update sources from the receiver's perspective. Again, the feasibility of both would rely on the choice of provenance model. The second improvement we propose is the means of grouping updates into composite aggregations in parts of the system and describing them in human readable form. This would improve user comprehension, and could possibly be implemented both when data transformations are prescribed as well as before update suggestions are presented to a user.

The way we described the data sharing pipeline in earlier chapters would imply that a data application owner is able to examine and clean all his data prior to inputting it to the application. This should often be the case but we must still consider that certain data errors might not be discovered until runtime by the application user. To accommodate this fact it would be helpful to look into how our system could interface with the application to handle real-time feedback about data quality issues.

At last, we reiterate the unsolved problem of the transformation inference engine. Given data from a source and a modified copy of some subset of that data, is it possible to infer an executable script to transform the latter copy to the former?
This is obviously a big question to ask and would require much effort to answer fully. However, if possible, an inference engine like that would bypass the requirement of an existing transformation script before propagating valuable updates to other users. This would prove to be beneficial to many existing data sharing interactions since our system would not need to be in place right from the start.

The topics we have specifically mentioned here as possible next steps in no way represent a complete collection of future research avenues. Many still remain, and future challenges are bound to surface. Once further progress is made, we foresee a system that could benefit a great number of applications and would significantly reduce the cost of repeated, manual labour in many data processing pipelines.

Bibliography

[1] Serge Abiteboul, Sophie Cluet, Tova Milo, Pini Mogilevsky, Jerome Siméon, and Sagit Zohar. Tools for data translation and integration. IEEE Data Eng. Bull., 22(1):3–8, 1999.

[2] D. Agrawal, A. El Abbadi, A. Singh, and T. Yurek. Efficient view maintenance at data warehouses. SIGMOD Record, 26(2):417–427, June 1997.

[3] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.

[4] Omar Benjelloun, Anish Das Sarma, Alon Halevy, and Jennifer Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 953–964. VLDB Endowment, 2006.

[5] Philip A. Bernstein. Applying model management to classical meta data problems. In CIDR, pages 209–220, 2003.

[6] Philip A. Bernstein, Jayant Madhavan, and Erhard Rahm. Generic schema matching, ten years later. Proceedings of the VLDB Endowment, 4(11):695–701, 2011.

[7] Philip A. Bernstein and Sergey Melnik. Model management 2.0: manipulating richer mappings. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1–12. ACM, 2007.

[8] Indrajit Bhattacharya and Lise Getoor.
Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.

[9] Rajendra Bose and James Frew. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys (CSUR), 37(1):1–28, 2005.

[10] Peter Buneman. The providence of provenance. In Big Data, pages 7–12. Springer, 2013.

[11] Peter Buneman and Susan B. Davidson. Data provenance–the foundation of data quality, 2013.

[12] Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. Why and where: A characterization of data provenance. In Database Theory–ICDT 2001, pages 316–330. Springer, 2001.

[13] Peter Buneman, David Maier, and Jennifer Widom. Where was your data yesterday, and where will it go tomorrow? Data annotation and provenance for scientific applications. In Position paper for NSF Workshop on Information and Data Management (IDM 2000): Research Agenda into the Future, Chicago, IL, 2000.

[14] Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, pages 445–456, 2014.

[15] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[16] Adriane P. Chapman, Hosagrahar V. Jagadish, and Prakash Ramanan. Efficient provenance storage. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 993–1006. ACM, 2008.

[17] Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1):65–74, 1997.

[18] Weidong Chen, Michael Kifer, and David S. Warren. HiLog: A foundation for higher-order logic programming. The Journal of Logic Programming, 15(3):187–230, 1993.

[19] James Cheney, Laura Chiticariu, and Wang-Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.

[20] Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases.
The VLDB Journal, 16(4):523–544, 2007.

[21] Tamraparni Dasu and Theodore Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.

[22] Wayne W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute, 2002.

[23] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.

[24] Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data exchange: semantics and query answering. Theoretical Computer Science, 336(1):89–124, 2005.

[25] Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Provenance in ORCHESTRA. IEEE Data Engineering Bulletin, 33(3):9–16, 2010.

[26] Ashish Gupta and Iderpal Singh Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.

[27] Laura M. Haas, Mauricio A. Hernández, Howard Ho, Lucian Popa, and Mary Roth. Clio grows up: from research prototype to industrial tool. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 805–810, 2005.

[28] Alon Halevy, Anand Rajaraman, and Joann Ordille. Data integration: the teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 9–16. VLDB Endowment, 2006.

[29] Alon Y. Halevy, Zachary G. Ives, Dan Suciu, and Igor Tatarinov. Schema mediation in peer data management systems. In Proceedings of the 19th International Conference on Data Engineering, pages 505–516. IEEE, 2003.

[30] Joseph M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.

[31] Mauricio A. Hernández, Renée J. Miller, and Laura M. Haas. Clio: A semi-automatic tool for schema mapping. ACM SIGMOD Record, 30(2):607, 2001.

[32] Victoria J. Hodge and Jim Austin. A survey of outlier detection methodologies.
Artificial Intelligence Review, 22(2):85–126, 2004.

[33] David Huynh and Stefano Mazzocchi. Google Refine.

[34] William H. Inmon. Building the Data Warehouse. John Wiley & Sons, 2005.

[35] Zachary Ives, Craig Knoblock, Steve Minton, Marie Jacob, Partha Talukdar, Rattapoom Tuchinda, Jose Luis Ambite, Maria Muslea, and Cenk Gazen. Interactive data integration through smart copy & paste. In CoRR, 2009.

[36] Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, and Fernando Pereira. The ORCHESTRA collaborative data sharing system. ACM SIGMOD Record, 37(3):26–32, 2008.

[37] Zachary G. Ives, Nitin Khandelwal, Aneesh Kapur, and Murat Cakir. Orchestra: Rapid, collaborative sharing of dynamic data. In CIDR, pages 107–118, 2005.

[38] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: interactive visual specification of data transformation scripts. In CHI, pages 3363–3372, 2011.

[39] Sean Kandel, Ravi Parikh, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Profiler: integrated statistical analysis and visualization for data quality assessment. In AVI, pages 547–554, 2012.

[40] Grigoris Karvounarakis, Todd J. Green, Zachary G. Ives, and Val Tannen. Collaborative data sharing via update exchange and provenance. ACM Transactions on Database Systems (TODS), 38(3):19, 2013.

[41] Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Querying data provenance. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 951–962. ACM, 2010.

[42] Won Kim, Byoung-Ju Choi, Eui-Kyeong Hong, Soo-Kyung Kim, and Doheon Lee. A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1):81–99, 2003.

[43] Wilburt Labio and Hector Garcia-Molina. Efficient snapshot differential algorithms in data warehousing.
In Proceedings of the International Conference on Very Large Data Bases, pages 63–74, 1996.

[44] Laks V. S. Lakshmanan, Fereidoon Sadri, and Subbu N. Subramanian. SchemaSQL: An extension to SQL for multidatabase interoperability. ACM Transactions on Database Systems (TODS), 26(4):476–519, 2001.

[45] Maurizio Lenzerini. Data integration: A theoretical perspective. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 233–246. ACM, 2002.

[46] Alon Levy, Anand Rajaraman, and Joann Ordille. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Data Bases, pages 251–262, 1996.

[47] Renée J. Miller, Laura M. Haas, and Mauricio A. Hernández. Schema mapping as query discovery. In Proceedings of the International Conference on Very Large Data Bases, pages 77–88, 2000.

[48] Raymond T. Ng, Patricia C. Arocena, Denilson Barbosa, Giuseppe Carenini, Luiz Gomes, Jr., Stephan Jou, Rock Anthony Leung, Evangelos Milios, Renée J. Miller, John Mylopoulos, et al. Perspectives on business intelligence. Synthesis Lectures on Data Management, 5(1):1–163, 2013.

[49] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.

[50] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.

[51] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Osborne/McGraw-Hill, 2003.

[52] Vijayshankar Raman and Joseph M. Hellerstein. Potter's wheel: An interactive data cleaning system. In Proceedings of the International Conference on Very Large Data Bases, pages 381–390, 2001.

[53] Michael Stonebraker, George Beskales, Alexander Pagan, Daniel Bruckner, Mitch Cherniack, Shan Xu, Verisk Analytics, Ihab F. Ilyas, and Stan Zdonik. Data curation at scale: The Data Tamer system. In CIDR, 2013.

[54] Michael Teschke and Achim Ulbrich.
Using materialized views to speed up data warehousing. Technical Report, IMMD 6, Universität Erlangen-Nürnberg, 1997.

[55] Panos Vassiliadis and Alkis Simitsis. Extraction, transformation, and loading. In Encyclopedia of Database Systems, pages 1095–1101. Springer, 2009.

[56] Jennifer Widom. Trio: A system for data, uncertainty, and lineage. In Managing and Mining Uncertain Data, pages 113–148, 2008.

[57] Allison Woodruff and Michael Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In Proceedings of the 13th International Conference on Data Engineering, pages 91–102. IEEE, 1997.

