UBC Theses and Dissertations


A study of provenance in databases and improving the usability of provenance database systems
AlOmeir, Omar (2015)


Full Text


A study of provenance in databases and improving the usability of provenance database systems

by

Omar AlOmeir

BSc in Computer Science, Prince Sultan University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2015

© Omar AlOmeir, 2015

Abstract

Provenance refers to information about the origin of a piece of data and the process that led to its creation. Provenance information has been a focus of database research for quite some time. In this field, most of the focus has been on the subproblem of finding the source data that contributed to the results of a query. More formally, the problem is defined as follows: given a query q and a tuple t in the results of q, which tuples from the relation R accessed by q caused t to appear in the results of q? The most studied aspect of this problem has been the development of models and semantics that allow this provenance information to be generated and queried. The motivations for studying provenance in databases vary across domains; provenance information is relevant to curated databases, data integration systems, and data warehouses for updating and maintaining views.

In this thesis, I look extensively at provenance models as well as different system implementations. I compare the different approaches, analyze them, and point out the advantages and disadvantages of each approach. Based on my findings, I develop a provenance system based on the most attractive features of the previous systems, built on top of a relational database management system. My focus is on identifying areas that could potentially make provenance information easier to understand for users, using visualization techniques to extend the system with a provenance browsing component.

I provide a case study using my provenance explorer, looking at a large dataset of financial data that comes from multiple sources.
Provenance information helps with tracking the sources and transformations this data went through and explains them to the users in a way they can trust and reason about. There has not been much work focused on presenting and explaining provenance information to database users. Some of the current approaches support limited facilities for visualizing and reporting provenance information. Other approaches simply rely on the user to query and explore the results via different data manipulation languages. My approach presents novel techniques for the user to interact with provenance information.

Preface

This thesis is an original intellectual product of the author, O. AlOmeir. No part of this thesis was previously published. No ethics certificates were required.

The GLEI Watch system and data described in Chapter 2, Sections 2.1 and 2.2, are based on the work of V. L. Lemieux, P. Phillips, H. S. Bajwa, and C. Li. None of the text of the thesis is taken directly from previously published or collaborative articles.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication

1 Introduction
  1.1 Motivations
  1.2 Contributions
  1.3 Terminology
  1.4 Thesis Organization
2 GLEI Watch System
  2.1 GLEI Watch system architecture
    2.1.1 GLEI Watch system current architecture
    2.1.2 GLEI Watch system new architecture
  2.2 The GLEI data
    2.2.1 Incorporating data from other sources
  2.3 Provenance use case

3 Background
  3.1 Types of provenance
    3.1.1 Lineage
    3.1.2 Why and where provenance
    3.1.3 How provenance and semiring polynomials
  3.2 Overview of approaches
    3.2.1 Lineage tracing in data warehouses
    3.2.2 Why and where semi-structured model
    3.2.3 DBNotes
    3.2.4 Orchestra
    3.2.5 Perm
    3.2.6 Other approaches
  3.3 Summary
  3.4 Related work
  3.5 Further background

4 Implementation
  4.1 Design principles
  4.2 SQL support
  4.3 PI-CS Provenance type
    4.3.1 Relational algebra provenance rules
    4.3.2 Provenance attributes renaming function
  4.4 System architecture
    4.4.1 Query translation
    4.4.2 Query converting
  4.5 User interface
  4.6 Future work

5 Experimental Evaluation
  5.1 Performance and overhead
    5.1.1 Experimental setup
    5.1.2 Data
    5.1.3 Queries
    5.1.4 Results
    5.1.5 Discussion

6 Provenance Explorer
  6.1 Provenance browsers
  6.2 User tasks
    6.2.1 Data related tasks
    6.2.2 Provenance information tasks
  6.3 Visualization principles
  6.4 Overview
    6.4.1 Data visualization
    6.4.2 The provenance graph
  6.5 Provenance Explorer implementation
  6.6 GLEI case study

7 Conclusions

Bibliography

List of Tables

Table 2.1  GLEI data schema
Table 3.1  Employee table
Table 3.2  Project table
Table 3.3  Q1 results
Table 3.4  Intermediate view from Q1
Table 3.5  Employee table with annotations
Table 3.6  Project table with annotations
Table 3.7  Q1 results in Perm
Table 3.8  Perm query results with missing values
Table 3.9  Provenance systems comparison table
Table 4.1  Results of an aggregation query without provenance
Table 4.2  PI-CS example results of an aggregation query with provenance
Table 4.3  Relational algebra conversion rules
Table 4.4  Examples of relational algebra conversion rules

List of Figures

Figure 2.1  The GLEI Watch system architecture
Figure 2.2  A geographic map visualization showing registrations per country
Figure 2.3  The new suggested system architecture
Figure 3.1  Mapping a relation into a graph
Figure 3.2  A data sharing graph
Figure 4.1  Provenance system architecture
Figure 4.2  Converted ASPJ query
Figure 4.3  Provenance system user interface
Figure 5.1  A line plot of the queries running with the overhead of my system measured
Figure 5.2  A line plot of the queries running without the overhead of my system measured
Figure 5.3  A barchart of the result sizes of different queries
Figure 6.1  The interface of the provenance browser from [2]
Figure 6.2  The visualization of the provenance of a value in a table in DBNotes, from [9]
Figure 6.3  The data visualizations with search, filter and sort options
Figure 6.4  Selecting an individual data item to see its provenance
Figure 6.5  The provenance data tables with the provenance graph
Figure 6.6  Drilling down to the ancestor of the table in Figure 6.5
Figure 6.7  A visualization of the query and the rewritten provenance query
Figure 6.8  The derivation history of GLEI data from origin to the final aggregated view
Figure 6.9  A world map with aggregated count of registrations in a country
Figure 6.10 A barchart showing aggregated count of registrations in a country
Figure 6.11 A treemap with aggregated count of registrations in a country

Acknowledgments

First of all, all praises go to Allah most merciful for his guidance.

I would like to start by expressing my utmost gratitude to my supervisor, Rachel Pottinger, for her support through the past two years and for being a constant source of guidance and inspiration through normal and extraordinary circumstances.

My gratitude to the second reader, Raymond Ng, for his time, original ideas, and thorough questions and feedback.

I am incredibly grateful to my family. First and foremost, my mother and father, for their unwavering dedication to our education throughout the years and their unrelenting love and support. They stood by me throughout this whole process and I shall always be grateful.

My gratitude to my wife, Abrar, my partner through this journey. Without her love and support I would not have been able to complete this work.
I am grateful for her bearing with me through this difficult process even though she was going through her own share of difficulty.

My gratitude to my siblings, Nora, Othman, and Khold, for believing in me and for pouring support and encouragement whenever I needed them.

To my daughter Serene, thank you for being my guiding light.

Thanks to my fellow students and lab-mates. Special thanks to Jessica Wong for helpful discussions and feedback.

Finally, many thanks to the Saudi government and Prince Sultan University for their funding and financial support.

Dedication

To my loving family.

Chapter 1: Introduction

Data provenance is defined as any information about the origin of a piece of data and the process that led to its creation. In [6], the definition is as follows: "Data provenance, sometimes called lineage or pedigree, is the description of the origins of a piece of data and the process by which it arrived in a database." Provenance has been a focus of research in a number of different areas. In database research, most of the focus has been on the subproblem of tracking the source and process that contributed to the results of a query. There has been a lot of work on developing comprehensive models and representations of data provenance. There has also been work on different implementations that looked at different aspects of the problem, such as propagating provenance data and querying provenance information.

1.1 Motivations

The motivations for studying provenance in databases vary from the prevalence of curated databases to updating and maintaining views. The study of provenance started as a study of lineage in data warehousing environments [12] and later extended into all relational databases. The challenge of data provenance comes from the fact that when multiple sources contribute to a result, tracking the different sources and the transformations that took place is difficult.
This difficulty is amplified by the different models and semantics used to describe provenance information.

In this thesis, I will take an extensive look at techniques to represent and describe provenance. I also aim to look at the pros and cons of current theoretical representations and system implementations, and at how different system implementations generate and query provenance information. I will look at some of the seminal works in representing database provenance, such as [6] and [17], some of their implementations, as well as the numerous surveys and primers that sum up the state of provenance research, including [14] and [8]. My goal is to compare them in terms of common traits, point out the pros and cons of each approach, and find out what facilities could be useful for users when looking at provenance information.

The end goal and motivation of this project is to find novel ways to explain provenance information to users. In my case, I started out by looking at a large dataset of financial data that comes from multiple sources. This dataset requires the use of provenance information to find the sources and transformations the data went through, in order to explain them to the users in a way they can trust and reason about. The challenge here is that there has not been much work focused on presenting and explaining provenance information to database users. Most of the focus has been on formulating the problem and coming up with formal models and query language extensions. Some of the current approaches support (limited) facilities for visualizing and reporting provenance information. Other approaches simply rely on the user to query and explore the results via different data manipulation languages. I reach the conclusion that most of the work on data provenance is lacking in terms of taking the end-user's perspective. The models can be complex and some query languages can be unintuitive.
I summarize the most useful features of current models and systems, and classify them according to what I find to be the most important features. Using this scheme, I implement my own provenance system and extend it with visualization facilities, making it a more user-centered effort than current approaches.

1.2 Contributions

There has been a lot of work on the development of provenance-enabled database management systems, as explored in Chapter 3. However, most of the work has been focused on developing comprehensive systems with extended query language support over SQL or other semi-structured languages. My goal in this thesis is not to propose a new system or a new model, but to explore the space of provenance research, implement a system based on previous approaches with any necessary modifications, and to study and improve how users handle provenance data in the system. To summarize, I identify my contributions as:

• Take a comprehensive look at provenance research in databases and come up with a concise summary of desirable features for future systems.

• Implement an efficient and extensible system based on my findings, with sufficient support for query types that I can test.

• Extend my system with web APIs that enable my visualization component. I incorporate provenance information into my visualizations in a simple and easy to use fashion.

• Develop a provenance exploration component for database provenance that is both novel and efficient.

• Verify my approach with a case study by developing an application for the GLEI financial dataset using my provenance system and provenance explorer.

1.3 Terminology

To avoid confusion, I will define my terminology in this subsection. The field of data provenance is relatively new, and research is often presented with different words used to describe the same things. I would like to define and unify the terminology used throughout this thesis here.

The word model is used throughout this thesis to refer to logical models.
Storage or physical models will be referred to explicitly as such. Lineage, why, where, and how provenance will be referred to as notions or types of provenance, or provenance semantics; this can be a point of confusion since some works refer to them as models. Other works refer to them as part of a bigger provenance category titled "data provenance"; I forgo this sub-categorization and call them types of provenance or provenance semantics.

Examples will for the most part be presented using relational tables and queries using SQL or SQL-like notation. However, exceptions will be made for systems or models that use different models and data manipulation languages.

I use the words lineage and derivation history interchangeably when talking about provenance information. I also use the word ancestor to refer to a source and immediate ancestor to refer to a direct source.

There are two main approaches to computing provenance: the information is either actively recorded and stored when the data is generated or a transformation is applied (eager), or it is generated when a user requests it (lazy). I will use these two terms to categorize systems in later discussion, but there are systems that blur the line between the two.

1.4 Thesis Organization

This thesis is structured as follows: I describe my motivation and the GLEI financial system in Chapter 2. Chapter 3 contains an overview of provenance research. I describe the system I implemented in Chapter 4, complete with a description of the theory behind it. I conduct some experiments to validate the system and measure performance, and present my results in Chapter 5. The visualization component is described in Chapter 6. Finally, I present my conclusions and a discussion of future work in Chapter 7.

Chapter 2: GLEI Watch System

The Global Legal Entity Identifier (GLEI for short) Watch Project was established with the aim to improve financial transparency and stability monitoring.
The GLEI Watch system integrates financial datasets from various heterogeneous sources called legal operating units (LOUs) and presents them to users for analysis and visualization. Each data unit (tuple) represents a legal entity identified by its LEI (legal entity identifier). LEIs are designed to be a single, universal standard identifier for any organization or firm involved in a financial transaction internationally [7].

In this chapter I look at the GLEI Watch system architecture and the changes I made to it in Section 2.1. I examine the GLEI data in Section 2.2 and finally present my GLEI provenance use case in Section 2.3.

2.1 GLEI Watch system architecture

The GLEI Watch system was already an up-and-running project when I joined to look over the challenges the team was tackling. I start by looking at the system architecture that was in place when I joined, then I move on to the changes I suggested.

Figure 2.1: The GLEI Watch system architecture

2.1.1 GLEI Watch system current architecture

When I started working on this project, the system had an architecture composed of different solutions that run sequentially to integrate and normalize the different datasets. The results are visualizations shown to the user, designed to answer a number of questions.

The GLEI system is composed of the following components:

• The Collector is a number of scripts that run daily to collect the data from LOUs (Legal Operating Units), either by directly downloading it or using web scraping techniques. The collector also performs some of the cleaning and normalization tasks, such as normalizing dates and adding geo-location to each legal entity. The LOU data comes in XML or CSV formats. The data files also have different schemas; the differences are resolved in the next stage.

• The Pentaho suite of tools provides a drag-and-drop data integration facility titled Spoon. The data integration scripts, described in XML, combine all data files into one file.
Some normalization is performed by the tool to rename the attributes and solve some XML problems. The results are then combined into a single file. The datasets lose some details in the process, such as the different semantics for similar attributes and the data types of attributes. The scripts are updated manually in case of new data formats or errors.

• The visualizations are implemented using Tableau [18], a visualization and analysis tool that enables fast visualization and rapid prototyping. The output of the previous component is converted into a compatible file format and loaded manually into Tableau. Tableau visualizations provide the user with the ability to browse and gain insights into the data. However, there are a number of challenges with visualizing data of this volume. The least of these challenges is data clutter when attempting to show all the information in a single visualization; see Figure 2.2.

Figure 2.2: Heavy data clutter can be seen in places with large numbers of registrations, like the US and Germany

2.1.2 GLEI Watch system new architecture

Figure 2.3: The new suggested system architecture

Part of my work was to suggest a design change and a new system architecture. Figure 2.3 shows the suggested design: the new system would be implemented using general purpose programming languages instead of relying on ready-made tools.

The following components make up the new system design:

• The Collector would remain essentially the same. However, the cleaning and normalization tasks done at this step would be moved over to the next component.

• The data extraction component would handle setting up and extracting records from XML data and performing the necessary cleaning and normalization steps for raw data obtained from the LOUs or other sources.
The output of this component would be the different datasets with a unified schema, ready for integration.

• Entity resolution is a component to link records that correspond to the same real-world entities. It is referred to by different names in the literature, such as deduplication or record matching. Entity resolution is only used when new data sets are added that do not use the legal entity identifier. In the current GLEI sources, each company is identified using the LEI; adding new data sets with a different identifier can make things difficult. Matching company records using the company name and address fields, for example, would require entity resolution methods.

• The API exports data to visualization and analysis components. Instead of Tableau I would use D3.js; this will support the goal of giving the user more flexibility to explore the data and query the system. It also gives me the ability to integrate provenance information into the visualizations.

2.2 The GLEI data

The Global Legal Entity Identifier dataset is collected from various sources for the Global LEI Watch project using the GLEI Watch system described in the previous section. The combined size of all the data is approximately 70 MB; it contains 312,295 rows, growing daily, with each row describing one legal entity. Table 2.1 shows the schema of the final dataset.

Table 2.1: GLEI data schema
LEI, RegistryNumber, LegalName, LegalForm, City, State, Country, PostCode, Address, Latitude, Longitude, LastUpdateDate, PortalDate, Source

2.2.1 Incorporating data from other sources

One goal of the GLEI Watch system is to incorporate datasets from sources other than the LOUs. Other datasets considered for the project include:

• Federal Financial Institutions Examination Council Reports of Condition and Income (FFIEC Call Reports).
Contains hierarchy ownership information of American financial institutions.

• Company information from the Securities and Exchange Commission Electronic Data Gathering, Analysis and Retrieval system (SEC EDGAR).

• A member list from the Federal Home Loan Banks (FHLB).

• Legal entity identifier information from the GMEI utility, a pre-Local Operating Unit (LOU) of the Global Legal Entity Identifier System (GMEI LEI).

• The International Consortium of Investigative Journalists' dataset of the leaks on offshore tax havens: http://offshoreleaks.icij.org/search

The biggest challenge is that these datasets do not utilize the LEI, making integration a somewhat difficult task. We can use the LegalName or address fields to match, depending on the datasets. There is little overlap with exact matching (∼200 tuples with the FFIEC reports); using research in entity resolution can help. However, most entity resolution methods give fuzzy matches with varying degrees of certainty. The variation in names across datasets could be simple or complex. Presenting the results of matches to users can enhance the trustworthiness of the data; this is one area where provenance information can help.

2.3 Provenance use case

The nature of this project uncovers various interesting data-related challenges in data integration, data cleaning, information extraction, and data provenance. The last of these is my current topic of interest. I am interested in techniques to present and visualize provenance information to the users. In my work on the GLEI Watch project, I hope to integrate provenance capture and presentation techniques to improve trust in the data, in order to support users making decisions using the collected and normalized final results. I use the finalized GLEI dataset to validate my data provenance approach.

Queries are entered into the system by users when asking for a visualization or a view of a subset of the data. The system and visualizations are designed to answer the users' questions.
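As an aside, the fuzzy LegalName matching discussed in Section 2.2.1 can be sketched in a few lines using Python's standard library. This is purely illustrative and not part of the GLEI Watch system; the records, the LEI strings, and the 0.85 threshold below are all invented.

```python
# Illustrative entity-resolution sketch: rank GLEI registry rows whose
# LegalName is approximately equal to an incoming record's name.
# The LEIs and names below are fabricated; only the LEI/LegalName
# attributes come from the real schema (Table 2.1).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized similarity in [0, 1], case-insensitive.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_candidates(name, glei_rows, threshold=0.85):
    """Return (LEI, LegalName, score) candidates above threshold, best first."""
    scored = [(lei, legal_name, similarity(name, legal_name))
              for lei, legal_name in glei_rows]
    return sorted((c for c in scored if c[2] >= threshold),
                  key=lambda c: c[2], reverse=True)

glei = [("5493001KJTIIGC8Y1R12", "Example Bank AG"),       # fabricated
        ("5493000IBP32UQZ0KL24", "Sample Holdings Ltd")]   # fabricated
print(match_candidates("EXAMPLE BANK A.G.", glei))
```

In practice such matches would be shown to the user together with their scores and provenance, rather than accepted automatically, which is exactly the trust issue raised above.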
An example query would be: What are the cities in which the most registrations occur? Translated into SQL, the aggregation query would look like this:

SELECT count(LEI), City FROM GLEI GROUP BY City

The provenance system would transform the query and generate a query that gives the data results as well as the provenance information. The results of the query with provenance would be the entire dataset. This amount of information can be overwhelming. Presenting the user with an interactive visualization they can inspect and use to browse the provenance results can help. The user could also edit the data, annotate it, or export subsets of it to files, all the while maintaining the original data provenance information. The user could also dig deeper and inspect a deeper level of provenance ancestors. A full description of the experiment is provided in Chapter 6.

Chapter 3: Background

Data provenance has been studied in a number of different contexts, resulting in a number of data models and system implementations based on those models. In this chapter, I look at provenance models as well as different system implementations, compare the different approaches, analyze them, and point out the advantages and disadvantages of each approach.

My focus is on identifying areas that could potentially make provenance information easier to understand for users and help me gain insight before developing my own system and provenance exploration component. A summary of my comparison results gives a breakdown of systems according to their most important features; this scheme helps me gain insight into the research and identify essential features for a provenance system.

Guided by my findings in this chapter, I develop a system that contains the most identified desirable features.
My system is also made to be applicable to a set of financial data (described in Chapter 2), with dissemination techniques inspired by the facilities present in systems like DBNotes [9] and workflow provenance systems, and it goes a step further with visualization components and provenance browsing capabilities. I believe that such a system can make for a great contribution to this field of research.

I start this chapter by looking at the different provenance types in Section 3.1. Then I explore the current system implementations and some theoretical models in Section 3.2, with a discussion of each approach. I summarize my results in Section 3.3 and present several research efforts that surveyed provenance research in Section 3.4. In Section 3.5, I briefly look at other works I believe to be relevant to provenance research.

3.1 Types of provenance

The discussion will focus on four main types of "data" provenance: lineage, why, where, and how provenance. How provenance can be seen as a type of transformation provenance, since it represents the operations that a tuple went through from input to output. There are also other types that do not fall under the category of "data" provenance, such as transformation provenance shown in [16]. However, only "data" provenance types were given widespread attention in the field. Therefore, discussion of other types will be limited to their respective systems.

The goals of these four provenance types can be summarized as finding the source of a piece of data and the process it went through to make it into the results of a query. Most of the work in the database literature has been focused on that same subproblem: explaining where data in the results of a query come from and the process it went through. To define the problem formally:

Definition 1. For a database D, a query Q, and the results Q(D) of applying the query to the database:
The provenance problem in databases is to find the sources in D that contributed to Q(D).

The aforementioned four notions of provenance have been the focus of most provenance research in databases. One of the earliest applications of provenance in databases is lineage in data warehouses, as seen in [11]. In [6] the authors define a semi-structured model and through it formally define the notions of why and where provenance. In [17], the authors define the notion of how provenance using semiring polynomials. All of these models have had a substantial effect on the research and have resulted in the implementations of multiple systems and query languages. It is worth noting that despite adhering to the abstract notions of provenance defined in those works (lineage, why, where, and how), implementations often differ from the logical models presented in [11], [6], and [17]. Hence a discussion of abstract provenance types is necessary before we go into the different implementations.

Table 3.1: Employee table

      name  salary  phone
t1    Omar  20k     1234
t2    John  22k     1233
t3    Ali   25k     1239

Table 3.2: Project table

      name  department   project
t4    Omar  CS           Printing
t5    Omar  CS           Programming
t6    Ali   Management   Training

3.1.1 Lineage

Other than the fact that lineage is often used as a synonym for provenance, it is also known as one of the earliest types of provenance. Defined in [11], lineage gives the tuples from the source that caused a tuple to appear in the answer of a query or in a view. To illustrate, applying the query Q1 to the data in Table 3.1 and Table 3.2, we get Table 3.3 with two distinct answers.

Q1:
SELECT name, phone
FROM employee e, project p
WHERE e.name = p.name

To answer the question "What is the lineage of r1 in the results of Q1?", the notion of lineage gives the answer as simply the tuples that caused r1 to appear in the results. In this case, the answer would be {t1, t4, t5}.
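To make the lineage computation concrete, the following small sketch (my own illustration, not part of the thesis system) recomputes which Employee and Project tuples participate in some derivation of a Q1 result tuple; the data mirrors Tables 3.1 and 3.2.

```python
# Toy data mirroring Table 3.1 (Employee) and Table 3.2 (Project).
employee = {"t1": ("Omar", "20k", "1234"),
            "t2": ("John", "22k", "1233"),
            "t3": ("Ali",  "25k", "1239")}
project  = {"t4": ("Omar", "CS",         "Printing"),
            "t5": ("Omar", "CS",         "Programming"),
            "t6": ("Ali",  "Management", "Training")}

def q1():
    # SELECT name, phone FROM employee e, project p WHERE e.name = p.name
    # (distinct answers, as in Table 3.3)
    return {(name, phone) for (name, _, phone) in employee.values()
            if any(p[0] == name for p in project.values())}

def lineage(result):
    # Lineage: every source tuple that takes part in some derivation
    # of `result`, per the notion from Section 3.1.1.
    name, phone = result
    emp_ids  = {tid for tid, (n, _, ph) in employee.items()
                if (n, ph) == (name, phone)
                and any(p[0] == n for p in project.values())}
    proj_ids = {tid for tid, p in project.items() if p[0] == name}
    return emp_ids | proj_ids

print(sorted(lineage(("Omar", "1234"))))  # ['t1', 't4', 't5'], as in the text
```

Note that this brute-force rescan is only for intuition; the systems surveyed later either rewrite the query or propagate annotations instead of re-scanning the sources.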
The notion of lineage was criticized as not being precise enough, which led to the development of the next type, why provenance.

3.1.2 Why and where provenance

In [6], the authors define a more formal notion of provenance. Through a semi-structured model that I will explore later on, they define the notions of why and where provenance. For my purposes, I will present illustrative examples using SQL and relational data and refer to the next section for discussion of the semi-structured model. Intuitively, why provenance explains which tuples (witnesses) are necessary to make a tuple appear in the answer. More precisely, the authors define the ideas of witnesses, witness bases, and minimal witness bases; I will look at those concepts more formally when I discuss the why and where semi-structured model. For now, the intuition is to define a more precise notion of lineage where a smaller set of tuples is used to represent provenance information. To continue with our running example, the why provenance of r1 in Table 3.3 would be {t1, t4} or {t1, t5}; either is sufficient to explain the presence of r1 in the results. In other words, you do not need both t4 and t5 to have r1 in the results of Q1, you only need one. This is something that lineage cannot represent.

         name   phone
    r1   Omar   1234
    r2   Ali    1239

    Table 3.3: Q1 results

Where provenance answers the question of where in the input that piece of data is copied from. It is a more precise notion of provenance that is only concerned with the location of a source of data. In our example, the where provenance of r1 in Table 3.3 would be {t1:name, t1:phone}, since the data is copied directly from attributes in tuple t1.

3.1.3 How provenance and semiring polynomials

In [17], Green et al.
introduce the use of semiring polynomials to represent provenance information and the notion of how provenance. Since the two concepts are tightly related, it is worth explaining what semirings are and how they are used before discussing how provenance.

Semirings

A semiring can be defined as "a domain plus two operations satisfying certain properties" [4]. The definition and inspiration for semirings come from the constraint satisfaction problems domain [4]. A formal definition is as follows:

Definition 2. In the semiring (N, +, ∗, 0, 1), the domain is the natural numbers N. The operations are addition + and multiplication ∗; both are closed, associative, and commutative. 0 is the unit element of the + operation (0 + a = a), 1 is the unit element of the ∗ operation (1 ∗ a = a), and 0 is the absorbing element (0 ∗ a = 0). ∗ distributes over +.

These properties can be instantiated to represent data provenance using polynomials. One such semiring is (N[X], +, ·, 0, 1), where the domain is polynomials with natural number coefficients over the set of tuple annotations (X). The addition operation refers to union and multiplication refers to join. This domain and these operations give us polynomials that can be used to represent detailed provenance information such as the ones in the next subsection.

I lay out the theoretical foundation of semirings here and discuss them further in Section 3.2 when I discuss the Orchestra system.

How provenance

The notion of how provenance answers the question of what process a piece of data went through to end up in the output. This notion, captured using polynomials, can give a more detailed account of the transformations applied to the piece of data. In our example, the how provenance of r1 in Table 3.3 would be the polynomial t1·t4 + t1·t5. This representation shows that t1 was joined with t4 and with t5, and the subsequent results were unioned to get the final result r1.

3.2 Overview of approaches

In this section, I will look at different implementations of provenance database systems.
The focus is on how these systems represent and query provenance information. All the approaches we explore are system implementations, except for the why and where theoretical data model from [6], which as far as I know has not seen an implementation with that exact data model. The data model in [6] is included because it is one of the earliest efforts to formally define provenance in databases, and it gives insight into some of the possible problems that come up when dealing with provenance data. A discussion follows every subsection to analyze and look at the advantages and disadvantages of these approaches.

3.2.1 Lineage tracing in data warehouses

    name   salary   phone   department   project
    Omar   20k      1234    CS           Programming   t1, t4
    Omar   20k      1234    CS           Printing      t1, t5
    Ali    25k      1239    Management   Training      t3, t6

    Table 3.4: Intermediate view from Q1 (the last column shows the lineage for illustration purposes)

In [11] and [12] there is one application of lineage (provenance) tracing in data warehouses. In a warehouse environment, data is collected and integrated from different sources. The data is then presented to users as materialized views. The user of a data warehouse may need to drill down and see the source of a data item. This system provides that ability to the users.

The system guarantees that the lineage it computes will contain only relevant tuples and that it will be complete: it will contain all contributing tuples. Lineage is computed at tuple granularity; users can ask for the lineage of any tuple in a traceable view, and the system uses the inverse query to generate the source tuples. Provenance information is simply represented as the source tuples. An interesting aspect of the system is that the user specifies, when defining a view, whether the view is traceable. If it is, the auxiliary views are generated immediately. The tracing and maintenance procedures are also generated and stored as meta-data.
So while this system is lazy in that the provenance data is only computed on request, it eagerly generates and stores the facilities that allow the lineage to be traced once a view is defined as traceable.

To illustrate how the system works, we can look back at our example of query Q1. The system looks at the view definition Q1, divides it into two ASPJ segments, before and after the join, and defines the necessary intermediate view, in this case Table 3.4. Then it generates a set of queries on the intermediate views for the lineage of the results. After that, it traces the lineage of the intermediate views from the source tables, as shown in Table 3.4. For every tuple, it unions the provenance from the intermediate views that contribute to the source tuple. Hence the provenance of r1 would be returned as the union of the lineage of the two tuples in Table 3.4: {t1, t4} ∪ {t1, t5} = {t1, t4, t5}.

Discussion

Lineage is criticized for being imprecise compared to why provenance. However, to its advantage, it is very intuitive and relatively easy to compute. Lineage of views also offers some strong guarantees on the relevance of the provenance generated. Despite being an early system, the representation of lineage as simply source tuples is simple and intuitive. The generation of intermediate views and queries when a view is defined is very efficient. While the approach is lazy, it eagerly computes all the necessary components for efficient generation of provenance for traceable views. The lazy approach makes storage space a non-issue; the trade-off between space and time is a key point we will see with future systems.
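The eager/lazy split described above can be made concrete with a small sketch (my own simplification, not the actual algorithm from [11]): the auxiliary post-join view is materialized once, when the view is declared traceable, and a later lineage request is answered with a simple filter over it. The table, the `aux_q1` name, and the `tid` column are hypothetical.

```python
# Sketch: eager setup of an auxiliary view, lazy answering of lineage requests.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (tid TEXT, name TEXT, salary TEXT, phone TEXT);
CREATE TABLE project  (tid TEXT, name TEXT, department TEXT, project TEXT);
INSERT INTO employee VALUES ('t1','Omar','20k','1234'),('t2','John','22k','1233'),
                            ('t3','Ali','25k','1239');
INSERT INTO project  VALUES ('t4','Omar','CS','Printing'),('t5','Omar','CS','Programming'),
                            ('t6','Ali','Management','Training');
-- Done eagerly, once, when the view is declared traceable: materialize the
-- intermediate (post-join) segment together with the ids of its source tuples.
CREATE TABLE aux_q1 AS
  SELECT e.tid AS e_tid, p.tid AS p_tid, e.name, e.phone
  FROM employee e, project p
  WHERE e.name = p.name;
""")

def lineage(name, phone):
    """Lazy part: answer a lineage request with a filter over the auxiliary view."""
    rows = con.execute(
        "SELECT e_tid, p_tid FROM aux_q1 WHERE name = ? AND phone = ?",
        (name, phone)).fetchall()
    return sorted({tid for row in rows for tid in row})

print(lineage("Omar", "1234"))  # ['t1', 't4', 't5']
```

The base tables are touched only at materialization time; every subsequent lineage request is a selection, not a join.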
On a different note, support for aggregation queries in such an early implementation of provenance in databases is interesting to see, since it is missing from some later systems.

On the downside, the fact that lineage is not as precise as the notion of why provenance could cause problems in explaining the provenance of some results (as seen in the subsection on why provenance in Section 3.1). Furthermore, lineage is only presented as source tuples and the view definition. Such a representation can make it difficult to explain how the results actually came to be, especially for complex queries, and it limits the design of dissemination tools. One of the biggest missing features is the lack of storage of provenance information; the system is defined to only explain the tuples in views and not much else. However, this could be due to the nature of the data warehousing problems that the system aims to tackle.

Lineage in data warehouses is limited to functions that cannot be generalized and used in my case. With no storage of provenance data, reversing queries would limit provenance dissemination to simply showing the ancestors of a piece of data when requested. This may be sufficient in view update problems, but it is not enough to explain the derivation history of a piece of data or for data trust applications.

3.2.2 Why and where semi-structured model

In [6], the authors define the why and where notions of provenance within the context of a semi-structured model. The data model described in [6] is deterministic: each location of a piece of data can be described by a unique path. It is an edge-labeled tree model; the out-edges of each node are labeled with pieces of data. Relations can be expressed in this model by mapping keys to edges.

The following is a summary of some properties of this model:

• The model represents label:value pairs as follows: x:y.
These pairs represent the edge x and the sub-tree y that it leads to.

• {a:1} represents the path a leading to the value 1.

• x.a:1 is equivalent to {x:{a:1}}.

• v(p) denotes that p occurs in v; otherwise it is undefined.

• The path representation of {a:{1:c, 3:d}} is {(a.1, c), (a.3, d)}.

• w is a substructure of v if the path representation of w is a subset of the path representation of v.

• The deep union of v1 and v2 is the value whose path representation is the union of the path representations of v1 and v2. This is only defined if the result is a partial function (if the result is not a set of unique paths, it is not defined). Example: {a:1, b:{c:2, c:3}, e:5} is not a partial function; it violates the requirement that each path lead to a single value (c leads to both 2 and 3).

Relations are encoded into this model as follows: the relation's name is the first edge at the root, then the keys of the relation are mapped as edges, and each key is mapped to the tuple it identifies; if there is no key, the entire tuple is modeled as an edge. Compound keys are modeled as semi-structured pieces of data on edges following each other. Figure 3.1 shows a mapping of Table 3.1 into this model.

    Figure 3.1: Mapping a relation into a graph

The query language for the above model is referred to as DQL (D for deterministic) and is defined as follows:

    where
        p1 in e1;
        ...
        pn in en;
        condition
    collect x

pi is a pattern of the data in the model, but it can also include variables. ei is like pi but can include nested where...collect statements. condition is a predicate applied to the pi's. The query basically means: consider each assignment of variables that makes pi a substructure of ei, evaluate the condition for each assignment, and if it is true, add the value of the collect expression to the results. Finally, union all results. In this case the union used is a deep union (see the definition above). It is worth noting that since this is derived from semi-structured query languages, there are similarities.
One can think of the query as finding matches for the patterns pi and the conditions, if present, and returning the structure defined for the output in collect.

The aforementioned model and query language are used as a setup for computing why and where provenance. Why provenance is determined by finding the witnesses for a result of the query through syntactic analysis of queries. A witness is formally defined as follows:

Definition 3. A value s is a witness for a value t with respect to a query Q and a database D if t is a subset of the results Q(s) and s is a subset of D.

The definition of a witness basis is more restrictive than the full set of witnesses. A witness basis is a smaller witness set determined by looking at the query and the data. There is also an algorithm in [6] to efficiently calculate the witness basis, which I will not cover here. The downside to witness bases is that they are not invariant under query rewrites; a minimal witness basis is. Equivalent queries are defined as those that generate equivalent results from the same sources. A minimal witness basis is defined as the set of all minimal witnesses, where a witness is minimal if none of its proper subsets is also a witness. In my example, {t1, t4} is a minimal witness and so is {t1, t5}. {t1, t4} and {t1, t5} can be included in a minimal witness basis, but the set {t1, t4, t5} is not included, since it contains the previous two sets, which are witnesses.

For where provenance, the intuitive approach is to follow a syntactic analysis of queries similar to the one seen in why provenance. However, this proves to be difficult for a number of reasons. For instance, a query that contains a constant value can have the query itself as its where provenance. There can also be multiple contributing inputs to the output.
Since a syntactic approach is difficult, a class of traceable queries is introduced that limits the queries to avoid undesirable properties and defines properties that make where provenance easy to compute and invariant under rewrites.

Discussion

The why and where provenance approach, using a data model based on semi-structured data models, was the first to introduce the notions of why and where provenance. While the notions themselves were very influential, the semi-structured model and definitions did not catch on. There are some merits to it: using a deterministic model with a unique path for each value can prove useful for computing the provenance information of a tuple. Furthermore, the query language lends itself to syntactic analysis to find and generate the provenance information. A major contribution of this work is the focus on defining subclasses of queries where provenance information would be invariant under query rewrites. In other words, all equivalent queries would have equivalent provenance information. Thinking about query rewrites will affect future works, as we will see later in this section, but this approach and its associated algorithms were not as effective.

The model also imposes a number of difficulties. It can be difficult to read and reason about the data model, and it can also be difficult to query. Reminiscent of XML shredding, converting relational tables to a graph can be cumbersome. The query language, also similar to semi-structured languages, is very difficult to use and understand. This is probably due to the theoretical nature of this work, but there is no real concern for querying or viewing the provenance information itself, which proved to be a challenge for later approaches.

3.2.3 DBNotes

         name          salary
    t1   Omar {good}   20k
    t2   John {bad}    22k

    Table 3.5: Employee table with annotations

         name           department   project
    t4   Omar {great}   CS           Printing

    Table 3.6: Project table with annotations

DBNotes [9] [3] is a system for annotating databases.
It allows for adding annotations to any tuple in a database and propagating those annotations into query results. The system also explains provenance to users by showing a possible journey that a value in the results took to get there. Annotations do not adhere to the database schema; they are added to any attribute in a tuple and stored as extra information next to the attributes they annotate. Annotations are propagated in different ways, specified by the user when writing a query. The default scheme (default) propagates the annotations according to where the data is copied from. It naturally copies annotations from the source of the data to the results. The downside is that two equivalent queries can have the same results with different annotations. To demonstrate, Table 3.5 and Table 3.6 are two tables with annotations on the name attribute, shown in curly brackets. The annotations measure performance on a scale (bad-good-great). Applying the following query to the two tables:

    SELECT DISTINCT e.name, e.salary, p.project
    FROM Employee e, Project p
    WHERE e.name = p.name
    PROPAGATE DEFAULT

gives us the result: (Omar {good}, 20k, Printing). However, applying an equivalent query:

    SELECT DISTINCT p.name, e.salary, p.project
    FROM Employee e, Project p
    WHERE e.name = p.name
    PROPAGATE DEFAULT

would give us the same result with a different annotation: (Omar {great}, 20k, Printing).

The other scheme (default-all) propagates the annotations according to where the data comes from in all equivalent queries, essentially guaranteeing invariance under rewrites. For example, applying either of the previous two queries with the default-all scheme:

    SELECT DISTINCT p.name, e.salary, p.project
    FROM Employee e, Project p
    WHERE e.name = p.name
    PROPAGATE DEFAULT-ALL

would give us the following result: (Omar {good, great}, 20k, Printing). The user can also define a custom scheme, specifying which annotations to favor in such cases. These schemes are specified at the time the query is written.
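The difference between the two schemes can be illustrated with a toy propagation function (my own simplification of the behavior described above, not DBNotes' pSQL engine), operating on the name annotations from Tables 3.5 and 3.6:

```python
# Toy sketch of DBNotes-style annotation propagation: each name carries a set
# of annotations; "default" copies annotations from whichever relation the
# query projects the name from, while "default-all" unions both.
employee = {"Omar": {"good"}, "John": {"bad"}}   # annotations on Employee.name
project  = {"Omar": {"great"}}                   # annotations on Project.name

def propagate(select_from, scheme):
    """Annotations on the joined 'name' value, for names present in both tables."""
    out = {}
    for name in employee.keys() & project.keys():
        if scheme == "default-all":
            out[name] = employee[name] | project[name]
        else:  # "default": copy from the relation named in the SELECT clause
            out[name] = set(employee[name] if select_from == "e" else project[name])
    return out

print(propagate("e", "default"))      # {'Omar': {'good'}}
print(propagate("p", "default"))      # {'Omar': {'great'}}
print(propagate("e", "default-all"))  # {'Omar': {'good', 'great'}}
```

The first two calls correspond to the two equivalent queries above: same data, different annotations. The third shows how default-all restores invariance by unioning.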
The default-all scheme handles variation in query rewrites: the system can compute a finite set of equivalent queries that represents the infinite possibilities of equivalent queries. Queries are equivalent as long as the data in the results is the same, regardless of annotations. This set of equivalent queries is called a query basis and is defined as a finite set of pSQL queries with default propagation schemes that are equivalent to the query. The provenance of a query under the default-all scheme is the union of all annotations of the output returned by the query basis. The authors define an algorithm for computing the query basis in [3] that I will not cover here.

The approach to provenance is based on the notion of where provenance. This is evaluated eagerly by propagating annotations from the source, or lazily by computing a reverse query when a user asks for an explanation of why a value is in the results.

DBNotes also introduces a query language, pSQL, which has two new features. The first is a PROPAGATE clause, where a user specifies one of the three propagation schemes available. To describe how querying works: "for each tuple in the Cartesian product of relations in the FROM clause, we check that the conditions in the WHERE clause are satisfied and then propagate the qualifying annotations to the output as specified in the PROPAGATE clause." [9]. For duplicate tuples, the annotations are merged. The keyword DISTINCT is always used, since the system does not support bag semantics. The second feature, for querying provenance annotations, is the ANNOT keyword, which makes annotations first-class citizens that a user can match against. This can be seen in the following example query, where the annotation on the name field in the Employee table is matched against a pattern:

    SELECT DISTINCT *
    FROM Employee e, ANNOT(e.name) a
    WHERE a LIKE '%bad%'
    PROPAGATE DEFAULT

The resulting tuple would be: (John {bad}, 22k).

DBNotes features a number of other facilities to help users understand provenance information.
To explain provenance when a user asks about a piece of data, DBNotes generates a reverse query that extracts the sources and binds them to variables. Then it explains the results by showing the user the bindings and the conditions (WHERE and SELECT clauses) that give the results. DBNotes can also produce a graph of the journey a value took through different databases and transformations. It stores a database for the results of a query, and it keeps track of sources and transformations via annotations. This way it produces a visualization for the user explaining where a piece of data came from and through what queries.

Discussion

DBNotes uses a simple query language, very similar to SQL, with intuitive extensions to query annotations. It also implements a set of facilities that a user can use to interact with provenance information. It is the only database system I could find that offers visualization and explanation of provenance results. The generic approach to annotations is very interesting and allows for the implementation of multiple types of provenance. DBNotes' approach to handling equivalent queries is intuitive, although performance is clearly an issue, as seen in the evaluation in [3]. To quote: "the performance of a query under the default-all scheme can be at worst eight times slower than the performance of the same query under the default or no propagation scheme (i.e., SQL query). At best, it runs about twice as slow." [3], considering the default scheme is already slower than a regular query by 18-40%.

On the downside, eager annotation approaches can also have serious scalability concerns in terms of space. This is not an inherent flaw in DBNotes, but it is one that is not mitigated by the DBNotes designers, for example by limiting the number of levels through which annotations are propagated. Although admittedly, the designers attempt to trade space for time.
The tight coupling of provenance annotations with the data they are stored alongside prevents the system from implementing bag semantics and hinders the flexibility of the system. Finally, when explaining provenance, the verbosity of the output can be a concern. Looking at a large amount of text with pieces of the query to explain a simple result can be prohibitive to users.

For my purposes, DBNotes is a generic system for handling annotations and can be generalized to any context. However, if those annotations are complete provenance information, the system may not be able to handle such a load. The examples shown in [9] [3] have a small number of notes compared to the number of attributes. More importantly, the GLEI system does a lot of aggregation, and resolving aggregation with such a where-provenance approach can prove to be difficult. I will leave discussion of the provenance explanation and visualization components of DBNotes to a later chapter.

3.2.4 Orchestra

    Figure 3.2: A data sharing graph. The oval with + is the data source, rectangles are tuples, and m is the schema mapping

In [17], the authors propose that symbolic representations of semirings are sufficient to track and carry information from input to output in relational algebra operations. The authors use polynomials with integer coefficients. This representation also works in the space of incomplete, uncertain, and probabilistic databases, with a theorized superior representation to such structures as maybe-tables, c-tables, and probabilistic event tables. Polynomial representations of provenance can capture detailed provenance information with the semiring (N[X], +, ·, 0, 1). We discussed semirings previously in Section 3.1.

Orchestra uses provenance semirings to represent provenance information. Orchestra is a collaborative data sharing system where users contribute and share data using schema mappings. This network of mappings and relations forms the model that is the basis for Orchestra.
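The flexibility of the semiring view can be illustrated by evaluating the how-provenance polynomial of r1 from Section 3.1.3, t1·t4 + t1·t5, under different semirings. The sketch below is my own illustration of the idea in [17], not Orchestra code, and the leaf valuations are chosen arbitrarily:

```python
# Sketch: one provenance polynomial, three semiring interpretations.
# Only the (plus, times) operations and the leaf valuation change.
def evaluate(valuation, plus, times):
    # The how provenance of r1: t1*t4 + t1*t5, as a list of monomials.
    monomials = [("t1", "t4"), ("t1", "t5")]
    total = None
    for mono in monomials:
        prod = valuation[mono[0]]
        for t in mono[1:]:
            prod = times(prod, valuation[t])
        total = prod if total is None else plus(total, prod)
    return total

# Derivability (bool, OR, AND): is r1 still derivable if t4 is deleted?
print(evaluate({"t1": True, "t4": False, "t5": True},
               lambda a, b: a or b, lambda a, b: a and b))       # True

# Counting (N, +, *): number of derivations under bag semantics.
print(evaluate({"t1": 1, "t4": 1, "t5": 1},
               lambda a, b: a + b, lambda a, b: a * b))          # 2

# Trust ([0, 1], max, min): trust in r1 given trust scores for the sources.
print(evaluate({"t1": 0.9, "t4": 0.5, "t5": 0.7},
               lambda a, b: max(a, b), lambda a, b: min(a, b)))  # 0.7
```

The polynomial is computed once; each application (derivability, counting, trust) is just a different evaluation of the same expression, which is exactly what makes the semiring framework attractive.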
Orchestra also supports a language for querying provenance called ProQL, as well as methods for storing and indexing provenance information.

The data model is basically a graph of data sources and mappings, showing the original base data, tuples, and derivations. Figure 3.2 shows how Table 3.1 from our running example can be represented using a schema mapping expressed in Datalog. The schema mapping is as follows:

    m : E(name, salary, phone) :− A(name, salary)

Querying this model basically means showing elements of the graph to the user along with the results. Queries can ask for common provenance, specific sources, or specific tuples. The model itself is an encoding of provenance semirings; thus it can be used, by specifying different sum and product operations, to represent a number of different things, including derivability, probability, lineage, trust, and weight/cost. In the derivability example, which is the closest to the notion of how provenance, all base nodes are given a value (true), a derivation node is mapped to the product (AND) of the annotations of all its source nodes, and a tuple node whose derivations are all annotated is mapped to the sum (OR) of all mappings that lead to it. The language ProQL has two main functions: (1) return a projection of the graph (a subgraph) for specific tuples or paths, with bindings for distinguished variables, and (2) return how to evaluate a subgraph or bindings as an expression for a specific semiring.

I will focus on the first function, since it is more demonstrative of how provenance queries work. In (1), the user specifies in the FOR clause which variables to bind to relations or derivations. The WHERE clause is for conditions. The INCLUDE PATH clause specifies which nodes and paths to include in the output graph. The RETURN clause specifies bindings to be returned as result tuples. Figure 3.2 shows one data source, the two relations A and E, and the schema mapping m between A and E.
The following query has a variable $x bound to E and a variable $y bound to A; the INCLUDE PATH clause says to include all paths between A and E. It returns the part of the graph in Figure 3.2 and the tuples bound to $x as result tuples.

    FOR [E $x] <−+ [A $y]
    INCLUDE PATH [$x] <−+ [$y]
    RETURN $x

In (2), two new clauses are added. The EVALUATE semiring OF clause has the user specify what function they want to evaluate for the output (e.g., derivability, trust). The second clause is the ASSIGNING EACH clause, in which the user specifies values for leaf nodes (e.g., base tuples or nodes without derivations) in a fashion similar to switch statements in programming languages; a second such clause can be used to define a function that gives values for mappings.

In the described implementation, the storage model is as follows. Provenance information is encoded in relations. Tuples in relations are identified by keys. Mappings are stored in relations with the keys of the source and target relations. To execute queries: (1) schema mappings are converted into a provenance schema graph, which contains one node for each relation and one node for each mapping, with edges added according to the mappings' target or source keys; (2) the paths in the query are matched against the provenance schema graph to identify mapping and relation nodes; (3) a Datalog program is created that contains the mapping and relation nodes identified, as well as any source relations to use; (4) this program is executed in SQL on mappings and tuples. For indexing, the authors use access support relations (ASRs), which intuitively amount to storing joins between provenance relation paths in materialized views or relations.

In [19], the authors conduct a thorough evaluation of graph projection and annotation queries. The system performs well, and the indexes that are set up show speedups in execution.
This is largely new territory, so it is nice to see a performance evaluation.

Discussion

Provenance semirings can be used to model anything that uses annotations: lineage, probability, uncertainty. This allows Orchestra to cover provenance with different semantics and possibly different types, such as how and why provenance. This is due to the fact that semirings can still represent the simpler notion of why provenance. ProQL can also express complex queries in different contexts. Last but not least, a full performance evaluation with optimizations, such as support for indexes, is refreshing to see.

While Orchestra excels in many ways, there are a few challenges when looking at generalizing its findings to generic provenance applications. Most importantly, the graph model lends itself well to the collaborative data sharing application but can prove difficult to generalize to other areas. The semantics of the semiring operations, addition and multiplication, are tied to the nodes in the graph. Similarly, the query language ProQL is similar to graph query languages. As a result of tying everything to the graph, the provenance representations can be difficult to understand, and the queries themselves are complex as a result. Consequently, it may not be worthwhile to try to implement this approach in any other context.

    Results        Employee                   Project
    name   phone   p1     p2    p3     p4     p5           p6
    Omar   1234    Omar   20k   1234   Omar   CS           Printing
    Omar   1234    Omar   20k   1234   Omar   CS           Programming
    Ali    1239    Ali    25k   1239   Ali    Management   Training

    Table 3.7: Q1 results in Perm

Orchestra provides a great basis for a provenance-enabled data sharing context where schema mappings are prevalent. However, I am looking for a generic system implementation that would generalize to most datasets.
For example, if I apply this approach to the GLEI system (described in Chapter 2), which does not fit this use case, the result would be a small graph and the queries would become needlessly complex.

3.2.5 Perm

The main idea behind Perm [16] is to generate and query provenance information using easily optimizable SQL code. It has support for complex SQL queries (nested and correlated sub-queries and aggregation) as well as storage of provenance information in the same relation as the data.

The authors define the following principles for an ideal provenance system and proceed to build Perm accordingly: (1) support for different types of provenance (why, where, and how) with sound semantics; (2) provenance generation for SQL, including complex queries; (3) support for complex queries over provenance data; and (4) efficient generation and querying for large databases by only generating the necessary provenance information. Perm contributes to the four requirements as follows. (1) It supports the three types of provenance (why, where, and how) natively, without using a different model. (2) Perm supports ASPJ queries and set operations (union, intersection, difference), and it is the only system I have seen that supports nested and correlated sub-queries. (3) Perm uses a relational representation of provenance and full SQL queries over provenance information. (4) The user chooses when and what provenance to generate; a sub-query within the provenance query generates the provenance information. Hence generation and querying are both in the same extended SQL query, which is subsequently rewritten into a single SQL query that utilizes the DBMS optimizer for non-nested queries. For nested queries, it uses novel un-nesting techniques.

When a user asks a query Q and asks for its provenance in the same query, Perm returns the provenance of the results of Q to the user along with the results of Q. For example, the results of Q1, using PI-CS provenance, are represented with their provenance information in Table 3.7.
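The flavor of this relational representation can be approximated with a hand-written query (a sketch of the kind of output Perm produces, not its actual rewrite rules), which appends the witness-list attributes to the Q1 results. SQLite is used here as a stand-in DBMS:

```python
# Sketch: a hand-written stand-in for the provenance-generating rewrite of
#   SELECT PROVENANCE name, phone FROM employee e, project p WHERE e.name = p.name
# The result attributes are followed by all attributes of the witness list,
# following a prov_<relation>_<attribute> naming convention.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (name TEXT, salary TEXT, phone TEXT);
CREATE TABLE project  (name TEXT, department TEXT, project TEXT);
INSERT INTO employee VALUES ('Omar','20k','1234'),('John','22k','1233'),
                            ('Ali','25k','1239');
INSERT INTO project  VALUES ('Omar','CS','Printing'),('Omar','CS','Programming'),
                            ('Ali','Management','Training');
""")

rows = con.execute("""
    SELECT e.name, e.phone,
           e.name       AS prov_employee_name,
           e.salary     AS prov_employee_salary,
           e.phone      AS prov_employee_phone,
           p.name       AS prov_project_name,
           p.department AS prov_project_department,
           p.project    AS prov_project_project
    FROM employee e, project p
    WHERE e.name = p.name
    ORDER BY e.name, p.project
""").fetchall()

for row in rows:
    print(row)
```

Each output row pairs one result tuple with one full witness list, duplicating result tuples that have multiple witness lists, which is exactly the shape of Table 3.7.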
Perm represents provenance types differently. PI-CS provenance is represented as witness lists, the input tuples that were used to derive the output. This is represented by a relational table that includes the result tuples' attributes and values, appended with all attributes from the contributing tables, and NULLs for tuples/attributes that do not contribute to the results. For result tuples with multiple tuples in their witness lists, duplicates are added to the table, as in Table 3.7. All relations' attributes carry a prefix and the name of the original relation and attribute (e.g., prov_Employee_name). C-CS provenance is just a variation of PI-CS provenance where only tuples that had information copied from them are shown in the results; hence it is a tuple-granularity where provenance. For transformation provenance, the information is represented by wrapping XML tags around the parts of the query that do not contribute to the result, for each result tuple. The PI-CS representation is not compact, but it can easily be queried for interesting information. The duplication can also cause a problem by hiding the original results' multiplicities. However, this is still an exciting new approach to representing provenance in a relation, and it shows how the provenance information is associated with the result tuples explicitly.

Rewriting queries into provenance-generating queries in Perm works as follows: (1) declaratively define the semantics of the provenance approach and define a relational representation;
(2) from the declarative definition, come up with algebraic rewrites to transform a query into a "provenance-generating" query; and (3) translate the relational algebra rewrites into SQL rewrites that use the optimizer. This is explained in detail for the three types of provenance (PI-CS, C-CS, and transformation provenance) in [16], and I go into further detail describing our approach, which is based on Perm, in Chapter 4.

Perm's data manipulation language adds the following extensions to SQL: (1) the PROVENANCE keyword is added to the SELECT clause if the user wants the provenance of the query; (2) ON CONTRIBUTION allows the user to specify the type of provenance (PI-CS is the default); (3) BASE RELATION allows the user to limit how far back to go for results (e.g., adding BASE RELATION to an intermediate aggregation result would use only that in the provenance representation, rather than going all the way back to the base); this feature is not to be confused with propagation of provenance data. Perm also allows users to add and query provenance information generated manually or by other systems, as long as it abides by the structure used by Perm and is stored in additional attributes; this is done by adding the PROVENANCE keyword to the FROM clause with the provenance attributes (e.g., FROM PROVENANCE(name, employee, date)).

Discussion

Perm seems to be built on a solid foundation, defining sound principles inspired by previous provenance systems. It has support for aggregation and nested queries, which is a novel contribution to the provenance space. Everything is done in SQL with natural extensions, which gives it the ability to utilize the query optimizer. It also has support for multiple types of provenance, which I would argue is a necessity at this point, since the different semantics serve users differently; this can be seen in the discussion of the different types of provenance in Section 3.1.
Furthermore, it notably features a very intuitive representation of provenance that users can browse through or query using SQL. Since Perm is an optimized lazy approach, it avoids the trade-off of time vs. space. It also allows users to store provenance from other approaches, or provenance that is manually input or generated.

There are a few problems that can arise from the representation of provenance in Perm. The first is the verbosity of the output. This is something that cannot be helped, since the system opts for a complete representation of witness lists. The second problem comes from null values for provenance attributes. Null values in provenance attributes mean that those attributes did not contribute to the results. Although such missing values are semantically meaningful, problems can arise if the data itself contains missing values. As an example, consider applying the following query to Table 3.1 and Table 3.2:

SELECT PROVENANCE *
FROM (SELECT name FROM Employee UNION
      SELECT name FROM Project)

The results are shown in Table 3.8.

Results | Employee                     | Project
name    | p_name  p_salary  p_phone    | p_name  p_department  p_project
Omar    | Omar    20k       1234       | Omar    CS            Printing
Omar    | Omar    20k       1234       | Omar    CS            Programming
Ali     | Ali     25k       1239       | Ali     Management    Training
John    | John    22k       1233       |

Table 3.8: Perm query results with missing values

As described earlier, the provenance of the results contains all attributes from all contributing relations. The values for attributes that did not contribute to a result are left empty.

The last problem is inherent in all lazy approaches: Perm trades off propagation of provenance for smaller space. Perm stores provenance information in the same table as the data, showing only the immediate ancestor(s).
While Perm can be classified as eager in that provenance can be calculated immediately without users asking for it, the distinction remains that eager systems usually offer propagation of annotations, as we saw with DBNotes [9].

Perm's biggest strength is its ability to support provenance operations over a fully relational DBMS with good performance. The problems outlined above either arise in very specific scenarios or can be mitigated using different provenance dissemination techniques. The effects of verbose output can be reduced by dividing the data and showing users small, manageable subsets of it with provenance information. Missing values from the data can be made distinct from those that refer to non-contributing tables by filling in a different value for the latter. I am aware that this could prove tricky in certain scenarios, so the user can be given the option of how to fill these non-contributing cells. As for the last problem, even though Perm only stores information about immediate ancestors, graphs of derivation history can be constructed by tracing the source tables all the way back to the origin. This problem can therefore be resolved by enabling users to browse derivation history graphs of the data.

3.2.6 Other approaches

As part of this survey, I have briefly looked at other systems that do not warrant a full subsection or are outside the scope of the thesis. The first is similar to [11] in a lot of ways, so I do not go into further detail on it. The second is related to workflow provenance; since I am not covering workflow provenance systems, I will only briefly write about this approach here.

The database system Trio [26] supports lineage as part of its components and is built on the principles of lineage of views in data warehouses [11].
Trio supports lineage, along with uncertainty, as first class citizens that can be queried and stored. Trio features a lineage table which contains inverse queries for every tuple in a materialized view; this table can either be queried directly or through a proposed query language called TriQL. The lineage inverse queries generate the provenance on request.

The other approach is referred to as a graph model for workflow and database provenance [1]. This is an interesting venture into bridging the gap between workflow systems and database provenance. The two fields have remained separate for a while, with little effort beyond suggesting and spotting similarities [25]. The motivation is that while workflow provenance is well developed and used very often to describe the computation steps performed in constructing intermediate and final results, it lacks unified semantics. Each system can have its own semantics for provenance information, making it difficult to compare approaches. In database research, contribution semantics are some of the most researched aspects of provenance. Furthermore, provenance and lineage lend themselves to graph representations, while how and why provenance can be more difficult to represent as graphs. The graphical model is explained in [1], expressed in a data flow language based on nested relational calculus. This work sets the groundwork for possibly establishing a connection between scientific workflow and database provenance.

3.3 Summary

Table 3.9 shows a summary of the models and systems I looked at in this survey, broken down by basic features to make them easier to compare. I explain each property before going further into comparing them. Provenance types, query languages, models, storage and computing provenance have been explained already or are self explanatory.
It is worth noting that querying provenance refers to querying the provenance information itself and the operations available to do that; all approaches have query languages to query data, but not all support querying the provenance data itself. Browsing provenance refers to any facilities other than query languages; a system should ideally have both, to support different types of users. Provenance granularity refers to the granularity at which the system supports finding the provenance of a piece of data (table, tuple, attribute). Provenance data coupling is borrowed from [15] and is defined when I discuss related work in Section 3.4. However, in that paper the authors only categorize one system (lineage in warehouses); here I take it a step further and look at five different systems.

In this chapter I have looked extensively at the types of provenance, models and query languages in several database systems. My findings lead me to believe that the ideal system should have the following properties (the first three agree with the principles outlined in [16]):

1. Support for multiple types of provenance. Why, where, how and also transformation provenance can provide different information, and support different tasks for different users.

2. Support for a query language similar to SQL with intuitive extensions that allow for generating and querying provenance data. Moreover, the system should allow SQL queries over provenance data.

3. The model should be relational. Lineage in warehouses [11], DBNotes [9] and Perm [16] prove that the relational model is sufficient for most cases. Furthermore, using the relational model makes it possible to use existing optimizations for provenance queries.

4. Storage of provenance data gives different trade-offs to consider. Provenance loosely coupled to the data, or without coupling, has the advantage that querying provenance information becomes simpler, without data items carrying around their ancestors' attributes.
Tightly coupled information, however, keeps track of tuples and their respective provenance information more easily; tightly coupled approaches are the only ones to do that. Ideally a system should be able to support both storage methods for different granularities.

5. The system should allow for a hybrid approach for computing provenance: lazy generation of provenance with propagation like eager systems, all while keeping the user in the loop to determine the level of propagation. The goal is to avoid scalability concerns: “in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple” [22].

6. While data provenance semantics and systems are being developed to be as comprehensive as possible, there is little regard for the user side of things. Very few systems offer facilities for the user to visualize or browse through provenance information. I argue that this is necessary to facilitate the understanding of provenance data.

7. Provenance systems should support provenance information at different granularities. Tuple granularity is sufficient in most cases; however, attribute granularity can help when trying to explain changes at an attribute level (data cleaning tasks), and table granularity can also help with space concerns.

In the realm of theoretical models, thinking beyond the sub-problem of explaining the origin of tuples in the result of a query is limited; there is a need for thinking about what data to store, and how to store it, propagate it, present it and explain it to users. Now that the theoretical models are set in place, these problems should come to the forefront of provenance research.

3.4 Related work

A lot of the content in this survey is inspired and influenced by some of the numerous surveys in data provenance. In [8] the survey is dedicated to comparing why, where and how provenance in terms of how they relate to each other, as well as how the different approaches fit into these notions.
They also compare systems and develop simple examples, using SQL or datalog, that facilitate comparisons. Similarly, I attempt to present examples using SQL queries and simple relations that are easy to grasp.

I take inspiration for Table 3.9 mainly from [25]. The survey in [25] compares four different scientific workflow systems and Trio [26]. It looks into why they record provenance, how they represent it, and how they use it. The goal is to provide a taxonomy of provenance approaches. The approaches are compared based on multiple criteria. Provenance information can have different “subjects”: it can be collected directly about the data (data-oriented) or derived indirectly from the process and how the data came to be (process-oriented). It can also be collected at different granularities, although most of the research on database provenance is tuple-based. Provenance methods can be either eager or lazy; this is a popular classification that I have seen in multiple works. The eager methods involve using annotations, which can be rich and can carry more information, with implications for storage and scalability. The lazy methods involve inversion, like reversing a SQL query. In terms of representation, there is no standard across disciplines; though most approaches explored in that survey store provenance as XML, the database models I looked at in my survey all store provenance information in relations. Storing and updating provenance information is another issue: storing the annotations with the data can make them difficult to query, while leaving the storage of provenance to users to do manually can prevent completeness. As for dissemination of provenance data, this information is usually presented to users as relations they can query and browse through.

In [15], similar to [25], the authors attempt to categorize provenance approaches based on the following features: provenance model, query and manipulation functionality, storage model, and recording strategy.
The goal is to uncover research problems and suggest a sort of unification of these approaches across different disciplines by looking at things from a more conceptual level. This motivates me as well to look into simplified, broken-down versions of systems to make them easier to compare. This categorization scheme contains a number of novel ideas. In the provenance model category, they look at transformation and source provenance, with further subcategories that consider exactly what kind of data is being modeled. In terms of storage, systems are categorized into: (1) tightly coupled, where provenance information is stored with the data, (2) loosely coupled, where provenance information is stored with the data but logically separated, and (3) no coupling, where provenance information is stored in independent repositories. I adopt this idea in my results in Table 3.9. The authors also suggest a number of provenance-based manipulation operations: (1) a merge operation to merge transformations, (2) a split operation to split complex transformations, and (3) an operation to recreate source data from results. So far, I have not seen these implemented in a database provenance system. This paper uncovers some interesting details on provenance systems. However, the categorization scheme, in an attempt to predict the future, includes some features that have not been and may never be implemented in any system. Therefore, in my comparison I only look at what is already there and leave prediction and potential problems to the future work section.

Finally, the primer in [14] classifies provenance information into data, transformation and auxiliary provenance. The last category I have yet to see in any database system, although storing information about the data types, users or environments can be valuable in certain contexts. In [14] the author also looks at the relationship between provenance, audit logs, and temporal databases.

3.5 Further background

The focus of this chapter is on models and provenance systems.
For the future, I plan to look further into how systems handle things like query rewrites, and the algorithms that facilitate these operations.

I look further into scientific workflow provenance in Chapter 6 and how it is explained to users, to see if there are any lessons applicable to a database context. Scientific workflow systems seem to place a greater emphasis on user experience in general; however, they lack the unified semantics of database systems. The survey in [25] suggests a close relationship between workflow and database provenance that is worth exploring. Therefore, studying the relationship between the two, starting with [1], will prove worthwhile. It would also be interesting to add a few scientific workflow systems into this comparison to see how they fare on issues of scalability and usability.

The notion of “approximate lineage” [22] in uncertain databases is worth exploring as well. The idea of “working models” [13] is similarly inspiring. The theoretical focus is shifting away from complete provenance information, mainly due to scalability concerns for large databases.
Taking in the user's perspective on what is useful and sufficient would be a great contribution to this field of research, and a first step toward building sufficient models for provenance that are not necessarily as performance intensive.

                     Lineage in       Why and       DBNotes [9]       Orchestra [19]    Perm [16]
                     warehouses [11]  where [6]
Provenance type      Lineage          Why, where    Where             How               Why, where, how,
                                                                                        transformation
Query language       SQL              DQL           pSQL              ProQL             Extended SQL
Model                Relational       Semi-         Relational with   A data sharing    Relational
                                      structured    annotations       graph
Storage              No storage       -             Relational        Relational        Relational
Provenance           Tuple            -             Attribute         Tuple             Tuple, attribute
granularity                                                                             or relation
Provenance data      No storage       -             Tightly coupled   Loosely coupled   Loosely coupled
coupling
Computing            Lazy             Lazy          Eager and lazy    Eager             Lazy or eager
provenance                                                                              (no propagation)
Querying             None             -             Query annotations Query the graph   Query the results
provenance                                          using pSQL as     using ProQL       using SQL as
                                                    first class                         relational tables
                                                    citizens
Browsing             None             -             Visualize and     None              Browse tables
provenance                                          explain
                                                    provenance

Table 3.9: Provenance systems comparison table

Chapter 4

Implementation

In this chapter I describe my work on a system implementation based on my findings in Section 3.3. From looking at the different system implementations, I came up with a number of desirable features that helped me shape my own customized system. Other approaches simply rely on the user to query and explore the results via different data manipulation languages; the models can be complex, and some query languages can be unintuitive, making things more difficult than using a relational DBMS. To that end, I developed a system based on relational DBMS approaches.
My approach differs from others in a number of areas, highlighted in Section 4.1.

One of the conclusions I drew from my findings is that provenance constitutes a large amount of information, which can have serious implications for storage and complexity of manipulation, not to mention that it can be prohibitive to users. My goal is to design a system that takes the user's perspective on all of this and to see what information is most valuable to users. With my provenance explorer, I attempt to show the information to users in a way that is manageable. This leads me to design a system where the user can easily and efficiently browse through provenance information.

Moreover, using my system I can study and experiment with provenance metadata, and evaluate the different provenance dissemination techniques and how they serve the user. My goal is to develop novel provenance browsing techniques to allow users to explore and understand provenance metadata in the context of relational databases.

In this chapter I look at the entire process of developing a provenance system. I start by listing the design principles and goals of the system and how I achieve them in Section 4.1. I list and justify the subset of SQL operations I choose to work with in Section 4.2. I explain the provenance semantics I use in Section 4.3, along with how they are implemented. I then give an overview of the system architecture in Section 4.4. I end by presenting a number of potential improvements to the system as future work items in Section 4.6.

4.1 Design principles

The system is built on the principles detailed in the related work chapter. As a reminder, I list them briefly here and provide details on how I tackled each one:

1. Support for multiple types of provenance. I start by supporting the lineage type PI-CS provided in [16]; in future work I intend to offer the user the ability to use where provenance semantics as well.

2. A query language similar to SQL.
I utilize the SQL language for querying data and provenance information alike. I also offer provenance information explicitly via an API.

3. The model should be relational. My approach can work over any relational DBMS.

4. Storage of provenance data. I give the option of either tightly coupled or uncoupled storage, with different trade-offs for different granularities and tasks.

5. The system should allow for a hybrid approach, between lazy and eager, for computing provenance. My system provides lazy provenance computing; however, in the provenance browser I give the user the ability to dig deeper into ancestors and browse provenance information.

6. Making provenance simpler for the user with different dissemination techniques. I provide provenance explanations and provenance visualizations that make it easier for the user to look at an overview whenever they want.

7. Provenance systems should support provenance information at different granularities. I provide tuple and table granularities, and the storage of provenance information changes with the level of granularity.

While there are certainly systems that attempted the first three items, my approach differs from systems like Perm [16] as follows:

• The system is implemented as an extra layer rather than by modifying an existing DBMS. This allows me to implement it over any relational DBMS with minimal modifications.

• I provide different levels of granularity depending on the user task. There are instances where table granularity is sufficient, such as looking at an overview of where the data comes from.

• I develop an API that allows users to extract the data or subsets of it. The main goal of this API is to support the visualization and analysis components, and to support custom user visualizations.

• I provide a provenance explorer as a first class component of my system.

4.2 SQL support

The goal of my system is to study the usability of provenance data, and to develop novel provenance dissemination techniques.
My goal is not to develop a comprehensive provenance system; I believe this has already been done, and the implementation of a provenance database management system is not a trivial task. To that end, I define a subset of SQL that I deem sufficient for my purposes and leave the implementation of other operators on a need-to-use basis. As it stands, I currently support the following subset of SQL operations in my system:

• Projection with renaming operations.

• Selection queries with comparison conditions.

• All join types: natural join, left join, right join, cross join, and inner join, with conditions where they are needed.

• Aggregation; all aggregation functions work. I focus on the following three in my tests: count, sum and average. Aggregation can also have selection in the HAVING clause.

I do not currently support the following operations; however, the system can easily be extended to do so.

• SET operators: UNION, DIFFERENCE and INTERSECTION. I can implement these operators on a need-to-use basis.

• LIMIT, ORDER BY, and OFFSET operators. These operators cannot be converted into relational algebra, and I believe they can be ignored when computing provenance (the same goes for TOP and similar operators), since they do not really change the query and only modify the output. I utilize other methods using the API to simulate their effects.

• The system does not currently support sub-queries. I handle sub-queries by defining views and then querying them, making it easier to reason about and track data changes.

Currently the system has sufficient capabilities to answer queries over the GLEI datasets, even when joined with other datasets (such as offshore entities or FFIEC).
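Simulating ORDER BY / OFFSET / LIMIT on the client side amounts to reordering and slicing the fetched rows after the (provenance-generating) query has run. A minimal sketch of such a helper; the function name and signature are hypothetical, not the actual API:

```python
def simulate_order_limit(rows, key_index, descending=False, limit=None, offset=0):
    """Apply ORDER BY / OFFSET / LIMIT semantics to already-fetched rows,
    leaving the provenance-generating query itself untouched."""
    # ORDER BY on the chosen column of each tuple.
    ordered = sorted(rows, key=lambda r: r[key_index], reverse=descending)
    # OFFSET then LIMIT, as in SQL.
    end = None if limit is None else offset + limit
    return ordered[offset:end]
```

For example, `simulate_order_limit(rows, 0, limit=2)` reproduces `ORDER BY <first column> LIMIT 2` without touching the query sent to the DBMS.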
The focus right now is on ASPJ (aggregate-select-project-join) queries with comparison conditions, such as:

SELECT * FROM glei, offshore
WHERE glei.entityID = offshore.entityID AND glei.country = 'United States'

and

SELECT count(LEI) AS count, Country, City FROM glei
GROUP BY Country, City

The reason for this subset is that most of the requirements for the GLEI system center around aggregation, for example: What are the countries in which the most registrations occur? Which LOUs have the highest number of registrations? In Chapter 2, I covered all the requirements for the GLEI watch system.

Query: SELECT count(id) AS count FROM employee GROUP BY department

count
2
1

Table 4.1: Results of an aggregation query without provenance

count  provenance_employee_name  provenance_employee_phone  provenance_employee_id  provenance_employee_department
2      Omar                      604111                     1234                    CS
1      John                      604101                     1254                    ECE
2      Ali                       604003                     2354                    CS

Table 4.2: PI-CS example results of an aggregation query with provenance

4.3 PI-CS Provenance type

The provenance type implemented is part of the Perm influence contribution semantics (PI-CS) [16]. It is based on lineage, with modifications that make it work for more SQL query types. This type of contribution semantics, or provenance type, gives back the results with their corresponding provenance information, or witness lists. It works by defining a relational algebra set that maps SQL operations to relational operators. The main difference between Perm relational algebra and standard algebra is the inclusion of bag and set semantics projection operators. These bag projection operators are used in all provenance generating queries.

Provenance information under PI-CS is represented as witness lists, the input tuples that were used to derive the output. This is represented by a relational table that includes the result tuples' attributes and values appended by all attributes from contributing tables, and NULLs for tuples/attributes that do not contribute to the results.
For result tuples with multiple tuples in their witness lists, the result tuples are duplicated and added to the table. Table 4.2 shows an example of the results of an aggregation query.

The authors also define a set of relational algebra rules that convert a regular relational algebra expression into a provenance generating expression; I look at the subset of those rules that I implemented in the next subsection.

Rule    Operator     Original       Converted
R1      ACCESS       R              R+ = Π_{R, R→n(R)}(R)
R2      SELECT       σ_C(R)         σ_C(R+)
R3      PROJECT      Π_A(R)         Π_{A, A+}(R+)
R4      AGGREGATE    α_{G,agg}(R)   Π_{G, agg, A+}(α_{G,agg}(R) ⋈_{G=X} Π_{G→X, A+}(R+))
R5 (a)  JOIN         R ⋈_C T        Π_{R, T, A+}(R+ ⋈_C T+)
R5 (b)  LEFT JOIN    R ⟕_C T        Π_{R, T, A+}(R+ ⟕_C T+)
R5 (c)  RIGHT JOIN   R ⟖_C T        Π_{R, T, A+}(R+ ⟖_C T+)
R5 (d)  CROSS JOIN   R × T          Π_{R, T, A+}(R+ × T+)

Table 4.3: Relational algebra conversion rules

4.3.1 Relational algebra provenance rules

The rules to convert a relational algebra query into a provenance generating query are based on the work of [16]. Table 4.3 shows the rules to convert a relational algebra expression into a provenance generating expression based on the semantics of PI-CS. The A+ symbol represents all attributes from all contributing tables. R1 shows a simple relation access converted into a provenance generating expression; in this rule a table access is converted into a selection of all attributes together with all attributes renamed into the provenance information format using the n() function. In R2 the SELECT operation is straightforward, changing the table into a provenance table. A PROJECT operation in R3 is converted to project both the required attributes A and A+ as well. Aggregation in R4 is trickier, since information cannot be propagated through this operation, so a left join is performed between the original query and a second query that selects all attributes along with the group-by attribute. The two tables are joined over the group-by attribute G; agg refers to the aggregation functions.
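The R4 rewrite can be checked directly against a DBMS. A minimal sqlite3 sketch over a hypothetical table R(a, b, c), following the rewritten aggregation query shape from [16]:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE R(a INT, b TEXT, c TEXT);
INSERT INTO R VALUES (1, 'x', 'p'), (2, 'x', 'q'), (5, 'y', 'r');
""")
# Original query:  SELECT sum(a) AS sa FROM R GROUP BY b
# R4 rewrite: left join the aggregate with a subquery that exposes the
# provenance attributes, joined over the group-by attribute b.
rows = cur.execute("""
SELECT orig.sa, sub.prov_R_a, sub.prov_R_b, sub.prov_R_c
FROM (SELECT sum(a) AS sa, b FROM R GROUP BY b) AS orig
LEFT JOIN (SELECT b, a AS prov_R_a, b AS prov_R_b, c AS prov_R_c FROM R) AS sub
ON orig.b = sub.b
ORDER BY sub.prov_R_b, sub.prov_R_a
""").fetchall()
# Each aggregate result is duplicated once per witness tuple.
for row in rows:
    print(row)
# → (3, 1, 'x', 'p')
#   (3, 2, 'x', 'q')
#   (5, 5, 'y', 'r')
```

The group with sum 3 appears twice, once per witness tuple, exactly the duplication behaviour described above.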
R5 shows the JOIN operations; A+ in this case means all attributes in all contributing tables, and the different join types work the same way, with or without the condition C. Table 4.4 contains examples of each rule applied to a SQL query.

Table 4.3 shows the rules that were implemented in my system. R5 is used for all possible join types by simply substituting the appropriate join symbol. For formal proofs that these rules work, the reader can refer to [16].

R1:
  Original:   SELECT * FROM R
  Provenance: SELECT *, R.a AS prov_R_a, R.b AS prov_R_b, R.c AS prov_R_c FROM R

R2:
  Original:   SELECT * FROM R WHERE a > 1
  Provenance: SELECT *, R.a AS prov_R_a, R.b AS prov_R_b, R.c AS prov_R_c FROM R WHERE a > 1

R3:
  Original:   SELECT a, b FROM R
  Provenance: SELECT a, b, R.a AS prov_R_a, R.b AS prov_R_b, R.c AS prov_R_c FROM R

R4:
  Original:   SELECT sum(a) AS sa FROM R GROUP BY b
  Provenance: SELECT orig.sa, prov_R_a, prov_R_b, prov_R_c
              FROM (SELECT sum(a) AS sa, b FROM R GROUP BY b) AS orig
              LEFT JOIN (SELECT b, R.a AS prov_R_a, R.b AS prov_R_b, R.c AS prov_R_c FROM R) AS sub
              ON orig.b = sub.b

R5:
  Original:   SELECT * FROM R JOIN T ON R.b = T.b
  Provenance: SELECT *, R.a AS prov_R_a, R.b AS prov_R_b, R.c AS prov_R_c,
                     T.d AS prov_T_d, T.b AS prov_T_b
              FROM R JOIN T ON R.b = T.b

Table 4.4: Examples of relational algebra conversion rules

4.3.2 Provenance attributes renaming function

The n() function shown in R1 of Table 4.3 is responsible for renaming generated provenance attributes. The naming scheme I developed renames a provenance attribute into the following form: Provenance_OriginalTableName_OriginalAttributeName. The Provenance prefix is a keyword that distinguishes provenance attributes from regular attributes, and adding the original attribute name makes it possible to identify where it comes from in the original table. Table 4.2 shows an example of how this works.

4.4 System architecture

Figure 4.1: The idea behind the system is to take a query, parse it and process it to produce a provenance generating query.

The plan is to implement a system based on the findings of the survey on database provenance, mainly inspired by [16].
The idea behind the system is to take a query and parse it to produce tokens for each context (SELECT, FROM, WHERE, GROUP BY, etc.). The parser is non-validating; any errors in the SQL query are reported back by the DBMS. The tokens are translated into a relational algebra expression. The relational algebra expression is converted into a provenance generating expression. Then it is translated back into SQL and executed by the DBMS. It is worth noting that, in case the user does not want to see the individual tuples and their corresponding provenance, I can decouple the results of a query from the provenance information and show the results' provenance information in a different table. The current approach is based on PI-CS, but the system is extensible enough to include other provenance semantics (such as where provenance). I describe the query translation and conversion in the next two subsections.

4.4.1 Query translation

Translating a query from SQL to relational algebra and vice versa is straightforward. The SQL query is parsed into a tree structure, allowing for non-sequential access to nodes. Each SQL query is translated into the following format:

σ_H(α(Π(σ_C(R1 ⋈ R2 ⋈ ... ⋈ Rn))))

where σ_H refers to the HAVING clause of a query. Translating the relational algebra expression back to SQL is also straightforward: each symbol (node in the tree) is translated into a SQL query clause with its respective operators. For sub-queries, the sub-queries are recursively constructed first before returning to the main query. Currently, sub-queries are only used in translating provenance generating aggregation queries.

4.4.2 Query converting

Queries are parsed, tagged with their respective types, SPJ or ASPJ, detailing what operations are contained in them, and finally passed to the translator. The translator converts the query into a relational algebra expression tree according to the query translation scheme in the previous subsection. The tree is then passed to the converter.
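Part of the conversion step relies on reading attribute lists from the database catalog. A sketch of such a lookup, with sqlite3's PRAGMA table_info standing in for the PostgreSQL catalog (information_schema.columns) that the actual system queries through Psycopg2:

```python
import sqlite3

def contributing_attributes(cursor, table):
    """Return the attribute list A+ for a contributing table, read from
    the database catalog rather than hard-coded into the rewrite."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk).
    return [row[1] for row in cursor.execute(f"PRAGMA table_info({table})")]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE glei(LEI TEXT, Country TEXT, City TEXT)")
print(contributing_attributes(cur, "glei"))  # → ['LEI', 'Country', 'City']
```

The returned list is what gets injected into the provenance generating expression as A+.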
The converter transforms the query into a provenance generating relational expression tree. To access the complete list of attributes for contributing tables, the system queries the database catalog and gets the list, which is then injected into the relational expression. In cases where the query transformation requires completely changing the query structure, such as aggregation, the query tree is split and then reconstructed with the sub-queries. Finally, the provenance generating relational algebra tree is passed back to the translator, translated back to SQL, and executed in the underlying DBMS.

Figure 4.2: Converted ASPJ query

Figure 4.2 shows an example of a rewritten aggregation query. The query is restructured into a left join of the original query with a new rewritten query over the group-by variables G. The system accesses the database catalog to get the attribute names A+ for the GLEI table. The same catalog access operation is used to prevent provenance attributes from appearing in query results that do not require provenance, or to prevent them from propagating to new tables.

4.5 User interface

Figure 4.3: Provenance system user interface

Figure 4.3 shows a screenshot of the main page of my system. The user can write a query and execute it, and select whether or not to generate provenance information as well as the type of provenance. In case provenance information is requested, the query results are shown both with and without provenance. The provenance generating query is also shown to the user. The user can also view tree visualizations of the original query and how it was transformed. Statistics on the execution time of both the regular and provenance queries are also provided. Finally, the user can go into the visualization component to look at provenance information more closely.

4.6 Future work

The system is designed to be extensible and efficient. However, there are still a number of ways in which the system could be improved.
The following are some items I intend to address:

• The current approach is lazy; however, I would also like to study propagation of provenance metadata throughout a transaction using the same provenance semantics.

• Complete the implementation of all SQL operators to make the system applicable to other case studies and to conduct more experiments.

• Implement other provenance types (where provenance, how provenance and transformation provenance) for different granularities.

• Experiment with capturing different information, such as the user or the operation that caused a change in the data.

Chapter 5

Experimental Evaluation

5.1 Performance and overhead

I ran a number of experiments to determine the overhead of my provenance system. I am expecting a difference in performance between running regular queries and provenance generating queries. The difference in performance could be attributed to the extra time taken to process queries and the extra time taken to access the database catalog to acquire table columns for provenance generating queries, or it could be attributed simply to executing longer, more complex queries. I ran my experiments to determine the factors that contribute the most to this overhead.

5.1.1 Experimental setup

The experiments ran on a Macbook Pro running OS X 10.10.5 with a 2.2 GHz Intel Core i7 and 4 GB of 1333 MHz memory. The underlying database management system is PostgreSQL version 9.4. The database driver is Psycopg2, since most of the code is in Python.

5.1.2 Data

Currently, as far as I know, there is no real benchmark for provenance systems. Benchmarks such as TPC-H by the Transaction Processing Council [10] can only be used with a complete system; they are geared heavily towards systems with support for subqueries. The goal of such benchmarks is also to stress test the system, which is not my interest at the moment. Therefore, I ran the experiments on synthetic data instead. The data is generated using the faker (https://pypi.python.org/pypi/fake-factory) Python package.
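A stand-in sketch of such data generation using only the standard library (the experiments themselves use faker; the function name and row layout below are illustrative):

```python
import random
import string

def synthetic_table(n_rows, n_groups=10, seed=0):
    """Generate n_rows tuples (id, group_key, payload): joinable with other
    tables on group_key and reducible by aggregating over it."""
    rng = random.Random(seed)  # fixed seed for repeatable experiments
    return [
        (i,
         f"g{rng.randrange(n_groups)}",
         "".join(rng.choices(string.ascii_lowercase, k=8)))
        for i in range(n_rows)
    ]
```

Varying `n_rows` from 100 to 100,000 then reproduces the table sizes used in the experiments.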
The data is composed of tables that can be joined over common attributes and reduced by aggregating on an attribute. I vary the size of the tables between 100 and 100,000 tuples. My goal is to measure the time overhead for a number of queries and determine where most of it comes from. The data is generated in a way that ensures the performance results generalize to our typical workloads.

5.1.3 Queries

I ran two query types: SPJ queries and ASPJ queries. The SPJ queries perform a join on a common attribute, with the join type varied to cover all possibilities. The ASPJ queries aggregate results on the most common attribute in the tuples. I also made sure, when writing the queries and generating the data, that the size of the query results in tuples would scale with the size of the tables. The queries, like the tables, are representative of our typical workloads. I ran every query 100 times on each table size and calculated the mean of the durations.

5.1.4 Results

Figures 5.1 and 5.2 show the results of each query for the varying sizes of the input tables. Figure 5.1 shows different results for different queries. The important point is that the cost of processing the query and accessing the database catalog is trivial compared to the cost of query execution at larger database sizes, which validates my approach of developing the system as a layer over the DBMS. This is supported by Figure 5.2, which shows the performance of my system without the overhead of query processing and catalog access.

In terms of space overhead, the generation of provenance data can have dramatic implications for the size of the results. This is especially true for aggregation queries. To demonstrate, Figure 5.3 shows the result sizes of three different queries running over the GLEI data (I describe the GLEI data in detail in Chapter 2).
Figure 5.1: A line plot of the queries running with the overhead of my system measured

The queries are as follows:

q1: SELECT LEI, Address, LegalName, LegalForm FROM GLEI WHERE Country = 'United States'

Agg q1: SELECT count(LEI) AS count FROM GLEI GROUP BY Country

Agg q2: SELECT count(LEI) AS count FROM GLEI GROUP BY Country, City

This is not something we can generalize from; aggregation queries give different results depending on the group-by variables and the data. The implication, however, is clear: provenance information can be prohibitively large for such datasets. The simple SELECT query does not give the full picture either. While its number of result tuples remains constant, the number of attributes in the results is much larger, going from the 4 requested attributes to the full list of attributes from the two contributing tables. The same holds for the aggregation queries, but there the large number of tuples overshadows the increase in the number of attributes.

It is therefore essential to give the user full control over when to generate provenance data. It is also necessary to summarize the results, or at least break them down, so the user is not overwhelmed by the sheer volume of information.

Figure 5.2: A line plot of the queries running without the overhead of my system measured

5.1.5 Discussion

The results demonstrate the minimal overhead of query processing for the query types I ran. They also show that the Perm [16] approach maintains low overhead even when implemented as an extra layer over the DBMS. I also verified that the only operation causing the overhead is the extra database catalog accesses, since the cost of query processing is trivial (linear) at large table sizes (>10k).
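For reference, the kind of rewrite whose overhead is being measured can be sketched at the SQL-string level. This is a hypothetical simplification — the actual system rewrites relational algebra trees and obtains the attribute list A+ from the database catalog — but it shows how an aggregation like Agg q1 becomes a left outer join of the original query with a provenance-projecting subquery over the group-by attributes:

```python
def rewrite_aggregation(table, agg_select, group_by, attributes):
    """Rewrite an aggregation query into a provenance-generating one.

    `attributes` stands in for the attribute list A+ that the real
    system fetches from the database catalog; each source attribute is
    re-exposed under a provenance_<table>_<attribute> alias, following
    the naming convention used by the provenance explorer.
    """
    prov = ", ".join(
        f"{a} AS provenance_{table.lower()}_{a.lower()}" for a in attributes
    )
    keys = ", ".join(group_by)
    # The original aggregation, with the group-by keys kept as join keys.
    original = f"SELECT {agg_select}, {keys} FROM {table} GROUP BY {keys}"
    # A witness query exposing the contributing source tuples.
    witness = f"SELECT {keys}, {prov} FROM {table}"
    on = " AND ".join(f"agg.{k} = prov.{k}" for k in group_by)
    return f"SELECT * FROM ({original}) agg LEFT OUTER JOIN ({witness}) prov ON {on}"
```

Calling rewrite_aggregation("GLEI", "count(LEI) AS count", ["Country"], ["LEI", "Address"]) produces a query that joins each aggregated tuple of Agg q1 with every source tuple that contributed to it, which is exactly why aggregation result sizes grow so sharply.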
However, the result sizes in tuples of the provenance information also raise serious questions about its usability, and reveal an interesting research direction: provenance information should somehow be reduced. I propose my provenance explorer component in the next chapter as part of the solution; there I attempt to give a summarized overview of the data itself so the user can easily navigate through large volumes of provenance information.

Figure 5.3: A barchart of the result sizes of different queries

Chapter 6

Provenance Explorer

Provenance information can increase data size dramatically. This has serious implications not just for storage but also for usability. I would like to enable users to browse through provenance information without being overwhelmed by its size, and to explore smaller portions of the data together with their provenance information. Therefore, I believe the best approach is to provide an overview or summary of the data that respects the limitations of visualizations while showing a big-picture view that users can drill down into, isolating portions of it and examining their provenance information. This is what I attempt with my provenance explorer component.

Provenance data is represented as annotations or extra attributes attached to the results. However, data often goes through a series of steps to arrive at the final results. A more intuitive approach, therefore, is to show provenance as a connected graph of the derivation history. I use this graph to summarize the path that a data item took to arrive at the query results. This summary gives an overview of provenance that the user can use to navigate the lineage, or derivation history, of a portion of the dataset or of the entire database.

Data size can overwhelm the user, but it can also strain visualizations: in information visualization, visual idioms are limited in the size of their inputs.
Visu-alizing large graphs or large trees is a challenge and a different research problemthan what I am trying to solve. Therefore, my focus is on visualisations that caneffectively show an overview without going further into showing a large number of55data items.In this chapter I look at my provenance explorer component. In Section 6.1I briefly look at the provenance browsers implemented in other systems. In Sec-tion 6.2, I identify a set of tasks users would like to do when interacting with mysystem. Section 6.3 outlines the visualization principles I followed to design mysystem. Section 6.4 gives an overview of the currently available functions. In Sec-tion 6.5 I look at how the provenance explorer is implemented. To show how a userwould use the provenance explorer I present a case study with the GLEI Watch sys-tem and data. I explore the results in Section Provenance browsersWhile database provenance propose very limited utilities for visualizing and guid-ing the user through provenance information, provenance browsers are evident inthe field of work-flow provenance. While they bear different semantics that arenot as unified as those in database provenance, they still offer insight into how tovisualize such data and make it more manageable for the user.In MyGrid [27] [2], provenance information is captured automatically in prove-nance logs when a workflow is executed. The logs contains such things as the ser-vices invoked, their parameters, and the data products used and derived. Semanticannotations are applied to work-flows and services manually. These annotationsare carried over to end data products and used to link relevant pieces together. Us-ing the semantic web browser users can browse through the semantically linkedprovenance information and reason about provenance related problems. 
The interface is like a web page, with hyperlinks to relevant information.

In [2], the authors present a provenance browser that shows users the process that generated a piece of data. The provenance is represented as a dependency graph of the workflow, and the graph can be queried using QLP, a graph query language. The provenance is recorded by a scientific workflow system called Kepler [20]. In Kepler, the processes that change the data (called actors) take XML files as input and alter them. These processes and data changes are recorded as provenance traces. The browser enables users to navigate, or to write QLP queries, to look at different aspects of the workflow graph.

Figure 6.1: The interface of the provenance browser from [2]

In DBNotes [9], a database provenance system, a simple visualization of the provenance of a value can be displayed to the user. This visualization shows where the value came from and traces it back to the original source. It also shows the transformations, or queries, that produced the final results. I looked at DBNotes in more detail in Chapter 3.

These systems and others offer insight into what it takes to visualize provenance, or at least into dissemination techniques other than simply asking queries and looking at results. However, the complexity of the problem in database provenance comes from the fact that provenance information is huge, and most of it is stored in tables, either coupled with or decoupled from the data items themselves. In my case it is stored as witness lists alongside the data. I look next at my approach and at how I handle this challenge.

Figure 6.2: The visualization of the provenance of a value in a table in DBNotes.
From [9]

6.2 User tasks

I define two categories of tasks and try to identify the tasks that a user would perform using the provenance explorer.

6.2.1 Data related tasks

Since the amount of provenance information is too big for users to browse through, and the data itself can be too large to handle, I offer an initial visualization of an aggregated and/or filtered overview of the data. Through this visualization, users can perform the following tasks:

• Aggregate or filter the data by writing SQL queries.

• Filter or search the data further, with its respective provenance information.

• Navigate and select individual data items.

• Edit the data, annotate it, or export subsets of it to files along with the relevant provenance information.

Figures 6.3 and 6.4 show some of these functions as implemented in the explorer.

Figure 6.3: The data visualizations with search, filter and sort options

6.2.2 Provenance information tasks

Interacting with the visualization of the data, users can also examine the provenance information and perform any of the following:

• Click on a piece of data to see its provenance information.

• Drill down to the original source of the data.

• Ask for the provenance to be explained: an overview of the process of generating the provenance information through query rewriting is displayed as tree visualizations.

• See where they are in the derivation history of an item through the provenance graph.

• Browse through provenance using a provenance graph that shows all ancestors of a subset of the data.

These tasks can be seen in Figures 6.8, 6.6 and 6.7, which show screenshots from the system.

Figure 6.4: Selecting an individual data item to see its provenance

Figure 6.5: The provenance data tables with the provenance graph

Figure 6.6: Drilling down to the ancestor of the table in Figure 6.5.
Clicking on the provenance graph lets the user navigate the derivation history of the displayed provenance table; the graph also works as a navigation tool, letting the user know where they are in the lineage of a data item

Figure 6.7: A visualization of the query and the rewritten provenance query. Explaining provenance to the users is done by showing them the queries along with a visualization of the original and the provenance-generating relational expressions

6.3 Visualization principles

My design of the provenance explorer component is guided by the user tasks identified in Section 6.2 as well as by a set of visualization principles. Here I list a few of the principles I adhered to in order to improve the effectiveness of my visualizations:

• Overview first, zoom and filter, then details on demand [24]. This principle is essential for dealing with large-scale information. The main goal of my visualizations is to show the user a summarized view of the data (usually an aggregation). The volume of data is usually large enough without at least doubling it with provenance information. The user browses an overview they create themselves by specifying an aggregation or filtering query; they can then zoom in on specific tuples and examine their provenance information. Provenance information can also be exported in small sets in different formats (JSON or CSV).

• Eyes beat memory [21]. Navigating through data tables and drilling down can confuse the user as to where they are in the system. Keeping track of the user's position via visual elements beats having to remember it. Hence I show a persistent overview, in the form of the provenance graph, that highlights which level the user is currently browsing. This helps keep track of where the user is at all times and limits the cognitive load of having to remember it.

• Multiple views are most effective when explicitly linked [23].
The user is given the option to use multiple views at once; highlighting an element in one view (a bar in the bar chart, for example) highlights the same item in another view (a country on the map, for example).

• I use aggregation and filtering to reduce the data. Visual idioms are limited in the number of items they can display [21]. Reducing the data guarantees better scalability for large datasets and more efficient visualizations.

6.4 Overview

The task at hand is to visualize provenance information that is stored in relational tables side by side with their respective tuples. The data itself can be large and overwhelming, and the provenance information multiplies its size. The user needs to see the information divided into smaller subsets, and the provenance information should stay hidden until the user asks for it.

I therefore offer two components to facilitate browsing through provenance information: the data visualization and the provenance graph. In the data visualization, the data itself is summarized and the user is presented with a manageable overview. From this overview, the user can click on any data item and see its provenance results; this way the user only looks at a subset of the information at a time. The provenance graph gives an interactive overview of the derivation history of a piece of data and lets the user see where they are in that history. I go into more detail on both in the next subsections.

6.4.1 Data visualization

I enable the visualization of the dataset through the following process: load the dataset or query the database for results; filter or aggregate the dataset to get a manageable subset; then visualize the data using one of the following idioms: a treemap for hierarchical or aggregated data, a graph for network data, a geographical map for data containing geographical locations, and bar or line charts for quantitative data.
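This load, reduce, and choose-an-idiom process can be sketched as follows; the helper names and category labels are hypothetical, not the system's actual API:

```python
from collections import Counter


def aggregate(rows, key):
    """Reduce a list of row dicts to counts per value of `key`."""
    return Counter(row[key] for row in rows)


def pick_idiom(kind):
    """Map a characterization of the data to a visual idiom,
    mirroring the choices listed above."""
    idioms = {
        "hierarchical": "treemap",
        "network": "graph",
        "geographic": "map",
        "quantitative": "bar/line chart",
    }
    # Data that fits none of the categories falls back to a plain table.
    return idioms.get(kind, "table")
```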
This visualization step is completely optional; data that does not fit any of the above criteria can simply be displayed as a table.

The user can then examine the visualization or data table and zoom into or select a subset of it. The user can click on any data item to see its provenance results, and can drill down on any tuple in the provenance to follow its provenance all the way to the original source. Optionally, the user can also see the provenance of an entire table, which can be a reasonable approach for smaller datasets.

Figure 6.8: The derivation history of GLEI data from origin to the final aggregated view

6.4.2 The provenance graph

The derivation history of a piece of data can involve multiple tables and multiple transformations. Representing this history as a provenance graph comes naturally and is similar to previous efforts such as [9] and [2]. Interacting with this representation, the user can browse through the history and display information based on where they click. The graph also works as a navigation guide, letting users know where they are in the derivation history of a data item.

Figure 6.8 shows a provenance graph of the GLEI dataset. The graph shows the path from the original source of the data all the way to the final query results. It helps users keep track of where they are among the provenance levels and enables them to browse through it. The graph persists throughout the visualization, offering an overview that is always present. It is worth noting that the graph only shows the derivation history of the data; it offers no information on dependencies or relationships between tables or views.

6.5 Provenance Explorer implementation

The provenance explorer is a web-based interface to the system. It is powered by my API and by D3 [5] visualizations. D3 allows for scalable, efficient, and interactive visualizations, and enables me to attach data to visual elements in the HTML DOM (Document Object Model).
I export the data or query results as JSON objects that I later map to DOM elements. I also attach information such as table names and attribute names to the HTML elements. This mapping enables me to query and present provenance information for each data item, so the user can click on a data item to request its provenance information.

To construct the provenance graph, I derive the table names in the history of a piece of data from the provenance attributes in its tables. Provenance attributes are named as follows: provenance_tableName_attributeName. The tableName part of the attribute gives the table names to add to the graph; drilling down into a table named there exposes that table's own provenance attributes, and with them its own ancestors. Constructing the graph is done through queries to the database catalog that collect the relevant table names. This happens in the back-end system, and the output is sent to the explorer as a JSON tree.

The user can also change a data item to update it. The provenance information, however, can be preserved in cases where the original data owners do not want it altered.

I also offer users services to export a subset of the data with its provenance for further analysis or custom visualizations. This data is offered in JSON, CSV or XML format.

6.6 GLEI case study

I looked at the GLEI Watch System and its data in detail in Chapter 2. My interest is in the GLEI system as an example of a curated database. Users can look at the data, clean it, modify it, or annotate it as they see fit. The provenance information should be preserved as it comes from the original system; maintaining it guarantees that changes to the data can be verified by future users. I am also interested in assuring users that they can trust the data presented by the GLEI Watch system, convincing them that it has not been altered in ways that lead to errors and that no important data items were dropped or erased.
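To make the graph construction of Section 6.5 concrete, here is a minimal sketch of recovering ancestor table names from provenance attribute names. This is a hypothetical helper that assumes table names contain no underscores; the real system resolves names through database catalog queries:

```python
def ancestor_tables(attribute_names):
    """Collect ancestor table names encoded in provenance attributes
    of the form provenance_<tableName>_<attributeName>.

    Assumes table names contain no underscores; order of first
    appearance is preserved and duplicates are dropped.
    """
    tables = []
    for name in attribute_names:
        if name.startswith("provenance_"):
            table = name.split("_")[1]  # provenance_<table>_<attribute>
            if table not in tables:
                tables.append(table)
    return tables
```

Applied recursively while drilling down, a helper like this yields the tree of table names that the explorer receives as JSON and renders as the provenance graph.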
Allowing users to inspect the original and intermediate sources can provide that assurance.

In addition to the tasks highlighted in Section 6.2, the GLEI Watch system is intended to answer the following queries:

• What are the cities in which the most registrations occur?

• What are the cities with the highest number of registrations?

• Which LOUs have the highest number of registrations?

• What is the pattern of registrations over time, by country, by city, and by registration authority?

• What is the proportion of the types of legal entities (legal forms) being registered, by country, by city, and by registration authority?

• What companies have been registered?

• Where are the companies registered?

Providing an overview of the data in the shape of a geographical map is the natural way to represent this information and answer some of these queries. Most of the queries are also aggregation queries. I offer the user an overview of the data as a geographical map of countries annotated with registration counts.

I also offer a linked bar chart view showing the number of registrations in each country. The bar chart and map views are linked: hovering over a bar chart item highlights the corresponding country on the map. Clicking on a country shows the provenance results, explaining the number of companies in that country by displaying the provenance information of the aggregation operation. Users can drill down further, all the way to the origin of any data item.

Figure 6.9: A world map with aggregated count of registrations in a country

Alternatively, to answer queries 2, 3, and 4, the user could opt for a count aggregation by city, LOU data source, or legal form. In such a case, the user could opt for a treemap representation of this aggregation data.
Figure 6.10: A barchart showing aggregated count of registrations in a country

The provenance information would explain the aggregation and show the user the individual tuples that make it up.

Figure 6.11: A treemap with aggregated count of registrations in a country

The remaining queries involve detecting patterns and trends in the data. Using line charts to show patterns in time-series data is standard practice. Enhanced with provenance information, the user can see where every data item comes from and trace it all the way back to its origin.

The purpose of applying the provenance explorer to the GLEI data is to ensure that users can trace every change that has been applied to the data. The GLEI Watch system already guarantees freely available data, visualizations, and total transparency. Provenance information adds an extra layer of transparency, showing users every step the system owners took to alter the data.

Chapter 7

Conclusions

In this thesis, I have looked extensively at provenance research in the database field. My findings are summarized as a classification scheme of provenance approaches, and I identify and list the most desirable features of a provenance system.

I also developed a system based on my findings on top of a relational DBMS. I show the minimal overhead of my approach, making it well suited to the experiments I carry out and a good fit for the GLEI Watch system.

I used this system to develop a provenance explorer, a component for visualizing query results and exploring provenance information. I applied the provenance explorer to a curated dataset of financial data to demonstrate how it works and to give a case study of its uses. I believe that working on a solution that makes provenance information more manageable for users fills a gap in this area and is a valid contribution to this field of research.

There is still a lot to be done in this field before a single approach can be deemed complete.
The size of the provenance data is still too large to handle, and scalability is a real concern. I feel that solutions in the vein of my provenance explorer are a step in the right direction. I intend to extend my system and my visualizations to support more general use cases, and to validate my system through a thorough quantitative user study in the future. Other ideas for extensions include assisting the user in formulating queries that summarize the data, limiting the attributes used in provenance generation, and extending the query rewrites to support features such as guaranteed querying of only trusted sources, or including only trusted sources in provenance information.

Bibliography

[1] U. A. Acar, P. Buneman, J. Cheney, J. Van Den Bussche, N. Kwasnikowska, and S. Vansummeren. A graph model of data and workflow provenance. In TaPP, 2010.

[2] M. K. Anand, S. Bowers, and B. Ludäscher. Provenance browser: Displaying and querying scientific workflow provenance graphs. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 1201–1204. IEEE, 2010.

[3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. The VLDB Journal, 14(4):373–396, 2005.

[4] S. Bistarelli, U. Montanari, and F. Rossi. Semiring-based constraint logic programming. In IJCAI (1), pages 352–357, 1997.

[5] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. Visualization and Computer Graphics, IEEE Transactions on, 17(12):2301–2309, 2011.

[6] P. Buneman, S. Khanna, and T. Wang-Chiew. Why and where: A characterization of data provenance. In Database Theory, ICDT 2001, pages 316–330. Springer, 2001.

[7] K. K. Chan and A. Milne. The global legal entity identifier system: Will it deliver? Available at SSRN 2325889, 2013.

[8] J. Cheney, L. Chiticariu, and W.-C. Tan.
Provenance in Databases: Why, How, and Where. Now Publishers Inc, 2009.

[9] L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. DBNotes: A post-it system for relational databases based on provenance. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 942–944. ACM, 2005.

[10] T. P. P. Council. TPC-H benchmark specification. Published at http://www.tpc.org/hspec.html, 2008.

[11] Y. Cui and J. Widom. Lineage tracing in a data warehousing system. In Data Engineering, 2000. Proceedings. 16th International Conference on, pages 683–684. IEEE, 2000.

[12] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems (TODS), 25(2):179–227, 2000.

[13] A. Das Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 7–7. IEEE, 2006.

[14] B. Glavic. A primer on database provenance. 2014.

[15] B. Glavic and K. R. Dittrich. Data provenance: A categorization of existing approaches. In BTW, volume 7, pages 227–241. Citeseer, 2007.

[16] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for efficient generation and querying of provenance information. In In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.

[17] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 31–40. ACM, 2007.

[18] P. Hanrahan, C. Stolte, and J. Mackinlay. Visual analysis for everyone. Tableau White Paper, 4, 2007.

[19] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance.
In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 951–962. ACM, 2010.

[20] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. B. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 2006.

[21] T. Munzner. Visualization Analysis and Design. CRC Press, 2014.

[22] C. Ré and D. Suciu. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment, 1(1):797–808, 2008.

[23] J. C. Roberts. State of the art: Coordinated & multiple views in exploratory visualization. In Coordinated and Multiple Views in Exploratory Visualization, 2007. CMV'07. Fifth International Conference on, pages 61–71. IEEE, 2007.

[24] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages, 1996. Proceedings., IEEE Symposium on, pages 336–343. IEEE, 1996.

[25] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. ACM SIGMOD Record, 34(3):31–36, 2005.

[26] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. Technical Report, 2004.

[27] J. Zhao, C. Goble, R. Stevens, and S. Bechhofer. Semantically linking and browsing provenance logs for e-science. In Semantics of a Networked World: Semantics for Grid Databases, pages 158–176. Springer, 2004.

