UBC Theses and Dissertations


Managing data updates and transformations : a study of the what and how. Wong, Jessica Hei-Man, 2016.



Managing Data Updates and Transformations: A Study of the What and How

by

Jessica Hei-Man Wong

B.Sc., The University of British Columbia, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

April 2016

© Jessica Hei-Man Wong, 2016

Abstract

Cleaning data (i.e., making sure data contains no errors) can take a large part of a project's lifetime and cost. As dirty data can be introduced into a system through user actions (e.g., accidental rewrite of a value or simply incorrect information), or through the process of data integration, datasets require a constant iterative process of collecting, transforming, storing, and cleaning [20]. In fact, it has been estimated that 80% of a project's development and cost is spent on data cleaning [12].

The research we are undertaking seeks to improve this process for users who are using a centralized database. While expert users may be able to write a script or use a database to help manage, verify, and correct their data, non-computer experts often lack these skills, and thus trawling through a large dataset is no easy feat for them. Non-expert users may lack the skills to effectively find what they need, and often may not even be able to efficiently find the starting point of their data exploration task. They may look at a piece of data and be unsure of whether or not this piece of data is worth trusting (i.e., how reliable and accurate is it?).

This thesis focuses on a system that facilitates this data verification and update process to help minimize the amount of effort and time needed to clean the data. Most of our effort concentrated on building this system and working on the details needed to make it work. The system has a small visualization component designed to help users determine the transformation process that a piece of data has gone through.
We want to show users when a piece of data was created, along with what changes users have made to it along the way. To evaluate this system, an accuracy test was run on the system to determine if it could successfully manage updates. A user study was run to evaluate the visualization portion of the system.

Preface

The system discussed in this thesis was initially based on the design created in the thesis work of Arni Mar Thrastarson (2014). Subsequent tweaks to the system design and the resulting implementation were done by the author, J. Wong. No part of this thesis has been published. This thesis is an original intellectual product of the author.

The two datasets used in the experiment section of this thesis (Chapter 5) were obtained from two sources. The GLEI data was obtained from the work of V. L. Lemieux, P. Phillips, H. S. Bajwa, and C. Li. The bird data was obtained from J. Jankowski.

Part of the "Related Work" chapter is from the class project completed for CPSC 504 in collaboration with Laura Cang and Kailang Jiang. Specifically, the "Transformation Languages" and "View Maintenance" sections within Chapter 2 were adapted from the final paper written for CPSC 504.

The user study discussed in Chapter 6 has undergone ethics approval by the UBC Behavioural Research Ethics Board committee. The project title was "Understanding Data Changes" and the approval certificate given to this study has the identifier H16-00331.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
  2.1 Database Management Systems
    2.1.1 Collaborative Data Sharing Systems
    2.1.2 Examining Data Changes
    2.1.3 Probabilistic Databases
  2.2 Data Provenance
  2.3 Transformation Languages
  2.4 View Maintenance
  2.5 Transformation Visualizations
3 System Introduction
4 System Internals
  4.1 Transformation Script
  4.2 UUIDs
    4.2.1 Dataset UUIDs
    4.2.2 Row and Column UUIDs
  4.3 Database Architecture
  4.4 Data Provenance
  4.5 Trust Values
  4.6 User Actions
  4.7 Requirements for Exporting Data
  4.8 Restrictions on User Actions
    4.8.1 Add a Column
    4.8.2 Remove a Column
    4.8.3 Add a Row
    4.8.4 Remove a Row
    4.8.5 Edit a Value
  4.9 Pushing Data Changes to the Server
  4.10 Data Transformations on Non-locally Stored Data
  4.11 Visualization
  4.12 Generating a Transformation Script from Data Snapshots
5 Experiment
  5.1 Data
    5.1.1 GLEI
    5.1.2 Bird Data
  5.2 Technical Infrastructure
  5.3 Testing for Correctness
6 User Study
  6.1 Participants
  6.2 Methodology
  6.3 Results
    6.3.1 Two Factor Repeated Measures ANOVA
    6.3.2 Trust Differences
    6.3.3 Results Discussion
7 Future Work
  7.1 Intelligently Inferring Where to Add a Column
  7.2 Inferring the Transformation Script
  7.3 Achieving Full Synchronization
  7.4 Scalability
  7.5 Determining a Good Conflict Time Measure
  7.6 Automatically Adjusting Trust Values
  7.7 Visualization Improvements
  7.8 Automatically Pushing Changes
8 Conclusion
Bibliography
A Demographic Questionnaire
B Task Questionnaire

List of Tables

Table 4.1  The data transformations logged to the transformation script as the user makes changes to the data.
Table 4.2  Movie database example 1.
Table 4.3  The columns within a data provenance table.
Table 4.4  Movie database example 2.
Table 4.5  Result for selecting all the information about Pixar along with any movies produced by Pixar which received a user rating of 5/5.
Table 4.6  The list of transformations that would be generated when you add a new row to Table 4.4.
Table 6.1  Basic information about the participants who participated in the user study.
Table 6.2  Averages of the quantitative data collected from the user study.
Table 6.3  Mean and standard deviation for the time to task completion. Although we had 15 participants, N = 9 because not all participants managed to successfully time all their tasks.
Table 6.4  ANOVA table generated when determining if there were significant differences in time to task completion.
Table 6.5  Mean and standard deviation for the degree that participants trusted the data that was shown. Trust was ranked on a scale from 1 to 5, where 1 was very unlikely and 5 was very likely.
Table 6.6  ANOVA table generated when determining possible significant differences in how much user study participants trusted the data they were shown.
Table 6.7  Mean and standard deviation for how easy participants found the tool to use when trying to determine how much to trust the data shown on the screen.
Table 6.8  ANOVA table generated when trying to determine if significant differences existed in how easy it was for user study participants to examine the data.

List of Figures

Figure 3.1  What the user interface would look like once the user has loaded the data to be viewed.
Figure 3.2  The first screen the DBA sees as he/she prepares to get the subset of data needed by the domain expert user.
Figure 4.1  The red box shows the "Record Changes" checkbox that determines whether or not changes will get synced to the database.
Figure 4.2  If the "Record Changes" checkbox is checked, any changes that are made will be recorded in a transformation script that will be used later to push changes to the database. On the right side of this figure, we show the transformation script for changes that a user has made using the user interface on the left side of the figure.
Figure 4.3  What the user UI would look like when the user starts to add a column.
Figure 4.4  The result of performing the add column action shown in Figure 4.3.
Figure 4.5  What the user sees when he tries to add a row. There is no restriction on needing to fill all the columns shown unless a column has been specified as not null.
Figure 4.6  The result of having performed the add row action from Figure 4.5.
Figure 4.7  How the user would sync changes to the database using the user interface.
Figure 4.8  What the user sees if he/she is in the "Year" view and has clicked on a bar to view changes that have occurred on that day.
Figure 4.9  What the user sees if he/she is in the "Month" view and has clicked on a bar to view changes that have occurred on that day.
Figure 4.10  What the user sees if he/she is in the "Day" view and has clicked on a bar to view changes that have occurred on that day.
Figure 5.1  The structure of the GLEI data within the MySQL database.
Figure 5.2  The structure of the bird data from 2011.
Figure 6.1  The timer interface that user study participants use to time task completion.
Figure 7.1  A possible design iteration on the system visualization that tries to show users data characteristics that user study participants have deemed useful.
Figure 7.2  Progression of how a change would start to appear in the proposed new system visualization design.
Figure 7.3  How multiple changes would appear in the new proposed system visualization design.

Acknowledgments

I would like to start by thanking my sister, Jennifer, for her unwavering support at all hours of the day and for her initial encouragement to go into computer science. From being my midnight snack buddy to my coding rubber ducky, this thesis would have been so much harder without her. I would also like to thank my parents for being so supportive and for giving me the courage to pursue what interests me. Even though to this day they may not understand what exactly I spend my time working on, they have never ceased to be encouraging and supportive.

My immense gratitude goes to my supervisor, Rachel Pottinger, for being such an amazing mentor and supervisor to me. She has been a constant source of guidance and encouragement, and through her, I've learned so many skills ranging from pumpkin carving to how to do research.
She has given me the freedom to try and participate in so many different things, and that has made all the difference in how much I've enjoyed my Master's.

I would also like to thank my second reader, Ed Knorr, for encouraging me to try research and for being so supportive throughout every step of my Master's. Without his initial encouragement, I would very likely not have even considered grad school.

Finally, I would like to thank my friends for all their support. These past two years have been so special to me, with so much laughter and cake; there is nothing about that I would change. Special thanks goes to my lab-mate Omar AlOmeir, who has always been willing to listen to my ideas and offer suggestions for improvement.

Chapter 1

Introduction

Data has always been an integral part of everyday life. From whether or not to purchase an item, to issues like improving student success [29], to business intelligence applications that help companies make million-dollar decisions [19], data is the essential ingredient needed to drive what happens. Data-driven decision making is the common theme that reappears throughout many different situations, and its importance cannot be overstated. The everyday user now has more data available at their fingertips than ever before.
From data sources published by local governments (e.g., DataBC) to genome data published by various research institutions, an astronomical amount of information is now being stored and made available.

While the public seems to have a voracious appetite for analyzing and digesting large datasets (as evidenced by the rapidly increasing popularity of websites such as reddit.com's data visualization subreddit, tools such as Tableau, and the large number of page hits given to Wikileaks), a large portion of the general public lacks the skills needed to traverse, verify, and/or clean large datasets effectively. While technologically savvy users can handle writing small scripts or programs to clean data down to what they need, there is a significant portion of users who cannot effectively digest the information they are given.

On top of the data navigation problem, non-technical users may be frustrated by the problem of dirty data (i.e., data that is incorrect). If the data traversal problem is considered a slight hindrance to users, the dirty data problem can cause a complete standstill in the progress of a project. Before any meaningful conclusions can be drawn, the data must first be deemed trustworthy and accurate to the best of the user's knowledge. This constraint holds true for all users regardless of their technical expertise. Determining whether or not to trust certain pieces of data often poses a problem. Although one could manually fact-check data, it is practically impossible to do so for large datasets. For a group of individuals who share a common dataset (e.g., a research lab), it is mutually beneficial for everyone to share the fruits of their laborious data cleaning process. It would cut down on the amount of work needed per person, and everyone could benefit from the cleaned data.

Data cleaning is a problem that appears in many different situations, ranging from a traditional single database to many different data sources.
It is not a problem that can be ignored, nor is it a problem that is resolved after one intense round of editing and cleaning. While there are algorithms that can detect data abnormalities, user judgement is often needed to determine the correct value given the conditions of the data and the parameters of the question. Thus, we need some way to help these users manage their data updates and transformations.

In this thesis, we assume that there is a central repository where data is stored, and that many users will make local copies of this data. Our goal is to help both sets of data stay synchronized and up to date. Any missing values in a row are treated as purposefully missing (i.e., the values are treated as if they have not been entered on purpose). We want to help non-technical users understand a data point's evolution from the time of its introduction to the dataset to the current moment of examination. In this work, we make the assumption that the domain expert is a different person from the database administrator (DBA), and that the domain expert only works with a subset of the dataset that is stored on his/her local machine. The domain expert is responsible for making changes to the dataset given his/her knowledge, and the DBA is responsible for the technical aspects of data maintenance.

This thesis builds on the theoretical framework created by Arni Thrastarson in his M.Sc. thesis [40] and focuses on implementing a system that helps users manage their data updates and transformations. Although we started with a theoretical framework, a lot of the practical implications of what it means to truly support update propagation developed slowly over time. While there are comparable data management systems such as Trio [1] and Orchestra [18], which also address the problem of multiple individuals collaborating, both systems do so under a different set of circumstances than we do.
Orchestra works with a system of distributed databases and tackles managing data updates and transformations from that viewpoint, while Trio focuses on situations where the data currently stored in the database may or may not be accurate (i.e., users are not certain that the data they see in the database is reliable to work with). There is currently no system that looks at the problem of collaborative data improvement with the intent of helping users work with their data in a centralized data environment.

A large portion of the work in this thesis focuses on how to help users understand and work with their data, based upon the system design decisions that we made after examining similar problems and the relevant solutions in other data management systems. When it came time to figure out how best to help users understand their data changes (both with the visualization (Section 4.11) and the transformation script (Section 4.1)), many logical data transformation frameworks and systems like Wrangler [24] were examined and compared against each other to create a transformation script that would be useful to users. Other contributions include a discussion of the various challenges encountered and explanations of the design decisions made during the implementation of the system. The resulting system was also tested for correctness (Chapter 5), and a user study was run (Chapter 6) to examine how users reacted to using the information provided by the system in their data analysis tasks.

The four particular data usage scenarios that this thesis focuses on were chosen as representative use cases that would be encountered by typical users. These four scenarios were chosen partly because of the work done in [40], but also because we felt that these were scenarios that show up in everyday life.
For example, a large part of this thesis deals with bird data obtained from one of the UBC Zoology labs (Section 5.1.2), and during the process of understanding the bird data, we came to the conclusion that these four scenarios do indeed appear within the zoology lab's data collection workflow. The four scenarios are as follows:

1. When changes are made locally (i.e., to a copy) on the user's machine. In the context of the bird data, this happens quite often when teams go out into the jungle to capture and measure birds. As the middle of the jungle does not tend to have readily available Internet, teams often collect lots of data that is stored locally on their machines until they get the opportunity to send the information back to UBC.

2. When changes are made on the central database and need to be propagated to the user. With any team that is working collaboratively, changes often occur at many different times of the day. In the case of software development, this may take the form of code commits to Git or another repository, where team member A has made some changes to the code that team member B has to download and retrieve.

3. When changes are made on the database and the data the user is viewing has to be updated. For time-critical data (e.g., stock price changes over the course of a day, or elections data as results roll in from across the country), it makes more sense to view data directly from a server as opposed to downloading a copy. In these cases, it makes more sense for users to view data directly from a source and store any work they are basing on this data locally. For example, a journalist reporting on election night may aggregate statistics about something not commonly reported on. Election results can change rapidly as each electoral district reports its results; so, it does not make sense for the journalist to continually download a new copy of the results every five minutes or so.
The notes that the journalist is making do not need to be pushed to the database, as they have nothing to do with the actual election results; therefore, it is reasonable for her to simply store the changes on her local machine. Hence, you can sometimes have situations where you are viewing data directly from a database and making local changes on your machine.

4. When generating a list of changes by comparing two different versions of a dataset. This need may arise when users end up with two versions of a document and need to determine the differences between them. An example of this would be when a group of students are working together on a term paper but have decided to email each other copies of the term paper as they work on it together. Two or more team members may work on the document at the same time, and thus the team may end up with multiple copies of the paper that are all slightly different from each other. To determine which copy of the paper to hand in, the team will have to determine what has changed from one copy to another. They could use Microsoft Word's "Compare" feature with two versions of their document to come up with the changes. However, Microsoft Word only works with a certain type of data.
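For tabular data outside a word processor, a change list can be derived directly by comparing two snapshots of the same table. The following is a minimal sketch of that idea (the row-keyed dictionaries and the message format are our own illustrative assumptions, not the system's actual output format):

```python
# A minimal sketch of deriving a change list from two snapshots of the same
# table, keyed by a stable row identifier. The dictionary layout and message
# strings are illustrative assumptions only.
def diff_snapshots(old, new):
    """Compare two {row_id: row_dict} snapshots and list the changes."""
    changes = []
    for row_id in old.keys() - new.keys():
        changes.append(f"removed row {row_id}")
    for row_id in new.keys() - old.keys():
        changes.append(f"added row {row_id}")
    for row_id in old.keys() & new.keys():
        for col in new[row_id]:
            if old[row_id].get(col) != new[row_id][col]:
                changes.append(
                    f"edited {col} of row {row_id}: "
                    f"{old[row_id].get(col)!r} -> {new[row_id][col]!r}")
    return sorted(changes)


old = {"r1": {"species": "sparow"}, "r2": {"species": "robin"}}
new = {"r1": {"species": "sparrow"}, "r3": {"species": "finch"}}
print(diff_snapshots(old, new))
```

This snapshot-comparison approach only needs the two versions themselves, not a log of the intervening edits, which is exactly the situation in this scenario.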
In this case, the team could use Microsoft Word to help, but for data in a non-Microsoft Word format, we will need to come up with a list of changes to the data in some other way.

In this context, this thesis makes the following specific contributions:

• Examines issues related to managing data updates in a centralized data environment
• Identifies restrictions that a centralized database environment would have to adhere to in order to manage data updates
• Determines the metadata necessary for providing enough information to handle managing data updates for multiple individuals
• Creates a data transformation language to represent changes without being specific to any query language
• Implements a system to manage data updates using a centralized database environment
• Determines the metadata that both expert and non-expert users would need to understand and trust data

The remainder of this thesis is organized as follows:

• Chapter 2 discusses related work.
• Chapter 3 introduces and motivates the system that was built to manage data updates.
• Chapter 4 examines the various design decisions that had to be made in order to create the system.
• Chapter 5 discusses how system correctness was determined.
• Chapter 6 describes the user study that was conducted to determine the usefulness of the visualization (Section 4.11).
• Chapter 7 discusses future directions of this work.
• Chapter 8 concludes the thesis with a brief summary of what was done and the impact it has.

Chapter 2

Related Work

This thesis examines the problem of how to help multiple users, whether they be expert or non-expert computer users, manage their data updates and transformations. We first examine other data management systems that try to address questions similar to those we are examining in this thesis.
After this discussion, we examine data provenance in the literature, since a large part of managing data updates and transformations involves determining what type of provenance to store and how best to manage it. The next section looks at transformation languages, as they are central to how we represent data updates and transformations in this thesis. We also briefly look at view maintenance, as it is one of the central problems we face when trying to deal with users who have local copies of data and have not pulled new changes from the database. We conclude this chapter by examining other tools that have tried to tackle the problem of data transformation visualization. In particular, we try to help users understand their data by showing them a visualization of the changes their data has gone through.

2.1 Database Management Systems

A database management system (DBMS) is a set of programs designed to help users organize and manage their data so that operations such as querying and updating can be done correctly, efficiently, and concurrently with other users. In this thesis, we develop a DBMS that helps users manage updates and transformations to their data when one or more users work from data stored in a centralized relational database. Various DBMSs have been developed to help tackle the problem of helping people use data. However, these systems tackle the problem from a distributed databases standpoint (Section 2.1.1) or from a point of view where all changes should be tracked and explained (Section 2.1.2). This thesis brings these two types of DBMSs together so that users can manage their updates and examine their data using data provenance.

2.1.1 Collaborative Data Sharing Systems

A collaborative data sharing system (CDSS) is a system that connects a set of disjointed databases, each with its own schema and local data, to allow users to share data [18].
The main idea behind a CDSS is to allow users to share data loosely (i.e., a database can have data which differs from another database) [17]. Each database in the CDSS model has its own schema and uses schema mappings, a method of specifying which table a column is from, to relate its own data to other databases. In a CDSS, the database stores its own schema mappings to allow data update propagation to occur. In our work, schema mappings are stored on the client side, which presents a different set of problems, as the amount of schema mapping information can be more limited. Whereas schema mappings in a database have access to virtually all the data in the CDSS, schema mappings that are exported from the database and stored on the client side do not.

Orchestra [23] is the main CDSS that we will examine. Although Orchestra has been designed for a distributed system, it also focuses on trying to lower the barriers standing in the way of data sharing and collaboration between multiple users. Many of the concerns raised by Orchestra are analogous to issues we face in this thesis. However, as Orchestra focuses on data update management in a distributed system and we focus on data sharing in a centralized database environment, there are differences in the main concerns that we each face. For example, Orchestra has a focus on ensuring that data is not lost through hardware failure and therefore replicates/stripes the data in a way that allows for data recovery in the case of hardware failure [22]. While data loss through hardware failure is a concern that may afflict our system, this thesis does not examine this issue in depth and has the underlying assumption that database backups will be sufficient to address this concern for the time being.

The way that we approach data reconciliation in times of conflicting data updates is also different. As mentioned before, CDSSs can allow databases to hold different instances of the data.
While data updates that do not conflict are propagated across databases, data updates that do conflict are held locally [17]. Orchestra has a concept of trust, which translates to setting rules that determine which individuals/institutions are trusted over others. These rules are used when there are data conflicts to determine which change gets stored locally. In our system, we also use the concept of trust values to try to resolve data conflicts.

2.1.2 Examining Data Changes

Some DBMSs have taken the approach of helping data become more useful by leveraging the data's lineage information (i.e., the complete history of an item). While some DBMSs have taken the approach that we have, where data provenance is stored as metadata [30], others have decided to use annotations to help users settle disagreements about data [10, 14]. In this thesis, we are also concerned with storing lineage information, as we want users to be able to reflect upon what has happened to their data over time. As such, we have observed the different methods by which other systems have approached tracking provenance and used those systems to inform our own decisions about how to track provenance, and to identify the pros/cons of our decisions.

Provenance-aware storage systems (PASSs) are systems that try to store provenance and lineage information as metadata [30, 35]. While PASSs focus on tracking provenance at the file system level, and even though we are working in a centralized relational environment, we can still learn from the motivation and automated processes that a PASS uses to track provenance. In a PASS, provenance information is generated automatically.
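As a concrete illustration, an automatically generated record for a derived object in a PASS-style system might look roughly like the following sketch (the function and field names are our own illustrative assumptions, not the actual PASS interface; only the three kinds of information stored mirror what the PASS literature describes):

```python
# A rough sketch of a PASS-style provenance record, generated automatically
# when a new object is derived. Field names are illustrative assumptions.
import platform
import uuid


def make_provenance_record(inputs):
    """Automatically build a provenance record for a newly derived object."""
    return {
        "object_id": str(uuid.uuid4()),    # unique identifier for the object
        "inputs": sorted(inputs),          # complete set of inputs used
        "platform": platform.platform(),   # platform the change occurred on
    }


record = make_provenance_record({"raw_survey.csv", "cleaning_script.py"})
print(sorted(record.keys()))  # ['inputs', 'object_id', 'platform']
```

The point of the sketch is that none of these fields require user input, which is what makes the provenance collection automatic.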
We examine the kind of information stored about each provenance instance (a unique identifier for the object, the complete set of all inputs, and a description of the hardware platform the change occurred on) to verify that our minimalist approach to storing transformation provenance information (Section 4.1) is sufficient.

We also briefly examine the idea of approaching provenance by having users explain their changes, similar to the approach taken by DBnotes [10] and a belief database [14]. DBnotes takes the approach of having notes attached to each data update/transformation and then continually propagates those notes as the piece of data in question gets further transformed. A belief database is a database that allows users to explain data conflicts to each other (e.g., if user A disagrees with user B on the value of something, user A can attach a note in the database explaining why she thinks the value is wrong). These systems provide a good starting point for thinking about how best to help users examine the provenance of their data. However, as our main focus was on managing data updates and not on provenance, belief annotations and explanations about data transformations were not incorporated into the system.

2.1.3 Probabilistic Databases

There is a subset of DBMSs that deals with situations where we do not know the exact value of the data inside a database. Such DBMSs are called probabilistic databases. The main difference between probabilistic databases and the traditional databases that most people are familiar with is that probabilistic databases contain data that may be true, as opposed to traditional databases where we know the exact value of every piece of data. Probabilistic databases can arise in situations where, for example, experimental values from multiple iterations of an experiment may be replaced by an aggregated statistic.
If that statistic had been stored in a traditional database, it might be unclear whether the statistic is exact data or derived data.

Trio [1] is a probabilistic DBMS that manages data, the accuracy of data, and lineage information. Trio approaches lineage and accuracy from the viewpoint of data that may be inexact or may have missing values. In our system, we work with a traditional relational database and, as such, assume that our values are exact (or, in cases where there are data conflicts, we have conflict resolution policies). However, we drew upon some of the ideas in Trio when we designed our data conflict resolution policies. Specifically, we drew upon the way Trio used confidence values (similar to Orchestra, Section 2.1.1) to handle data conflicts. This showed us that the idea of confidence values could be generalized to many different scenarios and probably did not need any extra modifications for use in the research context of this thesis.

2.2 Data Provenance

Data provenance refers to the idea of being able to track the lineage of data, namely where the data has come from and what has happened to it since. There are four types of data provenance: lineage, why, where, and how. Lineage provenance tries to identify the base items that have contributed to a query result or view [11]. Why provenance answers the question of which tuples are required in order to derive a query result; it tells us what tuples are required in order for the answers to the query to be what they are [8, 15]. Where provenance tells us the location that a specific piece of data has come from [8, 15]. How provenance answers the question of how a piece of data became what it currently is; it captures the transformations applied to a piece of data.

In this thesis, we are particularly interested in how provenance. We want to help non-computer experts determine the how provenance of a piece of data in order to decide whether they would trust the data enough to use it.
To support this goal, we decided to use an eager approach to tracking provenance [5, 38], which meant that we tracked provenance as data was added/modified rather than inferring and calculating provenance at a later point.

2.3 Transformation Languages

A transformation language is a function that takes some data as input and outputs the data in a different form. In the case of this thesis, the transformation language we focus on takes as input a data transformation (i.e., an action like editing a value or adding a row) and outputs the data transformation in a short, succinct way that users can understand.

Most of the work in the data transformation language domain seems to fall into one of two categories. Either the work focuses on trying to help users visualize the data transformations that have occurred [24, 25, 32] (Section 2.5), or it focuses on modifying existing transformation languages to help with the data transformation process [16, 26]. Due to the lack of literature about the specifics of data transformation languages, we decided to examine data transformation language frameworks. There was minimal literature on using logical programming concepts to create a data transformation language [7]. The idea was that people find declarative languages like logical programming languages easier to understand, as they focus on the relationships between the data models rather than on how to locate a certain piece of information to work with [7, 39]. There has been research on declarative data transformation languages for different data models, highlighting the flexibility required to keep current with today's less structured data [7, 27, 39]. Therefore, we decided to model our data transformation language accordingly.

Aside from determining how to represent changes, we also had to identify which changes we should support.
From a literature search [13, 24, 25, 31, 32], we determined that our data transformation language had to focus on logical transformations rather than physical schema changes; so, in the end, we decided to support map and reshape changes (Section 4.1) [24].

2.4 View Maintenance

In the first data management scenario (Chapter 1), a user has a local copy of data relevant to them and is oblivious to any changes that may occur on the database side, which is very analogous to the view maintenance problem in data warehousing environments. We considered the Eager Compensating Algorithm (ECA) [42], which describes triggering events to update the warehouse or the application's materialized view. While this approach may be helpful in correcting inconsistencies arising from decoupling the data source and the application, it requires quite a bit of overhead to regularly connect, and it creates the opportunity for anomalies (consistency issues that arise when updating views while base data is being changed) [42].

We implemented an application that takes a mixed-initiative approach to updating the remote data source, making it important for us to be aware of how often to push any possible schema changes. This is a comparable problem to how materialized views need to push their changes to a remote database [2, 3, 41]. By looking at how views and indices are maintained in a large distributed database [2], we noted that the model suggested in the papers validated many of the design decisions we made for our transformation language (Section 4.1). For example, the discussion on the different ways to push updates and how to handle update discrepancies (source versus application) [2] is reflected in our implementation.

2.5 Transformation Visualizations

As mentioned in Section 2.3, a large body of work has focused on visualizing the data cleaning and manipulation process [24, 32].
However, the user who wants to use a data point may not always be the one who has modified it. For example, if I were a molecular biologist studying a particular protein, the information I look up in an open-source database was probably not derived or cleaned by me. It is more likely that someone else examined the data and cleaned it prior to putting it in the open-source database. In this case, having tools to help me visualize the process of data cleaning does not help me understand whether or not I should use this data. After all, I am unfamiliar with the person who collected this data, and I am not sure if any other individuals have modified the data after the initial researcher entered the information. In situations like these, I would need a tool to help me understand what has happened to the data over time.

Part of the motivation for this thesis was to help non-experts understand the changes their data has undergone in order to help them determine whether they should use the data. We wanted to keep the visualization and interaction as simple as possible in order to prevent information overload. As such, we examined different data visualization tools, specifically Wrangler [24] and Potter's Wheel [32], to observe how data is displayed and to see if there were pieces of metadata common to all visualizations.

Wrangler is a visualization tool that displays data transformations to the user through a combination of a spreadsheet interface and a list of changes that took place on the data. It also uses colour to show changes. The characteristics of Wrangler that we particularly liked were the use of colour to highlight changes and the way changes were explicitly listed for the user. Those were the characteristics we later tried to incorporate into our own visualization.

Potter's Wheel is a data cleaning framework focused on helping users perform data transformations and detect discrepancies through a spreadsheet interface.
It focuses on the interaction and usability of the spreadsheet interface, and not as much on the visualization of the data. While we did not get a lot of information about actual data visualization through this tool, identifying the attributes they focused on in their graphical interface (e.g., interface usability) was important to our design process.

Chapter 3

System Introduction

Data can come in many different formats and workflows. For example, Wikipedia articles are a source of data, and for unprotected articles, the data workflow is very free-flowing. Anyone can edit an article, and the changes do not go through any approval process before they are seen by the rest of the world. In other cases, such as a curated scientific database, a designated individual may have to approve any changes made to the data before the changes are seen by the rest of the world.

As mentioned in Chapter 1, this thesis focuses on trying to help individuals use their data. One of the driving goals behind the design of this system is to allow users to collaboratively improve the quality of their data. To accomplish this goal, we move towards a model where users do not need an administrator or designated editor to approve the changes that they make. This is because, as the number of users looking at the data increases, the number of changes they request will likely also increase, and the bottleneck of waiting for changes to be approved and go live would be too long. In order to generalize the system to all users, both expert and non-expert, we make the assumption that the DBA is a separate person from the domain expert.
In this system, the DBA is the person who controls the flow of information from the database to the domain expert, and the domain expert manages the content and accuracy of the information. Making this assumption helps us generalize to more data usage situations, as we do not require users to have any skills other than their domain knowledge. Users can, of course, have knowledge of tools like R, MATLAB, and Python; but by aiming for the lowest common denominator of technical knowledge, we can make the system applicable to a wider range of users.

In this thesis, we make the following assumptions:

• There is a centralized relational repository.
• The DBA and the user of the data are two separate people.
• The user of the data has made a copy of the data and stored it on their local machine. Any application that uses the data (e.g., Microsoft Excel) is also compatible with relational databases.
• There are no constraints or views.

As mentioned in the Introduction (Chapter 1), this thesis looks at managing data updates in four different scenarios:

1. When changes are made locally on the user's machine
2. When changes are made on the database and need to be propagated to the user
3. When changes are not stored locally but the data has to be updated
4. When generating a list of changes by comparing two different versions of a dataset

In order to accommodate the first three scenarios, two different user interfaces were created: one for the user (who is a domain expert but may lack technical competence) (Figure 3.1), and one for the DBA (who is not a domain expert) (see Figure 3.2).
The system was implemented from the point of view of both the domain expert and the DBA in order to determine the metadata requirements, so that the system can support data updates in the first three scenarios.

Figure 3.1: What the user interface would look like once the user has loaded the data to be viewed.

Figure 3.2: The first screen the DBA sees as he/she prepares to get the subset of data needed by the domain expert user.

Chapter 4

System Internals

4.1 Transformation Script

To represent what has happened to a dataset (and thus propagate and merge the correct changes to the centralized database), changes need to be recorded in a standardized manner. The transformation script is a way to standardize those changes (see Table 4.1 for the templates of all the data transformations that we consider in this thesis). The items inside the square brackets in the data transformation templates represent possible values that could appear in an actual transformation script. Each user action is accompanied by a timestamp that represents when the change occurred. This timestamp helps identify which changes should take precedence over others. In this system, when two or more changes happen to the same cell and are pushed to the database at approximately the same time, the system first examines the trust value (see Section 4.5) of the individuals involved in the changes. If one individual is more trusted than another, the changes from the more trusted individual will be chosen. However, if the individuals involved in the changes all have the same trust value, then the timestamp becomes the determining factor of which change gets accepted into the database: the change with the later timestamp will be the one accepted.

When designing the transformation script, it was important to keep it readable for non-technologically savvy users.
While we do not expect examining the list of changes to be a common use case, we do think that users will occasionally want to see what changes they have previously made to their data, especially if they have been working on the data over a period of time. For that reason, two of the criteria under consideration during the transformation script design were readability and conciseness.

The timestamp in each of the transformation scripts determines the chronological order of data transformations by using time-aware schemas [34] to help facilitate data integration.

Although brevity was a high priority when designing this transformation script, there were times when this design concern was shelved in order to lower the number of database calls required. For example, the edit value transformation script stores the previous value of the cell to help reduce the number of database calls required when trying to determine possible edit conflicts. If user A and user B have both edited the same cell within two hours of each other (Section 4.9), but user A has pushed her changes before user B, user A should have no problem updating the cell. However, when user B pushes, there should be a conflict warning. In order to successfully apply the change, we have to determine what value we are changing the cell from. There are two possible scenarios that could have occurred. The first scenario is when both user A and user B are trying to change the same value (e.g., user A and user B have both tried to change a cell that used to hold a value of 3). The second scenario is when user A has made a change to the cell (e.g., edited the cell to have a value of 5), and user B has obtained a copy of the data after that point and made changes (e.g., has edited the cell from 5 to 7). In the first scenario, there should be a conflict warning, as users A and B are trying to overwrite the same value in the cell.
Conversely, in the second scenario, we should accept the change without any conflicts, even though user A and user B have made changes within two hours of each other.

If we did not store the previous value of a cell in the edit value transformation script, we would first have to go to the AppData table (Section 4.3) to obtain the last sync time of user B's dataset. Once we have the sync time, we would then have to go into the table that stores the provenance data of the cell in question and examine all the changes that have happened to it, to see if any change occurred after user B's last sync time. Using this method, we can come to the conclusion that user B has a conflict; but a faster heuristic that avoids all the database calls is to store and use the previous value of the cell.

User Action     Data Transformation Syntax
Add Row         addRow(rowUUID:[c360b701-f91b-4920-937e-4afdfbebe194]), timestamp
Remove Row      removeRow(rowUUID:[dca301c7-2a25-40f4-8212-b1b495c0d4dc]), timestamp
Add Column      addCol(colUUID:[22ce55c9-5af3-48a1-ba24-b160961fa048], columnName), timestamp
Remove Column   removeCol(colUUID:[22479ac1-627c-44d6-a14b-ce273bba70af]), timestamp
Edit Value      editValue(rowUUID:[fd8f3dd2-76f1-43c6-8930-c4831997bf49], colUUID:[fa5d24c7-39a8-43f5-ad07-75a3dd8eda02], previousValue:[cat], updatedValue:[hat]), timestamp

Table 4.1: The data transformations logged to the transformation script as the user makes changes to the data.

4.2 UUIDs

The Java UUID class generates unique 128-bit values and was used to create unique keys to identify rows and columns. To lessen the burden on non-technologically savvy users, data that is imported into the system has the UUIDs added automatically; the user bears no responsibility for ensuring that their data has the correct UUID values before database import.
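As a concrete illustration, generating identifiers with the Java UUID class and rendering an edit in the Table 4.1 syntax could be sketched as follows. The class and method names here are our own hypothetical sketch, not the thesis implementation.

```java
import java.util.UUID;

// Illustrative sketch (hypothetical names): generating row/column UUIDs
// with java.util.UUID and rendering an editValue entry in the Table 4.1
// transformation script syntax.
public class ScriptEntry {

    // Generate a fresh 128-bit identifier for a new row or column.
    public static String newUuid() {
        return UUID.randomUUID().toString();
    }

    // Render an editValue transformation in the Table 4.1 style.
    public static String editValue(String rowUuid, String colUuid,
                                   String previousValue, String updatedValue,
                                   String timestamp) {
        return "editValue(rowUUID:[" + rowUuid + "], colUUID:[" + colUuid
                + "], previousValue:[" + previousValue
                + "], updatedValue:[" + updatedValue + "]), " + timestamp;
    }
}
```

Since `UUID.randomUUID()` draws 122 random bits, collisions between freshly generated row identifiers are vanishingly unlikely, which is why the system can assign them without consulting the database first.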
By using UUIDs, the maximum number of arguments the transformation script needs to store is four: the row UUID, the column UUID, the new value of an edited cell, and the previous value of the edited cell. All the other user actions listed in Table 4.1 require fewer arguments.

4.2.1 Dataset UUIDs

As it is plausible that a user could request multiple subsets of data (e.g., if a graduate student was working on multiple projects, presumably he/she would require a separate set of data for each project), each dataset generated for a user has its own unique UUID. This UUID is tracked in the AppData table (see Section 4.3).

4.2.2 Row and Column UUIDs

When designing the transformation script, a key problem was correctly identifying a particular row or cell across all the different user actions. In a database, rows and cells are identified by primary keys; primary key(s) unambiguously point to exactly one row. However, in a situation where the user does not have direct access to a database and relies on a transformation script to express any possible changes, actions like edit value could have a catastrophic effect on the dataset, as any change to a primary key cell would cause changes to be applied to the wrong cell when pushing changes to the database. For example, suppose we had a Movie database where the primary key is a single column, MovieName (see Table 4.2). If we did not use row and column UUIDs, and a user working on his local dataset decided to edit a value in the MovieName column, say from Despicable Me 2 to Inside Out, it would be impossible to identify the correct row to change when the change is pushed. The transformation script would record something along the lines of "editValue(colName: MovieName, value:'Inside Out')". We would not be able to identify that the cell being changed is the one with "Despicable Me 2" as its value for MovieName.
An initial solution to this problem was to include the original value of the cell in the transformation script along with the new value, but this approach does not scale when it comes to considering other user actions. For an action like delete row, if the row being deleted has many primary key columns (i.e., because it is a composite key), the resulting transformation script would be long and cumbersome to read. Hence, the idea of using UUIDs to uniquely identify each row and column was introduced. Since the UUIDs are only used to keep track of cells and rows, the user application hides the UUID columns when data is loaded into the UI to prevent users from being confused.

If the user ever requests data that comes from two or more tables, the UUIDs of the rows in the resulting dataset are not changed. The UUIDs will be appended to each other based on the alphanumerical order of their respective tables.

MovieName        MovieStudio                 CriticRating  UserRating
Despicable Me 2  Illumination Entertainment  5/5           5/5
Up               Pixar                       5/5           5/5
Kung Fu Panda 2  Dreamworks                  4/5           5/5
Ratatouille      Pixar                       3/5           4/5

Table 4.2: Movie database example 1.

4.3 Database Architecture

In order to keep track of the various UUIDs used in the system (both row UUIDs and column UUIDs need to be tracked), a few extra tables must be included in the database. These tables include:

• UuidColMapping: a table that tracks the UUID of each and every column in the tables that store the dataset. There are four columns in this table: UUID, TableName, ColumnName, and PrimaryOrNot.
PrimaryOrNot is a column that stores a boolean value indicating whether or not the column in question is a primary key column.

• AppTableData: a table that tracks which dataset has its data.

• AppData: a table that stores the name of each application that has taken data from the database, when the application was last synced, what trust value was assigned to that application, and which dataset the application has (each dataset has its own unique ID) (see Section 4.2.1).

• LockData: a table that tells the database which applications are currently trying to push changes to it. If ten minutes have passed and the database still has not been updated, the entry will be deleted and other applications can try to push changes. The purpose of this table is to prevent multiple applications from trying to concurrently change the same tables. Ten minutes was chosen as the wait time because we assume that users are using somewhat current hardware; if an update has not finished in ten minutes, we assume that either something has gone wrong and needs to be terminated, or the update will not finish even if we allow the application to take extra time.

4.4 Data Provenance

A big concern when designing how and when to capture data provenance was how to minimize the amount of space taken up by the provenance data while still providing enough information to recapture the data provenance. We decided to create a History table (as seen in Figure 5.1 and Figure 5.2) for each data table within the schema. There was a definite trade-off between the minimal amount of data we could store and the number of database calls that would be required to fetch all the information required for the visualization (Section 4.11).
We decided to err on the side of minimizing the number of database calls, as database calls would quickly become a performance bottleneck, even with relatively few changes made to the data.

The data provenance table (Table 4.3) stores seven things: the row and column UUIDs, to correctly identify which piece of data was modified; the timestamp of the modification, to help chronologically sort data transformations; the value the data was changed to; the previous data value (to save having to re-query the history table to determine if a change would cause a possible data conflict); the action taken during that data-altering step (i.e., was it an "AddRow" action or a "DeleteRow", etc.); and the trust value of the application that changed the data, to help future conflict resolutions. We did not want to use one central table to store all provenance information, as we felt that this would create one cumbersome table with poor performance as provenance queries were executed against it.

Columns in a Provenance Table
RowUUID
ColUUID
NewValue
OldValue
ActionTaken
Timestamp
TrustValue

Table 4.3: The columns within a data provenance table.

4.5 Trust Values

In this system, trust values are assigned to the different users working on the data. The premise behind this is that some users (e.g., researchers) are more authoritative when it comes to the accuracy of the data, as compared to other, possibly more casual, users (e.g., first-year students). Thus, users who lean towards the domain expert side are given a higher trust value than other users. Given a change from each of these user groups (researcher and student), the probability of the researcher pushing a correct change is higher, and as such, preference is given to changes from the domain expert group.
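The resolution policy described here (the more trusted change wins; among equal trust values, the later timestamp wins, as in Sections 4.1 and 4.9) can be sketched as follows. The class and field names are hypothetical, not the thesis implementation.

```java
// Illustrative sketch (hypothetical names) of the conflict-resolution
// policy: higher trust value wins; ties are broken by the later timestamp.
public class ConflictResolver {

    public static final class Change {
        final int trustValue;        // assigned by the DBA at export time
        final long timestampMillis;  // when the change was made
        public Change(int trustValue, long timestampMillis) {
            this.trustValue = trustValue;
            this.timestampMillis = timestampMillis;
        }
    }

    // Returns the change that should be stored in the database.
    public static Change resolve(Change a, Change b) {
        if (a.trustValue != b.trustValue) {
            // The change from the more trusted application is chosen.
            return a.trustValue > b.trustValue ? a : b;
        }
        // Equal trust: the change with the later timestamp is accepted.
        return a.timestampMillis >= b.timestampMillis ? a : b;
    }
}
```

Note that trust is compared before time, so an older change from a researcher would still override a newer change from a student.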
Trust values are assigned by the DBA as the data is exported from the system.

4.6 User Actions

Any data change falls into one of eight types: map, lookups and joins, reshape, positional, sorting, aggregation, key generation, and schema transformation [24]. When deciding which data changes to support in this system, it was decided that only map and reshape should be supported, as these two actions were deemed the most basic of the eight types. Before changes in the other categories could occur, map and reshape changes had to be supported. Therefore, we started with these two classes of changes.

Map changes deal with "[transforming] one input data row to zero, one, or multiple output rows" [24]. This includes changes such as deleting a row. Reshape changes deal with schema changes such as adding or removing a column. If we had decided to add an extra column to denote the movie release date in Table 4.2, that would be considered a reshape change. The scope of this thesis does not deal with changing the data type of a column.

It is also important to note that as users perform actions, they have the option of choosing whether or not each action should sync to the database. As users may perform changes that are only meant to help themselves manage and navigate the data, not all changes should be automatically synced. Changes that occur while the "Record Changes" checkbox located at the top of the user view is checked (Figure 4.1) will be synced to the database at a later time.

Figure 4.1: The red box shows the "Record Changes" checkbox that determines whether or not changes will get synced to the database.

4.7 Requirements for Exporting Data

Regardless of which columns a user requests from the DBA, two requirements apply to the exported data.
The first requirement is that for every table that the user requests a column from, the primary key column(s) of that table must be included. This is to prevent issues that may result from missing primary key values when the user tries to push a change that involves adding a row (see Section 4.8.3). The second requirement is that every "NOT NULL" column that exists in the table(s) the user requests column(s) from must be included in the exported data. This ensures that AddRow changes will be successful.

Each dataset that is exported has a unique UUID associated with it. This was done for instances where a user could have two or more datasets in use concurrently. The UUID of a dataset associates a group of columns from that dataset with the user's application name to track who has received the data. When the DBA exports data to the user, the user actually gets two files. One file contains the user's data, while the other contains the schema mapping information of the data that the user has received. The schema mapping file contains the dataset UUID along with the UUID information of the columns in the dataset (i.e., the column UUID, the name of the table the column belongs to, the name of the column, whether or not the column is a primary key, and the data type of the column).

Figure 4.2: If the "Record Changes" checkbox is checked, any changes that are made will be recorded in a transformation script that will be used later to push changes to the database. On the right side of this figure, we show the transformation script for changes that a user has made using the user interface on the left side of the figure.

4.8 Restrictions on User Actions

4.8.1 Add a Column

Users cannot add a primary key column (i.e., the user cannot specify that the new column is unique and not null), since we cannot retroactively ensure that the previous rows existing in the dataset being viewed are filled out.
For example, if the user only fetches a subset of data from two different tables, adding a primary key column would be impossible, as the user would have to ensure that the data rows he/she is not viewing also have unique and non-null values. In our movie example, say a user wishes to see all the information about Pixar along with any movies Pixar has produced that have a user rating of 5/5. By taking all the columns in Table 4.4 for the Pixar row and the MovieName column from Table 4.2 where the user rating is 5/5 and the movie studio is Pixar, we get one row as a result (see Table 4.5). If we wanted to add a primary key column to Table 4.5, two issues would appear. The first issue is which table the new column should belong to. In the movie example, we would need to determine whether the new column should go into Table 4.2 or Table 4.4. The second issue is that, regardless of which table the column ends up being added to, there are rows in the table that need values. In the case of Table 4.2, all the other movies (i.e., Despicable Me 2, Kung Fu Panda 2, and Ratatouille) would need values for that new unique and not-null column. Likewise, any movie studio in Table 4.4 that is not Pixar would need values for the column.

For a small dataset, especially for our toy movie dataset example, it could be argued that a user could enter the rest of the values and it would not be much of a hindrance. However, for any dataset beyond a small number of rows, it would become burdensome and impractical.
Thus, the restriction on adding primary key columns was created.

MovieStudio                    YearLaunched  HQLocation
Pixar                          1986          Emeryville, California
Walt Disney Animation Studios  1923          Burbank, California
Illumination Entertainment     2007          Santa Monica, California
20th Century Fox Animation     1997          Los Angeles, California
Dreamworks Animation           2004          Glendale, California

Table 4.4: Movie database example 2.

MovieStudio  Year Launched  HQ Location             MovieName  User Rating
Pixar        1986           Emeryville, California  Up         5/5

Table 4.5: Result of selecting all the information about Pixar along with any movies produced by Pixar which received a user rating of 5/5.

Figure 4.3: What the user UI would look like when the user starts to add a column.

4.8.2 Remove a Column

A primary key column cannot be removed by the user if the "Sync" checkbox is checked. This is to protect the dataset from accidental mass deletions. As deleting a primary key in a table may drastically influence the table and any dependencies (i.e., foreign key relationships), it is safer for all parties using the data if any primary key deletions are done by the DBA, to preserve the other tables in the database schema.

Figure 4.4: The result of performing the add column action shown in Figure 4.3.

4.8.3 Add a Row

When adding a new row, there is no requirement that all the columns in the application view be filled out. As long as the primary key is filled out, the user can successfully add a row. For example, if MovieStudio were the primary key column of Table 4.4 and a user with the dataset depicted in Table 4.5 tried to add a new row by only giving a value for Year Launched, the row would not be added. However, if the user tried to add a new row by giving values to both Year Launched and MovieStudio, then the row could be successfully added to Table 4.4 (see Figure 4.5 and Figure 4.6 for what happens when a user adds a row).

The system also does a simple type check to ensure that number columns do not contain characters or strings.
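Such a type check amounts to accepting a value for a number column only if it parses as a number. A minimal sketch, with a hypothetical helper name rather than the thesis code:

```java
// Illustrative sketch (hypothetical name) of a simple type check for
// number columns: a value is accepted only if it parses as a number.
public class TypeCheck {

    public static boolean isValidForNumberColumn(String value) {
        try {
            Double.parseDouble(value.trim());
            return true;
        } catch (NumberFormatException e) {
            // Characters and free-form strings are rejected for number columns.
            return false;
        }
    }
}
```

A real implementation would likely distinguish integer from decimal columns, but the parse-and-reject pattern is the same.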
Since any columns that have restrictions such as "NOT NULL" are automatically included in the dataset, we can ensure that all rows can be successfully added. The row UUID is automatically generated on the user application side when the row is created. Due to the nature of UUIDs, it is unlikely that UUIDs will clash, but in the case where a new row's UUID clashes with a UUID already in the database, a new UUID will be assigned to the row.

When adding a row, multiple transformation scripts get generated. The act of adding the new row is one transformation script, while each of the values given to a cell in the new row is an EditValue transformation script. For example, if we wanted to add a new row to Table 4.4 with the values "Studio Ghibli" for "MovieStudio", "1984" for "Year Launched", and "Tokyo, Japan" for "HQ Location", the transformation scripts for the AddRow action would look like the ones shown in Table 4.6 (assume that the example UUIDs are 128-bit values; the full UUID values have been left out for brevity).

Transformations
addRow(rowUUID:1234), 2016-03-16 20:35:05.641
editValue(rowUUID:1234, colUUID:2345, ' ', 'Studio Ghibli'), 2016-03-16 20:35:05.642
editValue(rowUUID:1234, colUUID:6349, ' ', '1984'), 2016-03-16 20:35:05.642
editValue(rowUUID:1234, colUUID:8745, ' ', 'Tokyo, Japan'), 2016-03-16 20:35:05.643

Table 4.6: The list of transformations that would be generated when a new row is added to Table 4.4.

4.8.4 Remove a Row

There are no restrictions on RemoveRow. Any row can be removed by any user. For users that deal with data that is not stored locally (i.e., scenario 3 in Chapter 1), DeleteRow can still occur. We discuss constraints on views in Section 4.10.

Figure 4.5: What the user sees when he tries to add a row.
There is no restriction on needing to fill all the columns shown unless a column has been specified as not null.

4.8.5 Edit a Value

If the user tries to clear the value of a primary key, a warning will appear letting the user know that this value is necessary for identifying the row and asking if they are sure they want to delete it. If the user insists on clearing the value in the primary key column, the row will be deleted. The user will never be forbidden from editing a value.

Figure 4.6: The result of having performed the add row action from Figure 4.5.

4.9 Pushing Data Changes to the Server

The user decides when he wants to push his changes to the database (see Figure 4.7). Once the user decides to sync, the transformation script will be sent to the server. At this point the server will process the transformation script and convert it to the query language used by the database to input the changes. If there are changes that occur within two hours of each other, we consider that a conflict. When there is a conflict, we determine if there is a difference in trust value of the two data changes that are in conflict. If there is a difference, we trust the data change with the higher trust value. If there is no difference in trust values, we use the data change with the later timestamp.

Figure 4.7: How the user would sync changes to the database using the user interface.

Two hours was chosen as the time limit for detecting conflicting changes arbitrarily. In the future, it may be wise to investigate whether using a sliding scale to determine time limits would be more useful. It seems reasonable to assume that newer datasets would need a larger time limit for detecting conflicting changes as people are slowly starting to use the data and discover errors.
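The conflict rule described in Section 4.9 can be sketched as follows. This is a hypothetical illustration of the rule in the text (a two-hour window, trust value first, later timestamp as tiebreaker), not the thesis implementation:

```python
from datetime import datetime, timedelta

CONFLICT_WINDOW = timedelta(hours=2)  # the arbitrary limit discussed above

def is_conflict(ts_a: datetime, ts_b: datetime) -> bool:
    """Two changes to the same value conflict if they fall within the window."""
    return abs(ts_a - ts_b) < CONFLICT_WINDOW

def resolve_conflict(change_a, change_b):
    """Pick the winner between two conflicting changes.

    Each change is a (timestamp, trust_value, new_value) tuple."""
    ts_a, trust_a, _ = change_a
    ts_b, trust_b, _ = change_b
    if trust_a != trust_b:                         # higher trust value wins
        return change_a if trust_a > trust_b else change_b
    return change_a if ts_a >= ts_b else change_b  # tie: later timestamp wins
```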
Older datasets could possibly use a shorter time limit, or the time limit could be determined based on other factors like a dataset’s frequency of use.

Aside from updating the user’s changes on the database, the server will also check to see what changes have occurred to the dataset since the user last synced. From the AppTableData table (see Section 4.3), we can determine which columns exist in the dataset that the user is trying to sync; we can use the AppData table (Section 4.3) to determine the last sync time of this dataset. From the UuidColMapping table, we can determine which tables the columns belong to and thus, we know which provenance tables to examine for possible data updates. Once we have the correct provenance tables, we can use the time of last sync and the column UUID to find all the changes that have occurred to the values in a column since the data was last synced. These changes can then be sent back to the user.

As the system currently stands, the server portion of the system has not been implemented and will be left to future work. To simulate the server API response, we have a file that stores a few transformation scripts. During testing (Chapter 5), the user application reads from the file and sees it as changes that the server has given it.

4.10 Data Transformations on Non-locally Stored Data

In the four data usage scenarios that we introduced in Chapter 1, our third scenario involves users who do not have a local copy of data. These users essentially suffer from a view maintenance problem (Section 2.4). As the overhead of constantly fetching changes from the server and reapplying the user’s own local changes is too high to realistically support, we have decided to approach this scenario much like how we approach the first two usage scenarios in Chapter 1. Specifically, we still allow users to choose which changes they want to sync to the database. The difference comes from how the changes are synced.
With the first two data usage scenarios, the user had the choice of when to push their changes to the server (Section 4.9). In this data usage scenario, the user will not get a choice of when to push to the server. The system will periodically choose to push the user’s changes and fetch back changes that were made. The system will then merge the fetched changes with the user’s local copy of the changes. In the case where a fetched change conflicts with a local change, the user will choose the resolution of that conflict (i.e., the user chooses which change is correct: the one from the database or the one she has made previously). The exact time of when to periodically push the changes is currently arbitrarily decided.

4.11 Visualization

To help support non-technologically savvy users, it was proposed that there be some way for users to view what has happened to a piece of data over time. While DBAs and people familiar with querying data may be able to come up with answers to questions like “What has happened to this piece of data since it was introduced?”, “What was the last change to happen to this piece of data?”, or “When was this piece of data introduced?”, domain expert users lack this ability. To help fill this need, we came up with a visualization that would show users what has happened to a piece of data since it was first introduced into the dataset.

The visualization breaks the information down into three levels: year, month, and date. The basic motifs of the visualization remain the same in each level but the information displayed to the user will be successively more detailed. The visualization is a timeline with coloured bars on top. The height of a bar represents how many changes have been made in that year/month/date while the colour of the bar represents the frequency of changes.
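The colour encoding can be sketched as a small colour-assignment helper. This is a hypothetical illustration using the view-specific thresholds detailed in the year, month, and day views of Section 4.11; the colour for exactly 8 day-view changes is unspecified in the text, and this sketch treats it as orange:

```python
def bar_colour(changes: int, view: str) -> str:
    """Map a change count to a bar colour for the given view.

    Counts are per two-month window (year view), per seven-day window
    (month view), or per single day (day view)."""
    red_at = {"year": 20, "month": 20, "day": 9}[view]     # day: "more than 8"
    orange_at = {"year": 10, "month": 10, "day": 4}[view]  # day: 4 to 7 inclusive
    if changes >= red_at:
        return "red"
    if changes >= orange_at:
        return "orange"
    return "green"
```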
In the year view (Figure 4.8), if the piece of data has 20 or more changes within two months, the bar will be coloured red to represent frequent changes; if there are between 10 to 19 (inclusive) changes within two months, then the bar will be orange. In the month view (Figure 4.9), if the piece of data has 20 or more changes within a period of seven days, then the bar will be red; if there are between 10 to 19 (inclusive) changes within seven days, then the bar will be orange. For the day view (Figure 4.10), if a piece of data has more than 8 changes, the bar representing that day will be red; if there are 4 to 7 (inclusive) changes, then the bar will be orange. If the number of changes falls below the threshold for an orange bar, the bar will be green. There will be a gradient legend in the visualization demonstrating how the colour scheme works. If the user clicks on one of the bars, the details of the changes represented by that bar will appear below the bars.

For more information about how the visualization was evaluated, please see Chapter 6.

Figure 4.8: What the user sees if he/she is in the “Year” view and has clicked on a bar to view changes that have occurred in that year.

4.12 Generating a Transformation Script from Data Snapshots

As mentioned in Chapter 1, this thesis looks at managing data updates in four different scenarios. For the fourth scenario (generating a list of changes by comparing two versions of data), we attempted a more brute force approach similar to the diff operator in Model Management [6]. We very quickly realized that it was impossible to generate a fully accurate transformation script because intermediate data changes could not be captured.
For example, if a before data snapshot had a piece of data with the value “1234” and the after snapshot had the value “2345”, we can only say that the piece of data has been edited from “1234” to “2345”. However, that piece of data may have had other intermediate values before becoming “2345”. That intermediary information can never be generated from a before/after data snapshot.

Figure 4.9: What the user sees if he/she is in the “Month” view and has clicked on a bar to view changes that have occurred in that month.

Ideally, when comparing two data snapshots, the order of the data can be changed to sort the data by row UUID. If the actual data files cannot be modified, it is strongly suggested that copies of the data files be made and the data be sorted in the copies. Not sorting the data causes a much higher level of complexity, as each and every row in both data files has to be compared to determine if any edit value changes were made in the rows. If the data is sorted, the complexity of examining each and every row in both files would be O(m + n), whereas if the data were not sorted, the complexity would be O(mn).

Figure 4.10: What the user sees if he/she is in the “Day” view and has clicked on a bar to view changes that have occurred on that day.

When comparing two data snapshots to generate the transformation script, the columns should be checked first to determine if any columns have been added or removed. By checking and accounting for any removed columns first, a myriad of edit value transformation scripts are avoided (i.e., if you do not check for removed columns first, you will end up with each and every row giving you an edit value transformation where the value in a cell has disappeared).
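The comparison procedure (columns first, then rows keyed by UUID in sorted order) can be sketched as follows. This is a hypothetical illustration, not the thesis implementation; for brevity, new rows emit only an addRow entry rather than per-cell EditValue entries:

```python
def diff_snapshots(before: dict, after: dict) -> list:
    """Generate a transformation script from two snapshots.

    Each snapshot maps row UUID -> {column name: value}; rows are visited
    in sorted UUID order, mirroring the O(m + n) sorted-merge argument."""
    script = []
    # 1. Columns first, so removed columns do not surface as per-row edits.
    cols_before = set().union(*before.values()) if before else set()
    cols_after = set().union(*after.values()) if after else set()
    script += [f"removeColumn({c})" for c in sorted(cols_before - cols_after)]
    script += [f"addColumn({c})" for c in sorted(cols_after - cols_before)]
    shared_cols = cols_before & cols_after
    # 2. Rows, keyed by UUID: removed, added, or compared cell by cell.
    for row_uuid in sorted(before.keys() | after.keys()):
        if row_uuid not in after:
            script.append(f"removeRow({row_uuid})")
        elif row_uuid not in before:
            script.append(f"addRow({row_uuid})")
        else:
            for col in sorted(shared_cols):
                old, new = before[row_uuid].get(col), after[row_uuid].get(col)
                if old != new:
                    script.append(f"editValue({row_uuid},{col},'{old}','{new}')")
    return script
```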
Checking for new columns before checking any rows helps minimize the number of times that each row in the data file has to be read in, because when you check the row, you can account for the extra cell(s) while checking the value in each of the original cells. The alternative would be to read each row in the after snapshot and only check columns that appear in the before snapshot. Then, at a later point, you would have to reread each and every line to look at the additional cells in the after snapshot that you did not examine previously.

Once we determine which columns have been added or removed, each and every row will need to be examined. The row UUID can be used to quickly check if a row is new or not (this is where sorting the data becomes handy because sorted data minimizes the number of rows that need to be checked before determining that a row UUID does not exist in a data file). If the row UUID exists in both snapshots, then the values within that row can be compared with each other. Unfortunately, there is no faster process to do this. If the row UUID does not exist in both snapshots, then the row has either been removed (if the row UUID was in the before snapshot but not the after) or added (if the row UUID is in the after snapshot but not the before).

Chapter 5

Experiment

In order to determine if the system could handle the first two data usage scenarios described in Chapter 1, we set out to test the system with some real life data we found (Section 5.1). We focused on testing just the first two scenarios because the third scenario (i.e., when a user works with data that is not locally stored on her machine) is considered a subset of the first scenario (when a user has a local copy of data that he is making changes to). If the system can handle the first scenario, it is considered trivial to add in a timer that would periodically push changes to the server.
The important part is to demonstrate that our system can handle parsing the transformation script that is used to store transformations in the database and update the current value of a data point accordingly. We also evaluated our system visualization (Section 4.11) using a user study. In Chapter 6, we discuss the details of the user study as well as analyze the results that we obtained.

5.1 Data

5.1.1 GLEI

The GLEI (Global Legal Entity Identifier) dataset was created in response to the financial crash that occurred in 2007-2008 as a means to track various financial organizations around the world [9]. Its goal is to improve financial transparency by requiring that all parties participating in the financial markets of the world identify themselves according to a predetermined standard. The dataset that is used in this experiment came from a lab-mate, O. AlOmeir [4], who obtained it from the work of V. L. Lemieux, P. Phillips, H. S. Bajwa, and C. Li.

Figure 5.1: The structure of the GLEI data within the MySQL database.

Not all the entries in this dataset were usable for experimentation as it contained non-unicode characters that could not be ported into MySQL. In total, 192105 entries were ported into MySQL and the data took up 74.7MB of space. The data schema in Figure 5.1 depicts how the data was structured.

5.1.2 Bird Data

This data came from Dr. Jill Jankowski’s lab (UBC zoology) and primarily focuses on the migration pattern and changes in birds over time. Nets are set in the forest to try and catch birds in a safe manner. Teams of researchers will carefully take each bird and observe whether it has a band on its leg.
Birds with bands on their legs signify that they have been caught previously; those birds will have their weight/size measured and observations may be made about their general health. If a bird does not have a band on its leg, then the researchers will put a band on its leg, record the number of the band, and then take the bird’s measurements. In particular, the data used for our experiment process is from 2011 and is 471 rows long. The data schema in Figure 5.2 shows how the bird data is structured. This dataset is used as a measuring stick to see if our system will work.

Figure 5.2: The structure of the bird data from 2011.

5.2 Technical Infrastructure

The datasets mentioned in Section 5.1 were stored on a MySQL server running the 5.6.28 openSUSE package.

The client used to connect to MySQL (and also the same computer used to run the user studies) is a MacBook Air running OS X El Capitan with 8GB of memory and a 1.7GHz i7 processor.

5.3 Testing for Correctness

To determine whether or not the system can handle managing updates, we carefully generated and curated a list of 100 changes for each of our datasets. A number of columns in each of the datasets were chosen as columns that could have their values changed, but no primary key column would have its value changed. A script then generated a list of changes by randomly choosing a row in the database and generating one to four different changes in the row. Out of the 100 changes generated, we made sure that some of the changes would modify the same rows at times that would cause a conflict.
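A sketch of this change-generation procedure is shown below. The helper name and in-memory representation are hypothetical (the real script targeted MySQL rows), and for simplicity the expected value recorded here is simply the last write; the deployed system resolves near-simultaneous changes using the trust/timestamp rule of Section 4.9:

```python
import random

def generate_changes(rows: dict, editable_cols: list, n_changes: int = 100):
    """Pick random rows, give each one to four edits, and record the
    expected final values (an "answer key") for later verification.

    May slightly exceed n_changes since edits are generated in batches."""
    changes = []
    answer_key = {r: dict(vals) for r, vals in rows.items()}
    while len(changes) < n_changes:
        row = random.choice(list(rows))
        for _ in range(random.randint(1, 4)):
            col = random.choice(editable_cols)   # primary keys are never edited
            new_value = f"v{len(changes)}"
            changes.append((row, col, new_value))
            answer_key[row][col] = new_value     # last write wins in this sketch
    return changes, answer_key
```

Replaying the generated changes in order against the original rows should reproduce the answer key, which is the property the correctness check relies on.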
At the end of this change generation process, we determined what the final values of the modified rows should be based on the changes generated and used that as our “answer key” when checking for the correctness of row values in the database. We ran this procedure 100 times and the database output was what we expected each time.

Chapter 6

User Study

To determine whether the visualization component of our system (Section 4.11) would help non-expert computer users understand the changes their data has undergone, we ran a user study to see how expert and non-expert users would react to using the visualization. Particularly, we were interested to see if the visualization would help users determine whether they should trust a value more easily than what is currently available to them (a spreadsheet application). We measured two metrics, time to task completion and the user’s own rating of whether or not one tool helped them more than the other, as we felt these two pieces of information would be indicative of the information we were interested in. We also wanted to see whether or not the visualization helped expert users and whether or not the gap in time to task completion between expert and non-expert users would change. We were also interested in learning about what qualities users examined when looking at an unfamiliar dataset. A secondary question we wanted to answer was if there were certain data qualities that users of all skill levels examined when looking at suspicious data.

6.1 Participants

Aside from the requirement of having to be a current UBC student, there were no other constraints on participation in this user study. Participants were sourced by putting up flyers in the common areas of the UBC Computer Science building; sending emails to the Computer Science, and Electrical and Computer Engineering graduate mailing lists; making in-class announcements; and sending a call for participation to biology graduate students in Dr. Lacey Samuel’s lab.
In total, there were fifteen participants (Table 6.1).

  Area of Study                          Student Status   Number of Students
  Biochemistry                           Undergraduate    1
  Biology                                Graduate         1
  Biology                                Unclassified     1
  Cognitive Science                      Undergraduate    1
  Computer Science                       Graduate         6
  Computer Science                       Undergraduate    2
  Electrical and Computer Engineering    Graduate         2
  Linguistics                            Unclassified     1

Table 6.1: Basic information about the participants who participated in the user study.

On a scale from 1 to 5, with 1 being “Very Unfamiliar and Uncomfortable with Computers” and 5 being “Very Familiar and Comfortable with Computers”, participants averaged 4.6, with 10 participants rating themselves as a 5, 4 participants as a 4, and 1 participant as a 3. 11 out of 15 participants work with data in some form, ranging from using Excel sheets dealing with statistical data like graphs and models to using R/Python to handle thousands to tens of thousands of data points.

6.2 Methodology

Participants in the user study worked with the bird data discussed in Section 5.1.2. The decision to remove the GLEI data (Section 5.1.1) from the user study was made because Microsoft Excel did not handle a dataset of that size well and tended to lag. Since the lag would affect the task performance time metric, we decided to let users work solely with the bird data so that the task performance time measurements could be as accurate as possible.

Before each participant started the user study, the computer randomly decided two things: the participant’s ID number, and which software tool (a spreadsheet application (in this case Microsoft Excel) or the visualization in the system) the user would start with. At the beginning of the study, we showed the participant the consent form and call to participation forms and allowed the participant to take as much time as they needed to read and sign them.
If the participant signed the forms, the user study would start.

We first explained what the participant would be doing (i.e., what kind of dataset the participant was looking at and what they would be asked to do). A brief explanation of the timer tool (Figure 6.1) was given. User study participants had a chance to test out how to use this timer tool before the study started. During the first training task, the timer asked users to click on the “Start Task” button, complete the demographic questionnaire (Appendix A), and then click on the “End Task” button. The goal of this task was to start getting participants used to using the timer interface to time their tasks. During the user study, the graduate student took notes about the participant’s use of the software tools, and a screen capture was run to try to track the way participants interact with software tools to understand data.

After the demographic questionnaire, the user study began. As mentioned previously, the computer randomly decided which software tool the participant started with, but regardless of which software tool they started with, they would be asked to perform the same task three times. The participant was asked to pretend that they were doing a report for work and determine whether or not they would use a particular data point in their report. The participant then examined three data points: 1) a data point that has been modified in non-conflicting ways since its introduction into the data set, 2) a data point that has been modified but the changes have not been reverted back by someone, and 3) a data point that has had its value consistently modified back and forth. After examining each data point, the participant was asked to complete a section of the task questionnaire (see Appendix B). Once the participant finished examining three data points with the initial software tool, he/she was asked to complete the General Questions portion of the task questionnaire.
The participant then repeated the above steps for the other software tool.

Figure 6.1: The timer interface that user study participants use to time task completion.

6.3 Results

Table 6.2 displays the average values for the quantitative data derived from the user study, specifically, the answers to the questions of whether or not the participant would use a data point in an important report (or research paper) they were writing and whether or not they trust the data they were looking at. Likelihood to Use in Report refers to how likely the participant would be to use that data point for an important report (or research paper) that he/she is working on. In that question, 1 refers to very unlikely while 5 refers to very likely. Ease to Determine Trustworthiness refers to how easy the participant found using the tool to determine whether or not data was worth trusting. For this question, 1 represents very easy and 5 represents very hard. Table 6.2 also includes averages for how long it took a participant to complete a task. An important point to note is that some task completion times could not be tracked due to issues like user error (i.e., forgetting to click “Start Task”). We failed to obtain a task completion time for six tasks and discarded two timer results that were outliers.
The outliers showed a task completion time of 0.02 minutes, which was not possible given what was observed during the user studies.

In the experiment, Task 1 was the data point that had been rarely modified (and modified only in non-conflicting ways) since its introduction to the data set, Task 2 was the data point that had been modified back and forth in conflicting ways, and Task 3 was the data point that had been modified frequently but not in ways where changes overrode each other.

  Software                Task Number   Likelihood to   Ease to Determine   Time to Task
                                        Use in Report   Trustworthiness     Completion
  System Visualization    1             3.17            2.93                5.88
                          2             2.27            2.00                3.73
                          3             3.33            3.00                5.39
  Microsoft Excel         1             3.73            3.47                8.50
                          2             1.60            2.53                6.03
                          3             3.07            3.07                8.54

Table 6.2: Averages of the quantitative data collected from the user study.

6.3.1 Two Factor Repeated Measures ANOVA

In order to determine whether or not significant user interaction and preference differences existed between the software visualization and a spreadsheet application like Excel, three two-factor repeated measures analyses of variance (ANOVA) were then run on the results. ANOVA is a statistical test designed to compare two or more group means to see if there are any significant differences between them. In the ANOVAs that we ran, we used α = 0.05, meaning that we are willing to take a 5% chance of having a false positive (i.e., saying there is a significant result when there is none). One of the results an ANOVA test gives is a probability value, denoted by p. The smaller the p value, the less likely it is that the difference in group means is just due to chance. In this thesis, as we use α = 0.05, any p value less than 0.05 will be considered significant.

The first ANOVA (Table 6.3 and Table 6.4) investigated if there were significant differences in task completion times between the two software tools and the three tasks performed.
We determined that there was a significant difference in time to task completion: users performed the three tasks much faster with the software visualization than with Microsoft Excel (p = 0.032). We further analyzed this significant result to determine the effect of each independent variable (software tool and the tasks performed) on the dependent variable (time to task completion). We found that the software tool had a significant main effect (p = 0.032) and the tasks performed did not have a significant main effect on time to task completion (p > 0.05, ns).

  Software                Task Number   Mean Time to Completion   Standard Deviation   N
  System Visualization    1             5.88                      2.39                 9
                          2             3.73                      1.67                 9
                          3             5.39                      1.98                 9
  Microsoft Excel         1             8.50                      4.76                 9
                          2             6.03                      2.24                 9
                          3             8.54                      6.77                 9

Table 6.3: Mean and standard deviation for the time to task completion. Although we had 15 participants, N = 9 because not all participants managed to successfully time all their tasks.

  Source                 df   Mean Square   F       Sig.    Partial Eta Squared
  Software               1    97.724        6.759   0.032   0.458
  Error(Software)        8    14.458
  Task                   2    29.276        2.805   0.090   0.260
  Error(Task)            16   10.438
  Software*Task          2    0.848         0.064   0.939   0.08
  Error(Software*Task)   16   13.348

Table 6.4: ANOVA table generated when determining if there were significant differences in time to task completion.

The second ANOVA (Table 6.5 and Table 6.6) examined whether or not there was a significant difference in how much a user trusted the data shown. We did not find any significant differences in trust across the two software tools (p > 0.05) but we did find significant differences when comparing across tasks (p < 0.01). This is what we expect as the three data points participants explored with the tool were designed to be trusted in radically different degrees. We wanted to test whether or not our tool affected how much a user would trust the data shown.
The ANOVA result has shown that our tool does not significantly affect how much a user trusts the data shown when compared to a standard spreadsheet application like Microsoft Excel.

  Software                Task Number   Mean Trust Value   Standard Deviation   N
  System Visualization    1             3.17               1.28                 15
                          2             2.27               1.53                 15
                          3             3.33               1.45                 15
  Microsoft Excel         1             3.73               0.80                 15
                          2             1.60               1.12                 15
                          3             3.07               1.44                 15

Table 6.5: Mean and standard deviation for the degree that participants trusted the data that was shown. Trust was ranked on a scale from 1 to 5 where 1 was very unlikely and 5 was very likely.

  Source                 df   Mean Square   F        Sig.    Partial Eta Squared
  Software               1    0.336         0.152    0.703   0.011
  Error(Software)        14   2.217
  Task                   2    19.836        11.756   0.000   0.456
  Error(Task)            28   1.687
  Software*Task          2    2.969         3.160    0.058   0.184
  Error(Software*Task)   28   0.940

Table 6.6: ANOVA table generated when determining possible significant differences in how much user study participants trusted the data they were shown.

The last ANOVA (Table 6.7 and Table 6.8) examined whether or not there was a significant difference in how easy it was for a user to determine whether or not the data should be trusted (i.e., ease to trust data). We did not find any significant differences in ease to trust data across the two software tools (p > 0.05) but we did find significant differences when comparing across tasks (p < 0.01). While not exactly what we expected, this could be due to the fact that some of the data points the users examined had more changes that happened across multiple columns, which made it harder for the users to determine whether to trust the data or not.
Results could have also been affected by issues related to the way the question was worded (Section 6.3.3) or by the participants being skeptical of the data they were shown (Section 6.3.2).

  Software                Task Number   Mean Ease to Trust   Standard Deviation   N
  System Visualization    1             2.93                 1.39                 15
                          2             2.00                 1.36                 15
                          3             3.00                 1.36                 15
  Microsoft Excel         1             3.47                 0.83                 15
                          2             2.53                 1.51                 15
                          3             3.07                 1.10                 15

Table 6.7: Mean and standard deviation for how easy participants found the tool to use when trying to determine how much to trust the data shown on the screen.

  Source                 df   Mean Square   F       Sig.    Partial Eta Squared
  Software               1    3.211         0.944   0.348   0.063
  Error(Software)        14   3.402
  Task                   2    7.433         7.766   0.002   0.357
  Error(Task)            28   0.957
  Software*Task          2    0.544         0.544   0.592   0.037
  Error(Software*Task)   28   1.021

Table 6.8: ANOVA table generated when trying to determine if significant differences existed in how easy it was for user study participants to examine the data.

6.3.2 Trust Differences

Something that the graduate students struggled with as a whole is the distinction between examining a dataset for reliability and accuracy as opposed to examining a data point for reliability and accuracy. Part of the problem had to do with the fact that they were asked to evaluate data from an unfamiliar domain. Without knowing sufficient details about the dataset, their scientific training took over, so they scrutinized data much more carefully and were much less trusting.

6.3.3 Results Discussion

Issues with the Questionnaire

When we first designed the questionnaire, we had designed the questions with the most casual of data users in mind. Thus, the questions used very general language that was not specific to researchers, and that became a large issue during the user study when a graduate student was the user study participant. The graduate students, as a whole, found the questionnaire’s general language quite consternating upon first read and spent a large amount of time at the beginning of the study clarifying what exactly each question meant.
The question that provided the most difficulty for the graduate students was the question “... how likely would you use this piece of data for a super important report?”. Clarification was sought about exactly how important the report was, what criteria they were expected to use to evaluate the data, and what exactly was meant by the phrase “use this piece of data”. The answers to those questions were that the report was equivalent to a research paper that the graduate student was trying to submit to a top-tier conference, and “use” of the data was defined as whether or not the graduate student would include that data point in the results section of their research paper. Most of the graduate students seemed mildly satisfied with this answer but a few persisted in trying to establish what criteria were required for validity. A few graduate students also took their task very seriously and started trying to calculate statistics such as average and standard deviation to establish a distribution. We quickly discouraged them from doing such a task and told them to judge the data point based on what they saw on screen, but those participants remained very skeptical of what they saw.

There were issues relating to the placement of the first question in the Task Questionnaire, the question that asked the user the date the data point was created (Appendix B). For some reason, many participants did not see that question and did not answer it. 16 out of 90 answers (15 participants × 6 instances of question one per participant) to the first question in the Task Questionnaire went unanswered. In the future, this problem could potentially be mitigated if we put question numbers in front of the questions to help alert users to the fact that there may be a question that has not been answered yet.

The Likert scales in the Task Questionnaire (Appendix B) also posed a problem.
Some participants did not find the scale intuitive (e.g., if 1 = very likely and 5 = very unlikely, a participant might accidentally assume that 1 = very unlikely and 5 = very likely). The questions involving Likert scales were highlighted to draw attention to the fact that 1 may not represent what the participant is used to, and the participant was also reminded of this at the beginning of the user study. In the future, it would be wise to have participants directly circle “Very Unlikely”, “Unlikely”, “Possibly”, “Likely”, and “Very Likely” rather than a number 1 to 5.

Dataset Confusion

It would have been, perhaps, less problematic if we had used data that required less specific domain knowledge (e.g., Netflix viewer data over the last five years, stock price fluctuations, or the Wikipedia edit history of a few select articles). Many of the user study participants found it hard to determine whether or not they would trust the data simply because they did not know enough about the data. Using data that is more general would help mitigate these issues. The only participants who did not really question the data were the undergraduates and the participants who had a biology background.

Usability of System Visualization

A subtle but important distinction to make is the difference between having all the data available and having the data presented in a usable way. Based on the task questionnaire answers, participants using the system visualization generally answered the questions correctly. A common point of confusion was the interchanging of the “number of people who changed the data” and the “number of changes that have occurred to a data point”. In reality, one person can make many different changes to a data point, but many participants assumed that one person only makes one change.
The explanation and the visualization should have also made the distinction clearer.

A second point that almost all participants mentioned was the fact that there was a lot of scrolling involved in the visualization. This is mainly in the day view (Figure 4.10) and partly because we wanted to represent how long a data point has gone unmodified. Initially, we believed that participants would be more skeptical of data if it had not been altered for a long time. After interviewing the user study participants, we only found one participant who agreed with this view. The general consensus was that the time of the changes would be something that is looked at in a secondary evaluation of the dataset, and the importance of the frequency of changes would depend on the dataset itself. Participants found the scrolling to be a pain, and it would cause them to easily miss a bar indicating a change.

Some users in the user study seemed confused by the lack of rows in the visualization. A spreadsheet or tool that uses a row paradigm to display data (e.g., MySQL) has been the de-facto standard for many of our participants. During the demographic survey, 9 out of 15 participants indicated that they use tools that display data in a row paradigm (i.e., spreadsheets and/or relational databases). As the visualization does not use a row paradigm, some participants were not used to seeing changes displayed in that format.

Determining a General Superset of Values

One of the motivating tasks in this thesis was to help non-computer experts use large datasets; so, when designing the visualization, we had hoped to only show the most basic information to avoid overwhelming and confusing non-experts. We found spreadsheet applications, while sufficient to get the job done, often came with many issues such as changes being easily missed due to the sea of text on screen. Furthermore, some participants do not know how to take advantage of Excel’s functions.
For example, 4 out of 15 participants did not know about the freeze pane functionality and had to continuously scroll up and down, while 3 participants did not know that you could sort data until prompted.

A core part of the visualization has to do with obfuscating data that has not changed over time, to avoid having the user become overloaded with information and text. However, this approach seems to have backfired, as the user study participants considered “experts” (i.e., the PhD students) found the obfuscation of table cells with no data changes much more troubling than the “non-expert” participants did. While the “non-expert” participants seemed to find it normal that no changes were displayed and did not question the fact that data was missing, the PhD students tended to be much more cautious and paused for longer before writing down their answers on the questionnaire. The PhD students would also ask more questions about the data and confirm that nothing was going wrong with the visualization before proceeding with the questionnaire. They also tended to make comments like “I need to know who collected this data and the data collection methods used”, which affected their use of the system visualization, as they were quite skeptical of the data. As the user study asked participants to examine a dataset from a domain many of them were unfamiliar with, it is understandable that they were distressed at the thought of trying to determine blindly whether or not they should use a dataset. However, in a normal real-life scenario, users tend to know where a dataset came from and what the general method of data collection was; so, we felt that the visualization should not have to deal with trying to display this data.

What People Look for in Datasets

At the end of the user study, we followed up with two sets of questions.
One set of questions asked about the system visualization and whether or not users understood the visualization encodings used (specifically the height and colour of the bars). The height of the bars represented how many changes had occurred in that year/month/day (depending on which time granularity they were looking at), while the colour of the bars represented how frequently the changes occurred (Section 4.11). 13 out of 15 participants were asked about their understanding of the visual encodings after the conclusion of their user study. Of those 13 participants, only 5 understood the height encoding, only 1 understood the colour encoding, and 6 understood both the height and colour encodings.

To follow up on the height and colour encoding question, participants were also asked what qualities they look for when evaluating a dataset for reliability and accuracy. Common answers included the semantics of the data, completeness/sparsity, clustering/outliers, the distribution of data points, and whether or not the data points fit a distribution (participants mentioned they would be suspicious if the data did not fit or if the data fit too well). When asked if frequency was an important quality to consider when evaluating data, most participants answered that the dataset itself would determine their answer to that question. It seems that frequency is not as important as we originally thought. Some participants mentioned that in their specific dataset, having frequent changes is not only normal but a good sign.

Many of the user study participants emphasized that the first thing they look at is whether or not the data makes semantic sense to them. This makes a lot of sense, as people do not tend to evaluate random datasets; more often than not, people want to find a specific answer or at least know enough background information to understand the dataset.
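For concreteness, the two encodings could be computed roughly as follows. This is a sketch only; the rate thresholds and data layout are invented for illustration and are not the values the thesis's visualization actually uses:

```python
from datetime import datetime

def bar_encoding(timestamps, period_days):
    """Map one period's changes to (height, colour): height is the number
    of changes in the period, colour reflects how frequently they occurred.
    The thresholds below are illustrative assumptions only."""
    height = len(timestamps)
    rate = height / period_days  # changes per day within this bar's period
    if rate >= 1.0:
        colour = "dark"      # very frequent changes
    elif rate >= 0.1:
        colour = "medium"
    else:
        colour = "light"     # rare or no changes
    return height, colour

# Four changes spread over a 31-day month.
stamps = [datetime(2015, 3, d) for d in (1, 2, 2, 9)]
print(bar_encoding(stamps, period_days=31))  # → (4, 'medium')
```

Even with the mapping this explicit, the study results above suggest the colour channel in particular needs a legend or direct labelling to be understood.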
Unfortunately, it is hard to pick out general visual encodings that can help users understand the semantic meaning of data, as that would be very domain specific.

Trends in Qualitative Feedback

4 out of 15 participants mentioned that they enjoyed the fact that the data was laid out from the point of view of time. They liked seeing the “big picture” of the data before drilling down into more detail. However, the other participants tended to view data from the point of view of the data itself, where they are solely interested in the changes that occurred to the data rather than the duration in which the changes occurred. While future iterations of the visualization could continue to use the “overview first, detail later” [36] encoding when viewing data, if we incorporate the row paradigm (as discussed in the Usability of System Visualization subsection within Section 6.3.3), it would not make sense to continue using the “overview first, detail later” method of viewing data, as it would be hard to generalize what a data point’s cells should look like when aggregated. It would be hard to come up with rules to determine when a column’s values should be averaged as opposed to taking the most frequently occurring value; the most representative value of a column often depends on the semantics of the data it is displaying.

Two users suggested that the name of the person who made the data changes be displayed in the visualization. This could be done; but, in a general situation, it may not always make the most sense. With datasets that involve many different people (often complete strangers on the Internet), especially open source data like Wikipedia edit history or collaborative data like the bird dataset we worked with (Section 5.1.2), listing the individual responsible for a certain change may not mean much.
We could list the individual anyway, but this may start to go against the principle of minimizing the amount of data displayed to the user.

Chapter 7

Future Work

7.1 Intelligently Inferring Where to Add a Column

As the system currently stands, users choose which table a new column gets added to by telling the system which columns (out of the ones they are currently viewing) the newly added column is most similar to. In the future, this process could be automated based on semantic mapping from multiple sources [28]. To determine which column fits better with which table, we could use semantic similarity distance [33] to help determine where a column may be a better fit. However, this type of technique would only be useful for a certain type of data, namely non-numeric data. With the GLEI data (see Section 5.1.1), if we wanted to add a column, the data within the other columns would provide a basis for trying to determine which set of columns the new column is most similar to, since the data captured within the GLEI dataset is mostly words. With the bird data (see Section 5.1.2), most of the data is numeric, and thus detecting similarity by semantics would not work. Presumably, we could use the column title to infer which columns are similar to each other, but as column titles are not standardized or regulated, this would not be a reliable method for detecting similarity.

Another possible avenue to explore might be trying to determine where a column should belong based on what columns the user has worked on in the past. In the case of collaborating researchers, it is likely that each researcher has a different topic of interest they are pursuing, which would mean that each researcher looks at a slightly different set of columns. If a user has historically viewed a certain set of columns, it could be likely that the newly added column would belong to that set of columns as opposed to a table that the user has never looked at.
Whether or not historical data about a user’s data preferences could help guide a heuristic for automatically adding a column to a table is a subject for future research. Such a heuristic could help users use the data and prevent tables from becoming cluttered with irrelevant columns over time.

7.2 Inferring the Transformation Script

Being able to look at a before and after snapshot of data to generate a transformation script describing what has happened to the data would be useful. Computer errors and crashes could cause a loss of transformation scripts. Transferring a data file from one computer to another could cause the transformation script to be left behind. No matter what the cause, it would be useful to recapture the transformations that a dataset has gone through in order to preserve data usefulness for all users involved. While we examined a more brute-force approach to regenerating transformations (see Section 4.12), it is possible that a clustering algorithm or machine learning approach could be used to regenerate a transformation script. Much like how data transformation rules can be inferred for data integration purposes [37] or how a system can learn from user actions to infer what data the user wants to enter [21], it would be interesting to see if we could automatically infer a transformation script via the same mechanisms.

7.3 Achieving Full Synchronization

Full synchronization in this system would require implementing a REST API for the server in order for client applications to know when to update. The client would send the transformation script as part of the parameter arguments to the server, and the server would determine if there are any conflicts.
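That conflict check on the server could be sketched as follows. The payload shape, cell keys, and function name here are invented for illustration; the system's actual REST API is future work, not specified here:

```python
def handle_push(transformation_script, server_state):
    """Sketch of the server side of the synchronization exchange: the client
    sends its transformation script; the server applies it if every step's
    expected old value still matches, and otherwise reports the conflicts.
    All names and the payload layout are hypothetical."""
    conflicts = [
        step for step in transformation_script
        if server_state.get(step["cell"]) != step["expected_old_value"]
    ]
    if not conflicts:
        for step in transformation_script:
            server_state[step["cell"]] = step["new_value"]
        return {"status": "ok"}
    return {"status": "conflict", "conflicts": conflicts}

state = {"A1": "robin", "A2": "sparrow"}
script = [{"cell": "A1", "expected_old_value": "robin", "new_value": "finch"}]
print(handle_push(script, state))  # {'status': 'ok'}; state["A1"] is now "finch"
```

A second push against "A1" that still expects "robin" would come back with `status: "conflict"` and the offending step listed.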
If there are no conflicts, the server would send back a message to say the push was successful; if there are conflicts, the server would send back a list of the conflicts.

7.4 Scalability

Currently, when a user pushes his/her updates to the centralized server, the server has to lock down the tables being updated and prevent other users who may be pushing changes to those same tables from trying to push their changes at the same time. Although this delay is not expected to last long and is capped at ten minutes (Section 4.3), as the number of users grows, this approach will not scale. Performance would start to suffer drastically as the queue of pending updates grows faster than what the server can process. The bottleneck comes from the server locking down tables. How to scale, as well as determining the breaking point of the system, seem to be logical next steps if the system were to be developed further.

7.5 Determining a Good Conflict Time Measure

Currently, our system uses two hours as the time limit for detecting data conflicts. If there are changes to the same piece of data within two hours of one another, we consider that a conflict. However, future investigation could determine a better heuristic. It seems that newer databases would be more prone to corrections and thus could use a more generous time limit, as people will be likely to catch multiple errors as they start to use and explore the data. However, as a dataset matures, it is likely that many of the errors have already been caught, and a shorter time period would be required in order to detect any abnormal changes.

7.6 Automatically Adjusting Trust Values

A large part of what data gets accepted into our system depends on the trust value of the person who is trying to push the change. In our system, the trust value is set by the DBA.
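The two-hour window from Section 7.5 is just a parameter, so a maturity-based heuristic could later tune it without changing the detection logic. A minimal sketch of the window check, assuming a simple (timestamp, cell) change log (the data layout is an assumption, not the system's actual storage format):

```python
from datetime import datetime, timedelta

CONFLICT_WINDOW = timedelta(hours=2)  # the system's current setting

def find_conflicts(changes, window=CONFLICT_WINDOW):
    """Flag consecutive changes to the same cell that fall within `window`
    of each other.  `changes` is a list of (timestamp, cell) tuples."""
    conflicts = []
    last_seen = {}  # cell -> timestamp of its most recent change
    for ts, cell in sorted(changes):
        prev = last_seen.get(cell)
        if prev is not None and ts - prev <= window:
            conflicts.append((cell, prev, ts))
        last_seen[cell] = ts
    return conflicts

changes = [
    (datetime(2016, 4, 1, 9, 0), "B2"),
    (datetime(2016, 4, 1, 10, 30), "B2"),  # 90 minutes later: a conflict
    (datetime(2016, 4, 1, 15, 0), "B2"),   # 4.5 hours later: fine
]
print(find_conflicts(changes))
```

Loosening the window for a young dataset or tightening it for a mature one then amounts to passing a different `timedelta`.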
The original intent was that if a user was a casual user of the data (e.g., a community partner interested in the data), his/her changes may be less trustworthy than an expert’s (e.g., a researcher in that area). However, as people become experts in the data over time, this trust value should also adjust. Currently, there is no method in the system to allow for this flexible adjustment of trust values; the DBA has to change it manually. It will be interesting to see what kind of metadata would need to be stored in order for this flexible trust value change to occur. A possibility may be to track how many of a user’s changes have been accepted, rejected, or reverted by other users in the system and then use this information to influence the trust value.

Figure 7.1: A possible design iteration on the system visualization that tries to show users the data characteristics that user study participants have deemed useful.

7.7 Visualization Improvements

As discussed in Section 6.3.3, the system visualization could try to improve user interaction as well as try to visualize general data characteristics that users of all expertise levels (from users with a casual interest in data to serious researchers in the area) find useful. As most of the user study participants intuitively understood that the height of a bar represented the number of changes to a cell, it is perhaps best to continue to use that visual encoding. A possible design iteration prototype is annotated in Figure 7.1, and the way it operates when a user interacts with it is shown in Figure 7.2 and Figure 7.3.

Figure 7.2: Progression of how a change would start to appear in the proposed new system visualization design.

7.8 Automatically Pushing Changes

For users who are working with data that is not stored locally (i.e., scenario three from Chapter 1), a future avenue of research would be to find a heuristic to determine the optimal time to push changes for minimizing workflow disruptions.
This could be a heuristic based on the user’s pattern of changes, to see if we can detect any period of inactivity or determine a period where the overhead of pushing a set of changes will not disturb the user. We could also see if it is possible to calculate an optimal subset (within the set of changes that need to be pushed) to push so that the changes the user is currently working on will not be affected. Only part of the system would be locked, possibly down to the column granularity, which would help improve system performance.

Figure 7.3: How multiple changes would appear in the new proposed system visualization design.

Chapter 8

Conclusion

Data is everywhere, and with the increasing connectedness of the world, the demand for collaborative data sharing is increasing rapidly. From research lab members collaborating with each other to Internet users working together to jot down song lyrics, data takes many forms and can be part of many different workflows. In this thesis, we have defined four general data usage situations that involve users collaboratively using a centralized database to update data. To handle these four situations, we have examined a system that helps users manage their data updates and transformations in a centralized database environment. The system design decisions were informed by existing literature, and we have carefully thought about the various constraints and tradeoffs that must exist for this system to work. Special effort was put into helping users understand the changes their data has undergone by creating a human-readable transformation script (Section 4.1) and by using a visualization (Section 4.11).

To determine whether or not the system built can handle the first two data usage scenarios, we simulated some data changes and ran the changes through the system as if a user were really pushing changes to the database at random times. We then checked the database to see if the final values were as expected (Chapter 5).
We also ran a user study to gauge how well our visualization helped users work with their data. The results from the user study did not find a significant difference between our visualization and Microsoft Excel in terms of how much users trusted the data and how easy it was for users to determine whether or not they trust a certain piece of data. However, the user study did find that users of our visualization took significantly less time to determine the changes a piece of data has gone through over time (Chapter 6). We then ended this thesis with a discussion of future directions this work can take (Chapter 7).

Bibliography

[1] C. C. Aggarwal. Trio: A system for data uncertainty and lineage. In Managing and Mining Uncertain Data, pages 1–35. Springer, 2009. → pages 3, 10

[2] D. Agrawal, A. El Abbadi, A. Singh, and T. Yurek. Efficient view maintenance at data warehouses. In ACM SIGMOD Record, volume 26, pages 417–427. ACM, 1997. → pages 12, 13

[3] P. Agrawal, A. Silberstein, B. F. Cooper, U. Srivastava, and R. Ramakrishnan. Asynchronous view maintenance for VLSD databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 179–192. ACM, 2009. → pages 12

[4] O. AlOmeir. A study of provenance in databases and improving the usability of provenance database systems. Master’s thesis, University of British Columbia, 2015. → pages 42

[5] P. A. Bernstein and T. Bergstraesser. Meta-data support for data transformations using Microsoft Repository. IEEE Data Eng. Bull., 22(1):9–14, 1999. → pages 11

[6] P. A. Bernstein and S. Melnik. Model management 2.0: Manipulating richer mappings. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1–12. ACM, 2007. → pages 37

[7] F. Bry and S. Schaffert. Towards a declarative query and transformation language for XML and semistructured data: Simulation unification. In Logic Programming, pages 255–270. Springer, 2002. → pages 12

[8] P. Buneman, S. Khanna, and T. Wang-Chiew.
Why and where: A characterization of data provenance. In Database Theory—ICDT 2001, pages 316–330. Springer, 2001. → pages 11

[9] K. K. Chan and A. Milne. The global legal entity identifier system: Will it deliver? Available at SSRN 2325889, 2013. → pages 41

[10] L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. DBNotes: A post-it system for relational databases based on provenance. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 942–944. ACM, 2005. → pages 9, 10

[11] Y. Cui and J. Widom. Lineage tracing in a data warehousing system. In Proceedings of the 16th International Conference on Data Engineering, pages 683–684. IEEE, 2000. → pages 11

[12] T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning, volume 479 of Wiley Series in Probability and Statistics. John Wiley & Sons, 2003. → pages ii

[13] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model, and algorithms. Research Report RR-4149, 2001. URL https://hal.inria.fr/inria-00072476. Projet CARAVEL. → pages 12

[14] W. Gatterbauer, M. Balazinska, N. Khoussainova, and D. Suciu. Believe it or not: Adding belief annotations to databases. Proc. VLDB Endow., 2(1):1–12, Aug. 2009. ISSN 2150-8097. → pages 9, 10

[15] B. Glavic and K. R. Dittrich. Data provenance: A categorization of existing approaches. In BTW, volume 7, pages 227–241. Citeseer, 2007. → pages 11

[16] R. Goldman, J. McHugh, and J. Widom. From semistructured data to XML: Migrating the Lore data model and query language. In ACM SIGMOD Workshop on The Web and Databases (WebDB 1999), 1999. → pages 11

[17] T. J. Green, G. Karvounarakis, N. E. Taylor, O. Biton, Z. G. Ives, and V. Tannen. Orchestra: Facilitating collaborative data sharing. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1131–1133. ACM, 2007. → pages 8, 9

[18] T. J. Green, G. Karvounarakis, Z. G. Ives, and V. Tannen. Provenance in Orchestra.
IEEE Data Engineering Bulletin, 33(3):9–16, 2010. → pages 3, 8

[19] D. Hedgebeth. Data-driven decision making for the enterprise: An overview of business intelligence applications. VINE, 37(4):414–420, 2007. → pages 1

[20] J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008. → pages ii

[21] Z. Ives, C. Knoblock, S. Minton, M. Jacob, P. Talukdar, R. Tuchinda, J. L. Ambite, M. Muslea, and C. Gazen. Interactive data integration through smart copy & paste. arXiv preprint arXiv:0909.1769, 2009. → pages 60

[22] Z. G. Ives, N. Khandelwal, A. Kapur, and M. Cakir. Orchestra: Rapid, collaborative sharing of dynamic data. In CIDR, pages 107–118, 2005. → pages 8

[23] Z. G. Ives, T. J. Green, G. Karvounarakis, N. E. Taylor, V. Tannen, P. P. Talukdar, M. Jacob, and F. Pereira. The Orchestra collaborative data sharing system. ACM SIGMOD Record, 37(3):26–32, 2008. → pages 8

[24] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3363–3372. ACM, 2011. → pages 3, 11, 12, 13, 25

[25] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Proceedings of the International Working Conference on Advanced Visual Interfaces, pages 547–554. ACM, 2012. → pages 11, 12

[26] L. V. Lakshmanan, F. Sadri, and S. N. Subramanian. SchemaSQL: An extension to SQL for multidatabase interoperability. ACM Transactions on Database Systems (TODS), 26(4):476–519, 2001. → pages 11

[27] M. Lawley and J. Steel. Practical declarative model transformation with Tefkat. In Satellite Events at the MoDELS 2005 Conference, pages 139–150. Springer, 2006. → pages 12

[28] Y. Li, Z. A. Bandar, and D. McLean. An approach for measuring semantic similarity between words using multiple information sources.
IEEE Transactions on Knowledge and Data Engineering, 15(4):871–882, 2003. → pages 59

[29] J. A. Marsh, J. F. Pane, and L. S. Hamilton. Making sense of data-driven decision making in education. 2006. → pages 1

[30] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer. Provenance-aware storage systems. In USENIX Annual Technical Conference, General Track, pages 43–56, 2006. → pages 9

[31] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000. → pages 12

[32] V. Raman and J. M. Hellerstein. Potter’s Wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Databases, volume 1, pages 381–390, 2001. → pages 11, 12, 13

[33] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. CoRR, abs/cmp-lg/9511007, 1995. → pages 59

[34] M. Roth and W.-C. Tan. Data integration and data exchange: It’s really about time. In CIDR. Citeseer, 2013. → pages 20

[35] M. Seltzer, K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and J. Ledlie. Provenance-aware storage systems. → pages 9

[36] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages, pages 336–343. IEEE, 1996. → pages 58

[37] B. Spencer and S. Liu. Inferring data transformation rules to integrate semantic web services. Proceedings of the International Semantic Web Conference, 3298:456–470, 2004. → pages 60

[38] W. C. Tan. Research problems in data provenance. IEEE Data Engineering Bulletin, 27(4):45–52, 2004. → pages 11

[39] P. Tarau. An embedded declarative data transformation language. In Proceedings of the 11th ACM SIGPLAN Conference on Principles and Practice of Declarative Programming, pages 171–182. ACM, 2009. → pages 12

[40] A. M. Thrastarson. Managing updates and transformations in data sharing systems. Master’s thesis, University of British Columbia, 2014. → pages 2, 3

[41] J. Zhou, P.-A. Larson, and H. G.
Elmongui. Lazy maintenance of materialized views. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 231–242. VLDB Endowment, 2007. → pages 12

[42] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a warehousing environment. ACM SIGMOD Record, 24(2):316–327, 1995. → pages 12

Appendix A

Demographic Questionnaire

1. What age group are you in?
• 18 and under
• 19 to 25
• 26 to 35
• 36 to 45
• 45 and above

2. Gender:
• Male
• Female

3. Which department do you belong to?

4. What program are you a part of?

5. Which year of your program are you currently in?

6. On a scale from 1 to 5, with 1 being very unfamiliar and uncomfortable with using computers and 5 being very familiar and comfortable with computers, what would you rank your comfort level as?
1 2 3 4 5

7. Do you work with data in your normal day-to-day work?
• Yes
• No

8. If your answer to question 7 was yes, what size is the data you work with and what do you do with the data?

9. If your answer to question 7 was yes, do you use any particular tools to work with the data?

10. What do you consider as a large amount of data (e.g., file size, # of rows in a spreadsheet)?

Appendix B

Task Questionnaire

Case 1

When was this piece of data first created?

How many people have modified this piece of data and what values has it been modified to? Please list all the changes made to this piece of data.

On a scale from 1 to 5, with 1 being very unlikely and 5 being very likely, how likely would you use this piece of data for a super important report?
1 2 3 4 5

Why did you choose this rating?

On a scale from 1 to 5, with 1 being very easy and 5 being very hard, rate how easy it is to figure out if this piece of data is trustworthy.
1 2 3 4 5

Why did you choose this rating?

Case 2

When was this piece of data first created?

How many people have modified this piece of data and what values has it been modified to?
Please list all the changes made to this piece of data.

On a scale from 1 to 5, with 1 being very unlikely and 5 being very likely, how likely would you use this piece of data for a super important report?
1 2 3 4 5

Why did you choose this rating?

On a scale from 1 to 5, with 1 being very easy and 5 being very hard, rate how easy it is to figure out if this piece of data is trustworthy.
1 2 3 4 5

Why did you choose this rating?

Case 3

When was this piece of data first created?

How many people have modified this piece of data and what values has it been modified to? Please list all the changes made to this piece of data.

On a scale from 1 to 5, with 1 being very unlikely and 5 being very likely, how likely would you use this piece of data for a super important report?
1 2 3 4 5

Why did you choose this rating?

On a scale from 1 to 5, with 1 being very easy and 5 being very hard, rate how easy it is to figure out if this piece of data is trustworthy.
1 2 3 4 5

Why did you choose this rating?

