Validation of SQL Queries over Streaming Warehouses

by

Ritika Jain

B.Tech Computer Science, Vellore Institute of Technology, Vellore, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

August 2017

© Ritika Jain, 2017

Abstract

There is often a need to recover the "missing" query that produced a particular output from a data stream. For example, since a data stream is constantly evolving, an analyst may wish to take a query issued in the past and evaluate it on the current state of the stream for further analysis. Previous research has studied the problem of reverse engineering a query that would produce a given result at a particular database state.

We study the following problem. Given a streaming database D = 〈D0, D1, D2, ...〉, a result Rout, and a set of candidate queries Q, efficiently find all queries Qi ∈ Q such that, for some state Dj of the stream, Qi(Dj) = Rout, and report the pair (Q, witQ), where witQ is the witness of (in)validity. A witness for a valid query Qval is a state Di such that Qval(Di) = Rout. For an invalid query Qinval, a witness is a pair of consecutive states (Di, Di+1) such that Rout \ Qinval(Di) ≠ ∅ ≠ Qinval(Di+1) \ Rout.

We allow any PTIME-computable monotone query to be included in Q. While techniques developed in previous research can be used to generate the candidate query set Q, we focus on developing a scalable strategy for quickly determining the witness. We establish theoretical worst-case performance guarantees for our proposed approach and show that its cost is no more than a factor of O(log |DRDS|) times that of the optimal "lucky guess" strategy, where Q(DRDS) = Rout. We empirically evaluate our technique and compare it with natural baselines inspired by previous research. We show that the baselines either fail to scale or incur an inordinate amount of overhead by failing to take advantage of natural properties of a data stream. By contrast, our strategy scales effortlessly to very large data streams. Moreover, it never performs more than a small constant times the optimal amount of work, regardless of the state of the data stream that may have led to Rout.

Lay Summary

Data is constantly evolving and new data records are added every day, for example in a weather monitoring system. Data enthusiasts often like to ask the same question repeatedly over time to find evolving trends. Now, consider the scenario where a data analyst has queried this data at some time in the past and stored the answer that she obtained. However, due to negligence, the question that she asked is no longer available, although she still has access to the stored answer. In our work, we wish to help her find the "missing" question that she asked of the data at a previous point in time. We do this using the stored answer and a set of candidate questions that she might have asked.

Preface

This thesis is submitted in partial fulfillment of the requirements for a Master of Science degree in Computer Science. All the work presented in this dissertation is original work of the author, created in collaboration with Divesh Srivastava and Dmitri Kalashnikov of AT&T Labs Research, and performed under the supervision of Prof. Laks V. S. Lakshmanan.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Challenges
  1.2 Geometric Chunking
  1.3 Contributions
2 Background and Problem Statement
3 Solution Strategies
  3.1 Baseline Strategies
    3.1.1 Backward Naïve strategy
    3.1.2 Forward Naïve strategy
  3.2 Static Chunking Strategy
  3.3 Geometric Chunking Strategy
  3.4 Log Index
4 Technical Results
  4.1 Assumptions and Cost Model
  4.2 Basic Operations
  4.3 The Oracle
  4.4 Overview of Results
  4.5 No Views
    4.5.1 Geometric Chunking
    4.5.2 Log Index
    4.5.3 Static Chunking
  4.6 With Views
5 Experiments
  5.1 Environment and Setting
    5.1.1 Database Log Generation
    5.1.2 Workload Generation
    5.1.3 Default Parameter Configuration
  5.2 Comparison with baselines
  5.3 Comparison with Static Chunking
  5.4 Comparison with the oracle
  5.5 Impact of Log Index
  5.6 Scalability in terms of available memory
  5.7 Varying Configuration
    5.7.1 Impact of Base Chunk Size
    5.7.2 Tuning the geometric multiplier 'c'
    5.7.3 View Maintenance
    5.7.4 Full query workload versus right query workload
6 Related Work
  6.1 Query Reverse Engineering (QRE)
    6.1.1 Deriving Instance Equivalent Queries (IEQs)
    6.1.2 Query from Examples (QFE)
    6.1.3 Targeted Query Generation
  6.2 Reverse Query Processing
7 Conclusion
Bibliography

List of Tables

Table 5.1  Comparison of Geometric with Naïve approaches at RDS 1, 10 million. Last TS: 11,000,000.
Table 5.2  Performance improvement using View Maintenance

List of Figures

Figure 2.1  Example database state and queries.
Figure 5.1  TPC-H Schema
Figure 5.2  Example showing partial tables to illustrate database log generation
Figure 5.3  Schema graphs for default queries.
Figure 5.4  Expression complexity (on X-axis) and time complexity (on Y-axis) for all queries, for the Geometric approach at RDS 10 million
Figure 5.5  Parameter configuration (default in bold); SF = scale factor.
Figure 5.6  Static versus Geometric Chunking (RDS on x-axis is shown in log scale) for Q15
Figure 5.7  Static versus Geometric Chunking (RDS on x-axis is shown in log scale) for Q22
Figure 5.8  Static versus Geometric Chunking (RDS on x-axis is shown in log scale) for Q23
Figure 5.9  Static versus Geometric Chunking for Q15. Ratio = time of approach/MIN(.). Both axes are shown on log scale.
Figure 5.10 Static versus Geometric Chunking for Q22. Ratio = time of approach/MIN(.). Both axes are shown on log scale.
Figure 5.11 Static versus Geometric Chunking for Q23. Ratio = time of approach/MIN(.). Both axes are shown on log scale.
Figure 5.12 Histogram of Oracle vs. Geometric. Ratio = Total Time taken by Geometric/Total Time taken by Oracle.
Figure 5.13 Log Index with different initial jumps versus Geometric for Q15. X-axis shown in log scale. Y-axis shows Ratio = Time taken by approach/Geometric
Figure 5.14 Log Index with different initial jumps versus Geometric for Q22. X-axis shown in log scale. Y-axis shows Ratio = Time taken by approach/Geometric
Figure 5.15 Log Index with different initial jumps versus Geometric for Q23. X-axis shown in log scale. Y-axis shows Ratio = Time taken by approach/Geometric
Figure 5.16 Varying the available memory provided to SQL Server.
Figure 5.17 Varying base chunk size. X-axis is in log scale.
Figure 5.18 Varying c. X-axis showing RDS is in log scale.
Figure 5.19 Right Query versus Full Query Workload: q15. X-axis showing RDS is in log scale.
Figure 5.20 Right Query versus Full Query Workload: q22. X-axis showing RDS is in log scale.
Figure 5.21 Right Query versus Full Query Workload: q23. X-axis showing RDS is in log scale.
Figure 5.22 %age increase = (Full Query Workload − Right Query Workload)/Right Query Workload * 100. X-axis showing RDS is in log scale.
Glossary

BS    Binary Search
DBLP  Digital Bibliography & Library Project
FK    foreign key
GC    Geometric Chunking
LI    Log Index
LS    Linear Search
PK    primary key
RDS   Right Database State
RIC   referential integrity constraint
SPJU  selection-projection-join-union
SQL   Structured Query Language

Acknowledgments

I would like to thank my supervisor, Dr. Laks V. S. Lakshmanan, whose selfless time and care were sometimes all that kept me going. I would also like to thank our collaborator, Dr. Divesh Srivastava, whose attention to detail drove me to produce nothing short of perfection.

I am grateful to all the professors with whom I got an opportunity to interact and discuss my ideas. In particular, I would like to thank Ivan Beschastnikh for his constant moral support and for providing me with Azure credits towards running my experiments. I am also very grateful to Hu Fu for providing feedback and suggestions on my thesis.

I am extremely grateful to my collaborators and lab-mates for their useful discussions and morale boosts. I would like to thank my friends for being with me through all the ups and downs during this exciting journey at UBC.

Last but not least, I am thankful to my family for always being there for me.

Chapter 1
Introduction

There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take lead in the introduction of a new order of things — Niccolo Machiavelli

Database management systems excel in recording dynamic data and in efficiently answering complex ad hoc queries over this data. The need for these core capabilities has only grown in recent years, with the increasing availability of a large variety of transactional data feeds about different facets of the real world, and the concomitant desire by data enthusiasts to continually analyze this data to make data-driven decisions. As an example, the NYC Citi Bike data records the location, time and user of every checkout and checkin of a Citi Bike in New York City, and makes a version of this growing data set available for general use (https://www.citibikenyc.com/system-data, visited on 24/08/2017). City planners may want to query this data to find the trips taken by Citi Bikers during peak commute hours on different weekdays, Citi Bike riders may want to know which bike stations are popular for checkouts and checkins in the vicinity of Grand Central station, and so on.

As new data becomes available, data enthusiasts tend to issue the same queries repeatedly over time to understand the evolving nature of the data. For example, a city planner querying the NYC Citi Bike data may issue the same query every few days to understand the impact of construction in the NYC Penn Station area on the trips taken during peak commute hours on different weekdays. Unless the data enthusiast is highly disciplined and diligently records detailed metadata with each query result, a common problem that arises is that it is easy to lose track of which query result was generated for which version of the data, making it difficult to do meaningful longitudinal analysis on this data.
This is the core problem of Query Validation that we address in this work: given a database log L of records that are incrementally added to the database over time, a monotone query Q, and a result table Rout, efficiently determine whether there exists a database state Dr, containing all and only the records in log L up to time r, such that Q(Dr) = Rout.

Our core problem is relevant to many applications, including scientific reproducibility and query reverse engineering. For example, the Digital Bibliography & Library Project (DBLP) dataset is a popular dataset for benchmarking and reporting experiments in the social networks and database systems communities. It is always evolving with time in an append-only fashion. The scientific reproducibility problem can be stated as follows: given experimental results that were produced on the DBLP data in the past, reproduce the results by identifying the relevant database snapshot; e.g., Baid et al. run a known query on an unknown state of the DBLP database. In query reverse engineering, given a database log and a result table, Tran et al. try to identify a concise query and a recent database state that could have generated the result table.

1.1 Challenges

While our core problem is simply stated and has a straightforward solution (run the query Q on every database state Dj generated by the log L up until time j, for all j ≥ 0, and check whether Q(Dj) = Rout), it is easy to see that this solution can be extremely inefficient for big logs and complex queries. An optimal solution here would obviously need to make a "lucky guess" of the correct time r in the log L, load all the log records up until time r into a database to get the correct database state Dr, and validate that Q(Dr) = Rout. Our key result in this work is that it is possible to devise an algorithm that, without making any lucky guesses, is close-to-optimal in that its efficiency is within a logarithmic factor of the optimal algorithm.

What would such an algorithm look like? The "logarithmic factor" might lead the astute reader into thinking that binary search plays a role here, possibly by first loading the entire log L into a database and then doing a binary search on the possible database states w.r.t. the log to identify the correct one. Note, however, that any solution that needs to first load the entire log L (with n records) into a database cannot be close-to-optimal if the correct database state has, say, fewer than log(n) records; in this case, just loading the entire log takes exponentially longer than the straightforward solution described above (for queries with polynomial data complexity).

1.2 Geometric Chunking

Our proposed algorithm, which we call Geometric Chunking, works in two phases.

1. In the first phase, it starts by loading a fixed-size chunk of the log (in terms of the number of log records, say 1) into the database, then iteratively loads chunks of geometrically increasing size (say, doubling) into the database so long as the correct database state can still be encountered subsequently: this can be checked efficiently by comparing Q(Dr) and Rout on the database states Dr at the boundaries of the loaded chunks.

2. If the database state Ds at the boundary of the most recently loaded chunk can be used to infer that the correct database state may have been crossed, the second phase is initiated.
In the second phase, Geometric Chunking performs a binary search on the database states within the most recently loaded chunk.

We prove that the efficiency of Geometric Chunking (GC) is within a logarithmic factor of the optimal algorithm (which would need to make lucky guesses). Geometric Chunking is not only theoretically elegant, it is also practically efficient. This is achieved by making effective use of the capabilities of modern database management systems, including materialized views and view maintenance, dynamic indexes on derived tables, and stored procedures.

1.3 Contributions

In this work, our contributions range from the conceptual to the algorithmic, and from the analytical to the experimental.

Our first contribution is conceptual: we formulate the technical problem of Query Validation on a database log, and argue that it is a core problem that arises in many applications.

Our second contribution is algorithmic: we devise the Geometric Chunking algorithm (described above) to elegantly and efficiently solve our problem.

Our third contribution is analytical: we build cost models for a family of candidate algorithms to solve the Query Validation problem, and prove that Geometric Chunking is within a logarithmic factor of the "lucky guess" optimal algorithm. We also show that using indexing on the log cannot improve by more than a logarithmic factor over Geometric Chunking.

Our fourth and final contribution is experimental: we carry out a large variety of experiments on benchmark TPC-H data and suitable queries to demonstrate the considerable advantages of Geometric Chunking over its competitors.

Chapter 2
Background and Problem Statement

No problem can withstand the assault of sustained thinking — Voltaire

We consider a transaction log L corresponding to a streaming warehouse, where transactions consist only of append operations. One practical scenario of this setting is discussed in the context of update scheduling in streaming data warehouses [11], where external sources push append-only data streams into the warehouse at a range of interarrival times. In general, temporal data involving tables with transaction-time support tends to be append-only as well.

Furthermore, we assume that transactions leave the database state consistent w.r.t. applicable integrity constraints. Specifically, if there is a referential integrity constraint (RIC) from relation R to S on key attributes K, then whenever a transaction inserts a tuple t into R, either S already contains a tuple s with s[K] = t[K], or such a tuple is added as part of the same transaction (in this case, K is a foreign key (FK) of R and the primary key (PK) of S). The transaction log associates a timestamp with every inserted tuple, reflecting the time at which the tuple was added to the warehouse. All tuples added in a transaction are assigned the same timestamp. Positions on a log L can be associated with corresponding database states: we denote the state associated with position r as Dr. We refer to states associated with consecutive integers as consecutive states.

Given a SQL query Q, a transaction log L, and an output relation Rout, we say that Q is valid w.r.t. L provided there exists a position r on L such that Q(Dr) = Rout. If there is no such position, we say Q is invalid. In practice, we may encounter a set of candidate queries Q along with a log L and an output relation Rout, and may want to identify which of the queries in Q are valid (resp., invalid).
In this work, we consider monotone SQL (Structured Query Language) queries, focusing on queries involving no aggregation. In database theory, a monotone query is defined as a query that does not lose any tuples that it previously produced [1]. Formally, a query Q over a schema D is monotone iff for every pair of instances I, J of D, I ⊆ J ⇒ Q(I) ⊆ Q(J).

Furthermore, we make the following assumptions: (i) the first state of the log is empty; (ii) the queries considered return an empty result when evaluated on an empty database; (iii) they return a superset of Rout when evaluated on the last state of the log. These assumptions are natural and not restrictive. In particular, Assumption (iii) can be verified by evaluating the query on the last state of the log: notice that any query not satisfying this property cannot be valid. (In practice, we may of course face a workload not satisfying this assumption; in Section 5, we evaluate our algorithms on such workloads.) We further assume, w.l.o.g., that the queries we consider have an output schema that is compatible with the given output relation Rout.

A key computational question we are interested in is determining the (in)validity of a query w.r.t. a given output relation and a transaction log. We make use of the notion of a witness as a certificate of (in)validity of a given query. If a query is valid, any state at which it exactly returns the given output relation can be regarded as a witness. If it is invalid, the witness we produce must offer a concise certificate that there is no state in the log at which the query will exactly produce the output relation. This is formalized below.

Definition 1 (Witness). Given a transaction log L, an output relation Rout, and a valid query Q, a witness of validity of Q is a state Dr with the property that Q(Dr) \ Rout and Rout \ Q(Dr) are both empty. For an invalid query Q, the witness of invalidity consists of a pair of consecutive states, say Dr and Dr+1, such that Rout \ Q(Dr) is non-empty and Q(Dr+1) \ Rout is non-empty.

Q1: SELECT A, B FROM R WHERE C=1 AND B>=2
Q2: SELECT A, B FROM R WHERE C=1 AND B>=3

      R                     Rout
RID  A  B  C              A  B
t1   1  2  1              1  3
t2   1  3  1              2  3
t3   2  3  2
t4   2  3  1

Figure 2.1: Example database state and queries.

Notice that the second condition occurs, in particular, when we have two adjacent states r and r+1 such that Q(Dr) is a subset of Rout and Q(Dr+1) has additional tuples not in Rout (in particular, when Q(Dr+1) is a superset of Rout).

The definition of witness for a valid query is equivalent to the condition Q(Dr) = Rout. In this case, the state Dr is referred to as a Right Database State (RDS). Notice that for a valid query, there may be multiple right database states. From the monotonicity of Q and the append-only nature of the warehouse, it follows that whenever Dr and Ds are RDS's for a query Q, every state in between Dr and Ds must be an RDS for Q too. In other words, the set of RDS's for a query forms an interval over timestamps.

For an invalid query, by definition, there is no state at which Q returns exactly the output relation Rout. Since the warehouse is append-only and Q is monotone, it follows that there must be a pair of consecutive states Dr and Dr+1 such that Q(Dr) misses some tuples in Rout and Q(Dr+1) contains some spurious tuples not present in Rout. This captures the following "check-mate" intuition: no state to the past of Dr could lead to the result Rout, since on those states Q will produce a subset of Q(Dr); and no state to the future of Dr+1 helps either, since Q will produce a superset of Q(Dr+1) on those states. In view of our assumption above, the witness of an invalid query is well defined.
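Concretely, both witness conditions of Definition 1 reduce to two set differences between a query result and Rout. The following is a minimal sketch of that check, assuming Q(Dr), Q(Dr+1) and Rout are available in memory as sets of tuples; the function and argument names are hypothetical and are not part of our system.

def witness_kind(q_r, q_r_next, r_out):
    """q_r = Q(Dr), q_r_next = Q(Dr+1), r_out = Rout, all sets of tuples."""
    if not (q_r - r_out) and not (r_out - q_r):
        return ("valid", "Dr")              # Dr is a witness of validity
    if (r_out - q_r) and (q_r_next - r_out):
        return ("invalid", "(Dr, Dr+1)")    # consecutive-state witness of invalidity
    return ("undecided", None)              # need to look elsewhere in the log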
The next example illustrates witness states.

Example 1 (Witness States). Consider the database consisting of one relation R and the queries Q1 and Q2 shown in Figure 2.1. Consider two different transaction logs. In the first log, L1, tuples are added into R in the order shown, one at a time: R0 = ∅, R1 = {t1}, ..., R4 = {t1, ..., t4}. In the second log, L2, R0 = ∅, tuples t2 and t3 are added at timestamp 1, and tuples t1 and t4 are added at timestamp 2, i.e., R1 = {t2, t3} and R2 = R1 ∪ {t1, t4} = {t1, t2, t3, t4}. (For simplicity of notation, we use Ri to denote the database state at position i in either log L1 or L2; the intended log should be clear from the context.)

First, consider the log L1. On this log, Q2(R4) = Rout, and R4 is the only state at which this is true. Clearly, it is a witness to the validity of Q2. Now, consider Q1. Notice that Q1(R0) = ∅ and Q1(R1) = {(1,2)}, and thus Rout \ Q1(R0) ≠ ∅ ≠ Q1(R1) \ Rout. Therefore, (R0, R1) provides the witness for invalidity of Q1 on this log.

Next, consider the log L2. On this log, Q2(R2) = Rout, and R2 is the only state at which this is true. Clearly, it is a witness to the validity of Q2. Now, consider Q1. Notice that Q1(R1) = {(1,3)} and Q1(R2) = {(1,2),(1,3),(2,3)}. Since Rout \ Q1(R1) ≠ ∅ ≠ Q1(R2) \ Rout, the pair of states (R1, R2) from the log L2 forms the witness for the invalidity of Q1.

The formal statement of the problem studied is as follows.

Problem 1 (Problem Studied). Given a transaction log L, an output relation Rout, and a workload consisting of a set of queries Q, find a witness for every query in Q, i.e., for every valid query Q ∈ Q, find a state Dr such that Q(Dr) = Rout, and for every invalid query Q ∈ Q, find a pair of states Dr, Dr+1 such that Rout \ Q(Dr) ≠ ∅ ≠ Q(Dr+1) \ Rout.

Chapter 3
Solution Strategies

The solution often turns out more beautiful than the problem — Richard Dawkins

In this section, we present strategies to solve Problem 1. We first consider and explain some baseline strategies inspired by previous work, and then suggest improvements in Section 3.2, where we discuss the Static Chunking strategy. Next, we discuss our proposed approach, Geometric Chunking, in Section 3.3. In Section 3.4, we consider the use of a log index for further speeding up Geometric Chunking.

The intuition behind our approach is the following. If a query Q evaluated on the database state Dr at timestamp r produces a proper subset of Rout (i.e., Rout \ Q(Dr) ≠ ∅ and Q(Dr) \ Rout = ∅), we know that we should move forward in the log: some tuples of Rout are missing from Q(Dr), and by monotonicity they can only appear at later timestamps. Similarly, if Rout is a proper subset of Q(Dr) (i.e., Q(Dr) \ Rout ≠ ∅ and Rout \ Q(Dr) = ∅), we need to evaluate Q on states at timestamps < r. However, if Rout \ Q(Dr) ≠ ∅ and Q(Dr+1) \ Rout ≠ ∅, then we can provide (Dr, Dr+1) as the witness of invalidity: for states prior to Dr we can only obtain subsets of Rout, and for states Dτ beyond Dr+1, Q(Dτ) will always contain spurious tuples not present in Rout, because Q(Dτ) \ Rout can never become empty again. In fact, Q(Dr) \ Rout ≠ ∅ ≠ Rout \ Q(Dr) is already sufficient to provide the witness (Dr, Dr+1): by the same monotonicity argument, Q cannot produce exactly Rout at any timestamp τ > r, since Q(Dτ) \ Rout never becomes empty.
From here on, evaluating Rout \ Q(Dr) and Q(Dr) \ Rout will be collectively referred to as a probe on database state Dr.

3.1 Baseline Strategies

To the best of our knowledge, there is only one approach previously discussed in the literature, by Tran et al., that is relevant to our problem. We refer to it as the Backward Naïve strategy. Originally, it was implemented to solve a more general problem where updates involving both additions and deletions to the log are considered. Hence, it is unable to take advantage of monotonicity and cannot "skip" ahead. It is forced to progress through the database states sequentially, probing at each timestamp until the Right Database State (RDS) is found.

3.1.1 Backward Naïve strategy

Given a sequence of database states D1, D2, ..., Dlast−1, Dlast, we start with the most recent state, i.e., Dlast, and do a backward linear scan, probing each database state until an RDS is found or the query is found to be invalid. We load the entire data up to the most recent timestamp last and probe Dlast. If Q(Dlast) ⊂ Rout, assumption (iii) in Section 2 is violated, making Q invalid. Suppose Q(Dlast) ⊇ Rout. If we encounter a state Dr such that Q(Dr) = Rout, we return Dr as a witness of validity. Otherwise, let Dr be the first state (going backward) at which Rout \ Q(Dr) is non-empty. Then we can return (Dr, Dr+1) as a witness, since by construction Q(Dr+1) \ Rout is non-empty.

Notice that we can optimize this by maintaining a view on Q. This reuses the computation done for query evaluation on the previous database state, instead of computing the query from scratch each time.

3.1.2 Forward Naïve strategy

Here, we do a forward linear scan one database state at a time, starting from the beginning of the log, i.e., state D1. Let Dr be the current state. If Rout \ Q(Dr) is non-empty, we keep probing the next state until one of the following happens:

(i) Q(Dr) = Rout — in this case, we return Dr as a witness of validity of Q;

(ii) Rout \ Q(Dr) as well as Q(Dr) \ Rout is non-empty — in this case, as explained before, we return (Dr, Dr+1) as a witness of invalidity of Q;

(iii) Rout \ Q(Dr) is non-empty and Q(Dr) \ Rout is empty, but Q(Dr+1) \ Rout is non-empty — again, we return (Dr, Dr+1) as a witness of invalidity of Q.

Notice that in Forward Naïve, in general, we do not have to load the entire database log. Rather, we load one state at a time as we progress forward. Here, loading a database state means loading the additional tuples corresponding to the transaction occurring after the previous state was loaded. A sketch of this scan appears below.
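The following is a minimal sketch of the Forward Naïve scan in terms of the two set differences used by a probe. It is illustrative only, with hypothetical names (q_of(r) stands for evaluating Q on Dr), and is not the actual Java implementation used in our experiments.

def forward_naive(q_of, r_out, last_ts):
    """Scan states D1..Dlast forward; return a witness of (in)validity."""
    for r in range(1, last_ts + 1):
        cur = q_of(r)                        # Q(Dr), as a set of tuples
        missing, spurious = r_out - cur, cur - r_out
        if not missing and not spurious:
            return ("valid", r)              # case (i): Dr is an RDS
        if spurious:
            # case (ii): both differences non-empty at Dr -> witness (Dr, Dr+1)
            # case (iii): D(r-1) was a proper subset, Dr has spurious tuples -> (D(r-1), Dr)
            return ("invalid", (r, r + 1) if missing else (r - 1, r))
    return ("invalid", None)                 # assumption (iii) violated: Q never covers Rout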
3.2 Static Chunking Strategy

Probing every database state is wasteful and computationally expensive. We can reduce the number of probes we perform by carefully choosing the next point in the log to probe. The idea is to pick the next point in the log based on the previous comparison between Rout and Q(Dr). Suppose that at timestamp r, Rout is a proper superset of Q(Dr); then, from our assumption of monotonicity (discussed in Section 2), we know that an RDS can only occur at timestamps > r. Instead of sequentially probing every timestamp > r, we can "skip" ahead in the log.

For this purpose, we partition the log L into chunks. We create chunks on timestamps, such that new tuples being added to the data warehouse only touch new chunks. Let the size of our chunk be B and let Dr be the most recently probed state. Next, we load a chunk of size B and probe at the end of the chunk, i.e., at Dr+B. Now, if Rout is a proper subset of Q(Dr+B), we know that an RDS can only lie between r and r+B, and we can perform binary search on this chunk (as outlined in Procedure BinarySearch). However, if at timestamp r+B, Rout is a superset of Q(Dr+B), we know that we should keep progressing forward in the log, and we choose our next probe at r+2B, and so on. The advantage here is that we save on the expensive probes at every timestamp between r and r+B. We further optimize this strategy by materializing a view for Q, as discussed in [7]. Instead of computing Q(Dr+B) from scratch, we incrementally maintain it from the previous computation Q(Dr).

Procedure Probe defines performing a probe on the state Dr. We evaluate two conditions in the probe:

1. Q(Dr) \ Rout
2. Rout \ Q(Dr)

This leads to the following four possibilities (a short sketch of this classification follows the list):

• Both 1 and 2 are empty: the query is valid, since Q(Dr) = Rout.
• Both 1 and 2 are non-empty: the query is invalid, as explained previously.
• 1 is empty and 2 is non-empty: proceed forward in the log, since Rout has extra tuples not present in Q(Dr).
• 1 is non-empty and 2 is empty: perform a binary search in the current partition, since Rout is a proper subset of Q(Dr).
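As a minimal sketch (with hypothetical names, mirroring the return values used by Procedure Probe below), this classification can be written as:

def classify_probe(q_d, r_out):
    """Classify the probe of Q(Dr) (q_d) against Rout (r_out), both sets of tuples."""
    spurious = bool(q_d - r_out)    # condition 1: Q(Dr) \ Rout
    missing = bool(r_out - q_d)     # condition 2: Rout \ Q(Dr)
    if not spurious and not missing:
        return "VALID"
    if spurious and missing:
        return "INVALID"
    if missing:
        return "FORWARD"            # load and probe the next chunk
    return "BACKWARD"               # binary search inside the current chunk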
We provide the pseudo-code for our approach in Algorithm 1. Given a workload of queries Q, we maintain a view for each query Q ∈ Q. We load a chunk once and probe at the end of the chunk (Dend) for all the candidate queries in Q. If the Probe function returns "VALID", we return the query Q and the witness witQ, which is a single state, namely Dend. If the Probe function returns "INVALID", we return (Q, 〈Dend, Dend+1〉), where Q is the invalid query and the witness witQ is a pair of states. If the probe returns "FORWARD", we know that a possible RDS lies in the future, and we load another chunk. Similarly, if the probe returns "BACKWARD", we know that the possible RDS lies somewhere in the current chunk, and we perform a binary search on the current chunk defined by (start, end). The views for queries that receive a witness of (in)validity are dropped and those queries are removed from Q. The remaining queries proceed to the next chunk.

Please note that the pseudo-code is common to both the Static Chunking and Geometric Chunking strategies. The difference lies in the way the chunking is performed, which is captured by the set of intervals CHUNKS defined in Equation 3.1 for Static Chunking and Equation 3.2 for GC.

For the static chunking strategy, each chunk is of the same size. Let B be the size of the chunk; then CHUNKS is the set of intervals

CHUNKS = { [1, B], [B+1, 2B], [2B+1, 3B], [3B+1, 4B], ..., [(k−1)B+1, kB] }    (3.1)

where kB is greater than or equal to the last timestamp of the log. Clearly, static chunking requires ⌈RDS/B⌉ chunk loads to search for the RDS and, in the worst case, log2(B) binary probes.

Loading a chunk is captured in Procedure LoadNextChunk. Each chunk is a row of the matrix CHUNKS (defined in Equation 3.1). The first timestamp of the chunk is the first element of the row; similarly, the end of the chunk is the second element of the row. These are called start and end, respectively, in the pseudo-code.

Procedure BinarySearch describes the binary search that takes place within a chunk. Here, low is the beginning of the chunk and high is the last timestamp of the chunk. Recall that we already probed at high as part of the end-of-chunk probe. We now probe at the midpoint mid of the chunk. If Q(Dmid) is a proper subset of Rout, then we perform the next binary search in the right half, i.e., between mid+1 and high. Else, if Q(Dmid) is a proper superset of Rout, then we perform the next binary search in the left half, i.e., between low and mid−1. If both Q(Dmid) \ Rout and Rout \ Q(Dmid) are non-empty, we can provide a witness of invalidity for query Q, i.e., witQ = 〈Dmid, Dmid+1〉. Note that if we reach two adjacent states Dlow, Dhigh (where low+1 = high) such that Q(Dlow) is a proper subset of Rout and Q(Dhigh) contains a tuple not in Rout, then the witness of invalidity is the pair of states (Dlow, Dhigh).

Algorithm 1: Chunking Strategy
Input: Log L, result table Rout, workload Q
Output: (Q, witQ)
Initialization: Init()
  if LoadChunk = true then
      LoadNextChunk(ch)
      LoadChunk ← false
  else
      foreach Q ∈ Q do
          if Probe(Q, Rout, Dend) = "VALID" then
              drop view for Q and remove Q from Q
              return (Q, Dend)
          else if Probe(Q, Rout, Dend) = "INVALID" then
              drop view for Q and remove Q from Q
              return (Q, 〈Dend, Dend+1〉)
          else if Probe(Q, Rout, Dend) = "FORWARD" then
              LoadChunk ← true
              ch ← ch + 1                      ▷ load the next chunk
          else
              BinarySearch(start, end)         ▷ binary search within the current partition

Procedure Init()
  Create empty tables
  Initialize CHUNKS                            ▷ as per Equation 3.1 or 3.2
  LoadChunk ← false
  ch ← 0
  LoadNextChunk(ch)

Procedure LoadNextChunk(ch)
  start ← CHUNKS[ch][0]                        ▷ first timestamp of the chunk
  end ← CHUNKS[ch][1]                          ▷ last timestamp of the chunk
  Load tuples between start and end for all relations

Procedure Probe(Q, Rout, p)
  if Q(p) \ Rout = ∅ AND Rout \ Q(p) = ∅ then
      return "VALID"
  else if Q(p) \ Rout ≠ ∅ AND Rout \ Q(p) ≠ ∅ then
      return "INVALID"
  else if Q(p) \ Rout = ∅ AND Rout \ Q(p) ≠ ∅ then
      return "FORWARD"
  else
      return "BACKWARD"

Procedure BinarySearch(low, high)
  while low ≤ high do
      if high = low + 1 then
          drop view for Q and remove Q from Q
          return (Q, 〈Dlow, Dhigh〉)
      else
          mid ← (low + high) / 2
          if Probe(Q, Rout, Dmid) = "VALID" then
              drop view for Q and remove Q from Q
              return (Q, Dmid)
          else if Probe(Q, Rout, Dmid) = "INVALID" then
              drop view for Q and remove Q from Q
              return (Q, 〈Dmid, Dmid+1〉)
          else if Probe(Q, Rout, Dmid) = "FORWARD" then
              low ← mid + 1
          else
              high ← mid − 1

3.3 Geometric Chunking Strategy

In the Geometric Chunking strategy, we geometrically increase each chunk size by a constant factor c. Let B be the size of the first chunk (referred to as the base chunk size). We assume the default value c = 2 (we have experiments exploring other values of c as well). Then CHUNKS is the set of intervals

CHUNKS = { [1, B], [B+1, B+2B], [B+2B+1, B+2B+4B], ..., [(2^(k−1)−1)B+1, (2^k−1)B] }    (3.2)

where (2^k−1)B is greater than or equal to the last timestamp of the log.

Please note that the procedure to provide a witness of (in)validity is the same as before. The only difference is that in GC the chunk size grows geometrically, allowing us to quickly reach the proximity of the RDS. Let m = ⌈RDS/B⌉ and let [(2^(k−1)−1)B+1, (2^k−1)B] be the chunk over which we perform binary search. Then GC requires log2(m) chunk loads to search for the RDS. Once we load the chunk containing the possible RDS, a binary search (described in Procedure BinarySearch) is performed as before. Here, the chunk over which we perform binary search has size 2^(k−1)·B, and therefore GC performs (k−1) + log2(B) binary probes in the worst case, since log2(2^(k−1)·B) = (k−1) + log2(B).
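The chunk boundaries of Equation 3.2 (and, with c = 1, the uniform chunks of Equation 3.1) can be generated with a few lines of code. The sketch below is illustrative, with hypothetical names; with base = B and c = 2 it yields [1, B], [B+1, 3B], [3B+1, 7B], ... as in Equation 3.2.

def chunk_intervals(base, c, last_ts):
    """Yield (start, end) chunk boundaries until the last log timestamp is covered."""
    start, size = 1, base
    while start <= last_ts:
        end = min(start + size - 1, last_ts)
        yield (start, end)
        start, size = end + 1, int(size * c)   # grow the next chunk geometrically

For example, list(chunk_intervals(10_000, 2, 11_000_000)) produces the chunks used by GC with the default base chunk size of 10k on the 11-million-timestamp log of Section 5.2.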
3.4 Log Index

To speed up geometric chunking further, we could construct a log index. The idea is that for each value v of each attribute A of each relation R in the database, we store the earliest timestamp τ^A_v at which any tuple with A = v was inserted into R according to the log. Suppose query Q involves relation R and R.A is one of the projected attributes of Q. If Rout contains a tuple t with t.A = v, then obviously Q(Dr) ≠ Rout for all r < τ^A_v. We can generalize this intuition to a conjunction of attributes: e.g., for a set of attributes X ⊆ schema(R), it is easy to see that

Q(Dr) ≠ Rout,  ∀ r < max_{A∈X} τ^A_{t[A]}.

Finally, we can also show that

Q(Dr) ≠ Rout,  ∀ r < max_{t∈Rout} max_{A∈X} τ^A_{t[A]},

where τ^A_{t[A]} denotes the earliest timestamp at which any tuple with A = t[A] was added into relation R according to the transaction log. Notice that another max can be applied over all relations in Q. More precisely, for each tuple in Rout we can find the latest timestamp at which a contributing tuple in any relation R participating in Q was added to the database, and then take the max of this quantity over all tuples in Rout:

Q(Dr) ≠ Rout,  ∀ r < max_{t∈Rout} max_{R∈Q} max_{A∈X} τ^A_{t[A]}.

Of course, one could also build multidimensional indices on sets of attributes and exploit them to prune probes on early states of the transaction log.
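A minimal sketch of computing this pruning bound from a pre-built log index is given below. For illustration only, it assumes that every projected attribute of Rout can be traced back to a single base relation and that the index is an in-memory dictionary (earliest_ts); the names are hypothetical and this is a simplification of the index described above.

def log_index_bound(r_out, projected, earliest_ts):
    """Return the largest position before which Q(Dr) cannot equal Rout.

    r_out:        list of output tuples, each a dict {attr: value}
    projected:    {relation: [attributes of that relation projected into Rout]}
    earliest_ts:  {(relation, attr, value): earliest insertion timestamp}
    """
    bound = 0
    for t in r_out:
        for rel, attrs in projected.items():
            for a in attrs:
                bound = max(bound, earliest_ts.get((rel, a, t[a]), 0))
    return bound        # GC may skip all probes at positions r < bound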
Chapter 4
Technical Results

I've always believed that if you put in the work, the results will come — Michael Jordan

In this section, we establish our main technical results. Let L be a transaction log. Recall that for a valid query Q, an RDS corresponds to a log position r such that Q(Dr) = Rout. As seen in Section 2, a valid query may have multiple Right Database States. Without ambiguity, we denote any RDS as DRDS.

4.1 Assumptions and Cost Model

In this section, we assume that the available memory is adequate for probing any one database state. This means not only that the database state D being probed and Rout fit in available memory, but also that any intermediate results of evaluating Q(D) fit in memory. Thus, the cost model we employ is a main-memory model. In our experiments (Section 5), we relax this assumption and compare the different strategies under limited memory. In this section, we consider polynomial-time computable selection-projection-join-union (SPJU) queries.

4.2 Basic Operations

We establish some terminology to help compare different strategies for query validation. A strategy needs to load some database state, either in whole, a timestamp at a time, or a chunk at a time, and needs to evaluate the query being validated on the loaded state. Once the query is evaluated, the result needs to be compared with Rout. Here, we refer to query evaluation together with this comparison as a probe. Both the static and geometric chunking strategies consist of a Linear Search (LS) phase and a Binary Search (BS) phase, each potentially consisting of several probes. During LS, whenever a probe is found to result in a proper subset of Rout, the next chunk is loaded and probed. If a probe on a state Dr results in a proper superset of Rout, the strategy proceeds to the BS phase, to search whether there is a previous state Ds, s < r, at which Q(Ds) = Rout. Clearly, the binary search involves a number of probes logarithmic in the size of Dr. Notice that the cost of each probe during BS is upper bounded by that of the probe on Dr. Finally, an orthogonal point is that any of the strategies may employ materialized views to speed up the probes.

4.3 The Oracle

We propose a reference oracle strategy for the sake of comparison and calibration of the various algorithms. Given a valid query Q, the oracle knows the position RDS of a right database state in the log, loads the data for the state DRDS corresponding to this RDS, probes it with the query Q, compares the result with Rout, and produces the certificate that Q(DRDS) − Rout and Rout − Q(DRDS) are empty, thus validating Q at DRDS. Notice that the oracle simply loads one chunk of data and then evaluates the query; it does not involve any binary search. It thus acts as a bar for the most efficient possible strategy among those considered.

4.4 Overview of Results

We consider two different settings for query validation: (i) when no materialized views are used, and (ii) when materialized views are used and are maintained (incrementally) by the underlying DBMS. For both settings, under certain assumptions about the cost model which we make precise shortly, we establish our results on the cost of the static and geometric chunking strategies. Specifically, we show that the proposed geometric chunking strategy incurs a cost that is no more than a factor O(log |DRDS|) times that of the oracle. By contrast, we show that the static chunking strategy can incur linearly more cost than the oracle.

Finally, we consider a possible additional optimization for query validation on top of geometric chunking. We can create a log index which keeps track of the earliest timestamp at which a tuple with a certain attribute value is added to each relation. Such an index can be used to avoid wasteful probes. We discuss the utility of such a log index and show that the cost saving achieved by a log index over a geometric chunking strategy without such an index is bounded by a similar logarithmic factor.

Unless otherwise mentioned, all logarithms are base 2. While GC permits expanding the chunk size using any factor c ≥ 1, w.l.o.g. we assume the default value c = 2. Other values of c are explored in our experiments. Notice that GC with c = 1 is equivalent to the static chunking strategy.

4.5 No Views

In this section, we consider a query evaluation model that does not make use of materialized views. This means that whenever a query needs to be evaluated on a different database state in the log, the query is evaluated on that state from scratch. We start with a simple observation.

Observation 1. Let Q be a monotone SQL query and let Dr, Ds be any two database states from the log such that r ≤ s. Then, under any plan p chosen by the query optimizer, the cost of evaluating Q on Dr under p is no more than that of evaluating Q on Ds under p.

Proof. Since we consider a streaming warehouse setting, our logs are append-only and thus Dr ⊆ Ds. Since Q is monotone, we have Q(Dr) ⊆ Q(Ds).
It follows that under any given plan for Q, obtaining the result Q(Ds) involves at least as much work as obtaining the result Q(Dr).

4.5.1 Geometric Chunking

One of the assumptions we make in this section is that the query optimizer uses a consistent plan for evaluating queries regardless of the size of the database state. For a database state D, we let |D| denote its size in number of tuples.

Theorem 1. Let Q be any valid monotone SQL query that is computable in PTIME and let DRDS be the database state corresponding to an RDS w.r.t. query Q. Then the cost of the geometric chunking strategy for validating Q using the log is at most O(log |DRDS|) times the cost of the oracle validating Q at the database state DRDS.

Proof. Suppose that on any state D, Q(D) can be evaluated in time O(|D|^k), where k is some constant. Then O(|DRDS|^k) is the cost incurred by the oracle for validating Q w.r.t. DRDS. The geometric chunking strategy performs a logarithmic number of probes, i.e., ⌈log |DRDS|⌉ probes, during the LS phase. In the worst case, the (⌈log |DRDS|⌉ − 1)-th probe could have just "missed" the correct position of the RDS, with the result that the ⌈log |DRDS|⌉-th probe corresponds to a database state Ds such that |Ds| ≤ 2|DRDS|. Clearly, the cost of this last probe dominates the cost of all preceding probes, by virtue of Observation 1. Next, BS may involve a number of additional probes equal to log |Ds| = O(log |DRDS|). The cost of every probe is upper bounded by O(|Ds|^k) = O(2^k |DRDS|^k) = O(|DRDS|^k), since k is a constant. Putting these together, it follows that the cost of the geometric chunking strategy validating Q using the transaction log is at most O(⌈log |DRDS|⌉ · |DRDS|^k). The theorem follows.

4.5.2 Log Index

We next establish an upper bound on the savings that can be achieved using any log index, no matter how complex or sophisticated.

Let Q be a valid query. Recall our discussion of the Log Index in Section 3.4. For any tuple in Rout, we take the maximum timestamp at which a contributing tuple from a relation R participating in Q was added to the log, and then take the maximum over all tuples in Rout. This quantity is denoted LI(Rout, Q). It is the position in the transaction log such that probes on all states prior to LI(Rout, Q) can be pruned. We now have the following:

Theorem 2. The cost of the geometric chunking strategy without a log index is at most O(log |DLI(Rout,Q)|) times that of geometric chunking with a log index.

Proof. The argument is analogous to that of Theorem 1. The key idea is to realize that GC with a log index will start at the state DLI(Rout,Q), while GC without a log index will start from the initial state. Thus, the saving of the former compared to the latter comes from avoiding the ⌈log |DLI(Rout,Q)|⌉ probes during the LS phase of GC without the log index. After this, GC without a log index and GC with a log index behave similarly. It is clear that GC without a log index has a cost that is O(log |DLI(Rout,Q)|) times that of GC with a log index.

Notice that in practice, depending on the selectivity of the log index, LI(Rout, Q) may be much smaller than RDS. Moreover, the construction and maintenance of the index involve additional overhead.

4.5.3 Static Chunking

Next, we consider the cost of the static chunking strategy relative to the oracle. Let Q be any valid monotone SQL query and let DRDS be the database state corresponding to an RDS w.r.t. query Q.
Then the cost of the static chunking strategy for validating Q using the transaction log can be O(|DRDS|) times the cost of the oracle validating Q at the database state DRDS, as we show below.

Suppose DRDS contains n tuples and that evaluating Q(DRDS) takes O(n^k) time. Let B be the size of a (static) chunk. Suppose B is a constant and let m = ⌈n/B⌉. Let Ds be the last (and largest) database state probed during the LS phase. Since B is a constant, |Ds| = O(|DRDS|). During the BS phase, an additional O(log |DRDS|) probes are performed. The total cost incurred during the LS phase of static chunking is O(B^k + (2B)^k + ... + (mB)^k) = O(B^k · Σ_{i=1..m} i^k) = O(B^k · m^{k+1}) = O(n^{k+1}), since B is a constant. The second equality follows from Faulhaber's formula [16]. Ignoring the additional cost of BS, the overall cost of static chunking is therefore O(n) times that of the oracle. Consider an instance of the problem such that evaluating Q on DRDS takes n^k time. The analysis above is tight in the sense that on this instance, the cost of static chunking can be as large as O(n^{k+1}/B), i.e., on the order of n/B times that of the oracle.

4.6 With Views

In this subsection, we consider the setting where query evaluation at different database states leverages incremental view maintenance strategies [7]. More precisely, suppose a query Q is evaluated at a state Dr and the result Q(Dr) is cached. Suppose it is now required to be evaluated at state Ds, where r ≤ s. Then Q(Ds) can be obtained from Q(Dr) by incrementally evaluating Q(Ds), given Q(Dr) and the "delta", or change, between Ds and Dr [17]. Notice that since the log is append-only, the incremental query evaluation only consists of additions and no deletions. Recall that the log associates a timestamp with every tuple, viz., the time at which the tuple was inserted into the warehouse. The algorithms we develop in this work take advantage of certain stylized views, which keep track of the timestamp of every tuple that was used in deriving an output tuple of the view. We illustrate this with an example.

Example 2 (Time-aware views). Consider the following query.

SELECT C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
FROM CUSTOMER, ORDERS, LINEITEM
WHERE O_ORDERKEY = L_ORDERKEY AND
      C_CUSTKEY = O_CUSTKEY AND C_NATIONKEY = 3

Suppose that for each relation R, R.TS denotes the timestamp attribute that indicates the time at which a particular tuple was added to relation R in the warehouse. Then the following modified query can be used to keep track of the timestamps of the participating tuples used in deriving each answer tuple of the query (view).

SELECT C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, C.TS, O.TS, L.TS
FROM CUSTOMER C, ORDERS O, LINEITEM L
WHERE O_ORDERKEY = L_ORDERKEY AND
      C_CUSTKEY = O_CUSTKEY AND C_NATIONKEY = 3

For a query Q, we denote by Q^TS the rewritten version of Q that keeps track of the timestamps of participating tuples, as illustrated above. We refer to Q^TS as the time-aware version of Q. In this work, by views we mean the time-aware views illustrated above, unless otherwise specified. We make the following assumptions.

Consider a database state Dr and a state Ds obtained by appending a set of tuples to Dr, i.e., Ds = Dr ∪ ∆D. The naive approach to query evaluation is to evaluate each of the queries Q(Dr) and Q(Ds) independently from scratch, without any cached results. The incremental evaluation approach proceeds by first evaluating Q^TS(Dr) and caching the result. It then uses incremental view maintenance techniques [7] to directly evaluate Q^TS(Ds) from the cached result Q^TS(Dr) and the "delta" tuples ∆D. Suppose that w denotes the number of relation instances appearing in Q. We assume that the cost of incremental evaluation of Q^TS on Dr and Ds is no more than w times that of naive evaluation of Q on Dr and Ds. This is a reasonable, if conservative, assumption.
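To make the incremental step concrete, the sketch below shows the standard delta rule for maintaining a two-way join view under append-only updates. It is a generic illustration with hypothetical names, not the SQL Server view-maintenance machinery actually used by our implementation.

def maintain_join_view(view, r_old, s_old, delta_r, delta_s, join_pred, combine):
    """Bring V = R join S up to date after appending delta_r to R and delta_s to S.

    Append-only delta rule: dV = (dR join (S ∪ dS)) ∪ (R join dS).
    join_pred(a, b) tests the join condition; combine(a, b) builds the output tuple.
    """
    s_new = s_old | delta_s
    for a in delta_r:                   # new R tuples join with all of S (old and new)
        for b in s_new:
            if join_pred(a, b):
                view.add(combine(a, b))
    for a in r_old:                     # old R tuples join with the new S tuples only
        for b in delta_s:
            if join_pred(a, b):
                view.add(combine(a, b))
    return view, r_old | delta_r, s_new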
Chapter 5
Experiments

Good judgement comes from experience, and experience comes from bad judgement — Rita Mae Brown

This section presents the experimental study evaluating the performance of our proposed approach, the Geometric Chunking strategy, on the TPC-H benchmark [18]. Below, we outline the key goals of the experiments.

• Demonstrate the performance gains of our proposed approach, the Geometric Chunking strategy, over the baselines (Section 5.2) and over static chunking (Section 5.3).
• Compare the performance of Geometric Chunking against the Oracle strategy (Section 5.4).
• Test the scalability of our approach under limited main memory, which may prevent the database and/or intermediate results from fitting in memory (Section 5.6).
• Vary the parameters outlined in Figure 5.5 and analyze their impact on our approach (Sections 5.5 and 5.7).

5.1 Environment and Setting

All the experiments were conducted on a machine running 64-bit Windows 7 with an Intel Core i7-6600U CPU @ 2.60GHz and 20GB RAM. Our code base on the client side is developed in Java (JDK 1.5), and on the server side we use Microsoft SQL Server 2014, Enterprise Edition. In this thesis, we present results for experiments on the TPC-H 10 GB database. Additionally, we also ran experiments on TPC-H 1GB and 0.1GB and observed the same trends. We use 23 selection-projection-join queries (with their corresponding workloads), 22 of which are TPC-H queries; one additional query, created by us, demonstrates a cyclic schema graph involving all the relations, as shown in Figure 5.3c. To meet the requirements of our approach, we use the modified TPC-H queries taken from Zhang et al.'s paper, which were created by dropping the group-by aggregates and arithmetic expressions. Additionally, we append some projection conditions derived from the original TPC-H queries. To run our experiments, we first create a database log and a query workload, elaborated in Sections 5.1.1 and 5.1.2, respectively. All running times reported in this thesis are taken as the median over 5 runs. We perform an additional first run, which is used to warm up the cache; the timing for this run is discarded.

5.1.1 Database Log Generation

Figure 5.1: TPC-H Schema

The TPC-H database has 8 relations: Lineitem, Orders, Customer, PartSupp, Part, Supplier, Nation and Region, with the schema shown in Figure 5.1. To create a database log, we start with an empty log and keep adding tuples one at a time, as per the following approach (a small sketch of this procedure appears below):

1. Randomly select a table R (out of the 8 relations) from which a tuple is to be added to the database log, say at timestamp τ.

2. Randomly select a tuple t ∈ R which has not been added to the log yet.

3. In order to satisfy the integrity constraints, check for any tuples that are referenced by the tuple R(t) to be added, via FK–PK references. Add all such referenced tuples at the timestamp τ, and then add the tuple R(t) at the same timestamp τ.

Thus, more than one tuple may be added at the same timestamp in order to satisfy the integrity constraints. Notice that there are no foreign key – primary key references from tuples in Part and Region. Thus, if R is one of these relations, only one tuple will be added to the log at the given timestamp.
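The sketch below illustrates this generation procedure under simplifying assumptions: rows are dicts, each relation has a single-attribute primary key, and refs lists each relation's foreign-key references. The names are hypothetical, and shuffling all pending rows only approximates the random table/tuple selection described above; this is not the exact generator used to build our logs.

import random

def generate_log(data, refs, pk, seed=0):
    """data: {rel: [row, ...]} with rows as dicts; refs: {rel: [(fk_attr, parent_rel)]};
    pk: {rel: pk_attr}.  Returns a list of (timestamp, rel, row) satisfying all RICs."""
    rng = random.Random(seed)
    by_pk = {rel: {row[pk[rel]]: row for row in rows} for rel, rows in data.items()}
    in_log, log = set(), []

    def insert(rel, row, ts):
        key = (rel, row[pk[rel]])
        if key in in_log:
            return
        for fk_attr, parent in refs.get(rel, []):          # referenced tuples first, same timestamp
            insert(parent, by_pk[parent][row[fk_attr]], ts)
        in_log.add(key)
        log.append((ts, rel, row))

    pending = [(rel, row) for rel, rows in data.items() for row in rows]
    rng.shuffle(pending)
    ts = 0
    for rel, row in pending:
        if (rel, row[pk[rel]]) not in in_log:
            ts += 1
            insert(rel, row, ts)
    return log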
We start from τ = 1 and continue until all the tuples have been added to the log. At the end of this process, each tuple in the database log has received a timestamp corresponding to the time it was inserted into the log.

The following example illustrates the database log generation process.

Example 3 (Database Log Generation). Assume that we want to insert a tuple from the Lineitem table at timestamp t, as shown in Figure 5.2(a). From Figure 5.1, we know that the Lineitem table has dependencies in the Orders and PartSupp tables. Therefore, the corresponding tuples from the Orders table (Figure 5.2(e)) and the PartSupp table (Figure 5.2(b)) are inserted into the log with the same timestamp t. The Orders table further has a PK–FK reference into the Customer table on CUSTKEY; the Customer relation in turn has dependencies in Nation, and Nation in Region. Similarly, the PartSupp table has dependencies in the Part and Supplier relations; Supplier has dependencies in the Nation table, which further has dependencies in the Region table. Therefore, the corresponding tuples from these relations are inserted into the log with the same timestamp, as shown in Figure 5.2.

Notice that two tuples from the Nation table are inserted with the same timestamp. This is because the tuple with N_NATIONKEY=10 is inserted due to the S_NATIONKEY dependency (from the Supplier relation), while N_NATIONKEY=17 comes from the C_NATIONKEY=17 dependency (from the Customer relation).

(a) LINEITEM:  L_ORDERKEY=158246, L_PARTKEY=173493, L_SUPPKEY=6011, TS=t
(b) PARTSUPP:  PS_PARTKEY=173493, PS_SUPPKEY=6011, TS=t
(c) PART:      P_PARTKEY=173493, TS=t
(d) SUPPLIER:  S_SUPPKEY=6011, S_NATIONKEY=10, TS=t
(e) ORDERS:    O_ORDERKEY=158246, O_CUSTKEY=85445, TS=t
(f) CUSTOMER:  C_CUSTKEY=85445, C_NATIONKEY=17, TS=t
(g) NATION:    (N_NATIONKEY=10, N_REGIONKEY=4, TS=t) and (N_NATIONKEY=17, N_REGIONKEY=1, TS=t)
(h) REGION:    (R_REGIONKEY=4, TS=t) and (R_REGIONKEY=1, TS=t)

Figure 5.2: Example showing partial tables to illustrate database log generation

To create the database log for TPC-H SF=0.1, we first generate the database with SF=1 using the dbgen tool [18], which gives us a database of size 1 GB. We randomly select 10% of the tuples from the largest relation, i.e., Lineitem, and then generate the log satisfying the integrity constraints as outlined above. For the 1 GB and 10 GB scale experiments, we use the entire tables created by running the dbgen tool (available at http://www.tpc.org/tpch/default.asp [18], visited on 24-08-2017) with SF=1 and SF=10, respectively.

5.1.2 Workload Generation

As an input to our problem, we take a set of candidate SQL queries Q, which we generate manually for the purpose of these experiments. For each query, we generate seemingly similar queries with differing constraints in the WHERE clause, differing join constraints, or differing relations participating in the joins. In our workload, we consider one superset query, one subset query and the right query.

Superset query

A superset query is a query which produces a superset of the result set at DRDS. Superset queries are typically constructed either by eliminating some conditions in the WHERE clause or by relaxing some constraint. For example, consider the following valid query:

SELECT C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
FROM CUSTOMER, ORDERS, LINEITEM
WHERE O_ORDERKEY = L_ORDERKEY AND
      C_CUSTKEY = O_CUSTKEY AND C_NATIONKEY = 3

One obvious method to obtain a superset query for the above query is to drop the C_NATIONKEY=3 constraint in the WHERE clause.
Another way of creating a superset query is to widen the selection condition, as follows:

SELECT C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
FROM CUSTOMER, ORDERS, LINEITEM, NATION, REGION
WHERE O_ORDERKEY = L_ORDERKEY AND
C_CUSTKEY = O_CUSTKEY AND
C_NATIONKEY = N_NATIONKEY AND
N_REGIONKEY = R_REGIONKEY AND R_NAME = 'AMERICA'

In the above query, we are selecting customer names with their order details for the region America. The original valid query selects customer names with their order details for the nation Canada (corresponding to C_NATIONKEY = 3), which is a subset of the wider region America. Therefore, even though we appear to be adding an extra join relation and an extra join condition, we are actually widening the selection criteria.

Typically, superset queries fail fast on early database states, as they produce spurious tuples which are not present in the result set Rout.

Subset query

A subset query is a query which produces a subset of the result set at DRDS. The attentive reader will observe that such a query violates our third assumption in Chapter 2, namely that a candidate query returns a superset of Rout when evaluated on the last state of the log. In practice, however, we relax this assumption, and we will see that our approach is able to weed out these queries as well.

To generate subset queries, we add additional constraints in the where clause, for example:

SELECT C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
FROM CUSTOMER, ORDERS, LINEITEM, NATION
WHERE O_ORDERKEY = L_ORDERKEY AND
C_CUSTKEY = O_CUSTKEY AND
C_NATIONKEY = N_NATIONKEY AND N_NATIONKEY = 3 AND
L_RECEIPTDATE < L_COMMITDATE

Subset queries take longer to fail, since the stopping condition, namely Rout \ Q(Dr) and Q(Dr) \ Rout both being non-empty, is reached only at states beyond the RDS. For r <= RDS, Rout \ Q(Dr) is non-empty but Q(Dr) \ Rout is empty, so we keep loading tuples until Q(Dr) \ Rout becomes non-empty as well. This typically involves loading an extra chunk beyond the chunk containing the RDS.

Figure 5.3: Schema graphs for the default queries: (a) Q15, (b) Q22, (c) Q23.

5.1.3 Default Parameter Configuration

The parameters that we study and their values are listed in Figure 5.5. Default values (marked in Figure 5.5) are used unless stated otherwise. In Sections 5.5 and 5.7, we vary the parameters and analyze their impact.

Figure 5.4 shows the running time for all the queries at RDS = 10 million. We plot time complexity (total running time) on the Y-axis against a measure of expression complexity (the number of relations present in the query) on the X-axis. We select three queries: the longest running query, the shortest running query, and the most complex query in terms of structure and number of relations (Q15, Q22 and Q23, respectively). We use these three queries to present the results in this thesis.
The schema graphs for these queries are provided in Figures 5.3a, 5.3b and 5.3c. Q23 (created by us and shown below) selects supplier attributes and the region for suppliers that supply orders to customers within the same nation.

SELECT S_NAME, S_ADDRESS, S_NATIONKEY, S_ACCTBAL, R_REGIONKEY
FROM PART, SUPPLIER, PARTSUPP, CUSTOMER, ORDERS,
LINEITEM, NATION N1, NATION N2, REGION
WHERE P_PARTKEY = PS_PARTKEY AND S_SUPPKEY = PS_SUPPKEY
AND S_NATIONKEY = N1.N_NATIONKEY AND
N1.N_REGIONKEY = R_REGIONKEY AND PS_PARTKEY = L_PARTKEY
AND PS_SUPPKEY = L_SUPPKEY AND L_ORDERKEY = O_ORDERKEY
AND O_CUSTKEY = C_CUSTKEY AND C_NATIONKEY = N2.N_NATIONKEY
AND N1.N_NATIONKEY = N2.N_NATIONKEY

Figure 5.4: Expression complexity (X-axis) versus time complexity (Y-axis) for all queries, for the Geometric approach at RDS = 10 million.

Figure 5.5: Parameter configuration (defaults marked); SF = scale factor.
  Database:               TPC-H SF:0.1GB, SF:1GB, SF:10GB (default)
  Queries:                Q1-Q23 (defaults: Q15, Q22, Q23)
  Strategies:             Forward Naive, Backward Naive, Static Chunking, Geometric Chunking (default)
  RDS:                    10k, 100k, 1 million, 10 million (default), 30 million, 50 million, 70 million
  Base chunk size:        1k, 10k (default), 100k, 1 million
  c:                      1.5, 2 (default), 3, 4
  Materialized views:     with (default) or without materialized views
  Memory for SQL Server:  5GB, 10GB, 15GB, 20GB (default)
  Query workload:         Right Query (default), Full Query Workload
  Log Index:              without (default) or with Log Index

              Forward Naive                   Backward Naive                  Geometric Chunking
Query  RDS     #LS  #BS  Cut-off   RDS        #LS  #BS  Cut-off   RDS        #LS  #BS  Total     RDS
                         time (s)  found?               time (s)  found?               time (s)  found?
q15    1 mil    66  n/a    173.8   No           2  n/a    173.8   No           7    6     34.7   Yes
       10 mil  202  n/a  1,277.8   No          10  n/a  1,277.8   No          10    9    255.6   Yes
q22    1 mil    94  n/a     49.1   No           2  n/a     49.1   No           7    6      9.8   Yes
       10 mil  339  n/a    228.3   No           7  n/a    228.3   No          10    9     45.6   Yes
q23    1 mil    35  n/a    208.7   No           2  n/a    208.7   No           7    6     41.7   Yes
       10 mil   49  n/a  1,032.6   No           9  n/a  1,032.6   No          10    9    206.5   Yes

Table 5.1: Comparison of Geometric Chunking with the Naive approaches at RDS = 1 million and 10 million. Last TS of the log: 11,000,000.

5.2 Comparison with baselines

Table 5.1 shows the comparison between the Forward Naive, Backward Naive and Geometric Chunking approaches. To keep the running time tractable, we run this experiment on a log with timestamps up to 11 million instead of the entire database log (≈ 82 million). We also cut off the running time for the naive approaches at 5 times the time taken by Geometric Chunking for the same configuration. The experiment is conducted on Q15, Q22 and Q23 for two RDS values, one near the beginning of the log (1 million) and the other towards the end of the log (10 million). We observe that Forward Naive performs more linear scans (#LS) than Backward Naive. This is because Backward Naive loads the entire data up to the last timestamp of the log, whereas Forward Naive loads data as it goes. Clearly, for an RDS at 1 million, loading the entire data up to 11 million is wasteful.

Forward Naive has the best shot at finding the RDS for RDS = 1 million, and similarly Backward Naive for RDS = 10 million, so we examine these two scenarios. Forward Naive for RDS at 1 million is, in the best case, able to traverse 94 database states and reach r = 94 before hitting the cut-off. Backward Naive for RDS at 10 million is, in the best case, able to traverse 10 database states and reach r = 10,999,991. Clearly, both the Forward and Backward Naive approaches are never able to find the RDS, and they fail to scale.
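Each linear scan (#LS) counted above corresponds to evaluating the two set differences Rout \ Q(Dr) and Q(Dr) \ Rout at a database state Dr. As a minimal illustration (assuming the stored result Rout is kept in a table ROUT and the candidate query's result over the currently loaded data is exposed as a view V_Q with the same projected attributes; both names are hypothetical), this check amounts to two EXCEPT queries in SQL Server:

-- Tuples of Rout that the candidate has not produced yet at state Dr
SELECT * FROM ROUT
EXCEPT
SELECT * FROM V_Q

-- Spurious tuples produced by the candidate at Dr but absent from Rout
SELECT * FROM V_Q
EXCEPT
SELECT * FROM ROUT

Both results being empty certifies Dr as a witness of validity, while both being non-empty triggers the invalidity stopping condition discussed in Section 5.1.2.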
5.3 Comparison with Static Chunking

In this experiment, we evaluate the performance of static chunking for three chunk sizes (10k, 1.28 million and 81.92 million) against the Geometric Chunking approach with base chunk size = 10k and c = 2. These static chunk sizes represent a good mix of a small chunk (10k), a medium-sized chunk (1.28 million) and a very large chunk (≈ 82 million). We show the comparison across a range of RDS values, taken from {10,000; 100,000; 1,000,000; 10,000,000; 30,000,000; 50,000,000; 70,000,000}, for the default queries. The static chunking approach is allowed twice the maximum running time taken by GC across all RDS values; we refer to this as the capping time.

Figures 5.9, 5.10 and 5.11 depict the ratio of the running time of each approach to the minimum time taken by any approach for that RDS. Broken lines indicate that the approach was unable to finish within the capping time. We observe that the GC approach lies within 3 times of any static chunk size.

Figures 5.6, 5.7 and 5.8 depict the absolute running times (in ms), which are a combination of loading time, linear time and binary time, for Q15, Q22 and Q23 respectively. Loading time is the time it takes to load a chunk into SQL Server's memory. Linear time is the time taken for evaluating Rout \ Q and Q \ Rout at the end of each chunk. Binary time is the time taken for performing binary search within a chunk to find the RDS.

To understand the trends in these plots, we look at the chunk boundaries for each of the strategies below.

Geometric chunking:
[0, 10000, 30000, 70000, 150000, 310000, 630000, 1270000, 2550000, 5110000, 10230000, 20470000, 40950000, 81910000, 163830000]

Static chunking (10k):
[0, 10000, 20000, 30000, 40000, 50000, 60000, ..., 70000000]

Static chunking (1.28 million):
[1280000, 2560000, 3840000, 5120000, 6400000, ..., 83200000]

Static chunking (81.92 million):
[0, 81920000, 163840000]
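For reference, the geometric boundaries listed above follow a simple closed form: with base chunk size b and multiplier c, the i-th chunk has size b * c^(i-1), so the i-th boundary falls at b * (c^i - 1)/(c - 1). With the default b = 10,000 and c = 2 this gives boundaries of 10,000 * (2^i - 1), i.e. 10,000, 30,000, 70,000, ..., 81,910,000, 163,830,000. The static boundaries, by contrast, are simply multiples of the fixed chunk size.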
From Figures 5.6 and 5.9, we observe that for an RDS value of 10k, both the static chunking strategy with chunk size 10k and GC perform well. This is because both require one chunk load (to load the data up to 10k), two evaluations at the end of the chunk, and no binary search. For static chunk sizes of 1.28 million and 81.92 million, we do extra work in loading the data up to 1.28 million and 81.92 million respectively and then performing binary search. The largest static chunk size of 81.92 million performs up to 400 times worse than GC for the smallest RDS value of 10k. This is tantamount to loading everything in one chunk and then performing a binary search over the entire data. For small RDS values this has two disadvantages: first, the time taken to load tuples up to 81.92 million is high; second, performing binary search on a large chunk takes more time.

For the next RDS value of 100k, GC requires 4 chunk loads (up to 150,000), linear evaluations at the end of each chunk, and 3 binary searches. Static chunking with chunk size 10k requires 10 chunk loads but no binary searches, so its ratio is still comparable. Static chunking with 1.28 million requires one chunk load (up to 1.28 million) and 6 binary searches. For the largest chunk size of 81.92 million, we do an enormous amount of extra work in loading the data up to 81.92 million and then performing 12 binary search steps to get down to 100k; this is visible in the very high ratio.

For an RDS value of 1 million, GC requires 7 chunk loads and 6 binary searches, whereas static chunking with 1.28 million requires 1 chunk load and 5 binary searches. As expected, static chunking with 1.28 million performs well at an RDS of 1 million. Static chunking with 10k and with 81.92 million compare worse because of the high loading and linear time (for 10k) and the high loading and binary search time (for 81.92 million). Beyond 1 million, static chunking with 10k is unable to complete within the capping time, and we therefore see a flat line at 8,500,000 ms (the capping time).

For the larger RDS values of 50 and 70 million, static chunking with 81.92 million performs well, since only one chunk load is required and therefore only 2 linear comparisons are performed, whereas GC requires extra linear comparisons because it needs to load 13 chunks to reach 81,910,000. Additionally, GC requires 12 binary searches compared to 10 for static chunking with 81.92 million. The other two static methods perform worse because of the high number of chunk loads and linear comparisons. To give some perspective, at an RDS of 50 million, a static chunk size of 10k would load 5,000 chunks and would also perform 5,000 * 2 comparisons (Rout \ Q and Q \ Rout). The GC strategy would load tuples up to 81,910,000 in 13 loads, perform 13 * 2 linear comparisons, and perform 12 binary search steps on the larger chunk [40,950,001 - 81,910,000]. However, the cost of these extra binary searches on a larger chunk is significantly outweighed by the loading and linear time of the static approach.

In general, we observe that the GC strategy is able to catch up to any static chunking strategy, irrespective of the location of the RDS, because of the geometric growth of its chunks.

In Figure 5.7, we observe a drop in the running time of the GC strategy at an RDS value of 50 million. Here we load 12 chunks up to 40,950,000, perform 2 * 12 linear comparisons and no binary searches. This is because, for Q22, no new tuple gets added to Rout beyond timestamp 32,386,674 (Q22 is a join between the Customer and Orders relations; the way the query is structured, all of its tuples get added early, by timestamp 32,386,674). Hence Rout at 40,950,000 is the same as Rout at 50 million and at 70 million; in fact, all Routs beyond 32,386,674 are the same. The algorithm returns a witness of validity at timestamp 40,950,000 for both RDS values of 50 and 70 million and therefore requires no binary searches. This explains the drop at 50 million and the flat curve from 50 to 70 million for GC. We observe a similar phenomenon for Q23 in Figure 5.8, where there is a drop at 70 million for the GC strategy: no new tuple gets added to Rout beyond 60,950,000, so no binary search is required and a witness of validity is obtained through the linear comparisons at the end of the 13th chunk, i.e. at 81,910,000.

In general, static chunking performs well for RDS values close to its chunk size. However, no single static chunk size performs better than GC across the whole range of RDS values, and since we have no way of knowing beforehand where the RDS lies, we cannot choose a single static chunk size that would perform well.

Figure 5.6: Static versus Geometric Chunking for Q15 (RDS on the x-axis is shown in log scale).
Figure 5.7: Static versus Geometric Chunking for Q22 (RDS on the x-axis is shown in log scale).
Figure 5.8: Static versus Geometric Chunking for Q23 (RDS on the x-axis is shown in log scale).
Figure 5.9: Static versus Geometric Chunking for Q15. Ratio = time of approach / MIN(.). Both axes are shown on log scale.
Figure 5.10: Static versus Geometric Chunking for Q22. Ratio = time of approach / MIN(.). Both axes are shown on log scale.
Figure 5.11: Static versus Geometric Chunking for Q23. Ratio = time of approach / MIN(.). Both axes are shown on log scale.
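The loading step referred to above copies the next slice of the database log into the working copy of each relation against which the candidate queries are evaluated. As a rough sketch (assuming, hypothetically, that each relation of the log is stored in a table such as LINEITEM_LOG carrying the TS column from Section 5.1.1, and that the working copy LINEITEM has the same schema), loading the chunk (630,000, 1,270,000] amounts to one statement per relation:

-- Append the tuples of the next chunk to the working copy of LINEITEM;
-- analogous statements are issued for the other seven relations.
INSERT INTO LINEITEM
SELECT *
FROM LINEITEM_LOG
WHERE TS > 630000 AND TS <= 1270000

Binary search within a chunk follows the same pattern, with successively narrower timestamp ranges.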
5.4 Comparison with the Oracle

We define the Oracle approach as the scenario where an oracle provides us with the exact location of the RDS, almost like a lucky guess. The chunk size in this case is exactly the RDS value, so that the data up to the RDS is loaded in one single chunk, all at once. We run this experiment at RDS = 1 million, 5 million, 10 million, 30 million and 70 million for the three default queries Q15, Q22 and Q23. The Geometric Chunking strategy is run with the default parameters, i.e. c = 2 and base chunk size = 10k. Figure 5.12 shows the distribution of the ratio, defined as the time taken by the Geometric Chunking approach to the time taken by the Oracle approach. The histogram shows the percentage of cases where the ratio falls within each bin. We observe that in more than 50% of the cases, GC performs within 3 times of the Oracle. Each bin in the histogram is further divided according to the contribution from each RDS value (represented by a colored pattern). The cases where the ratio is greater than 3 arise largely for smaller RDS values, where the absolute loss is small anyway.

Figure 5.12: Histogram of Oracle versus Geometric. Ratio = total time taken by Geometric / total time taken by Oracle.

5.5 Impact of Log Index

In this experiment, we try to understand the performance improvement of the GC strategy in the presence of a Log Index (LI) on the database log. A log index brings us very close to the RDS; how close depends on how good the index is. We simulate this by jumping to the timestamp provided by the Log Index, i.e. loading the data up to the jump in one single chunk, performing 2 linear comparisons (Rout \ Q(Djump) and Q(Djump) \ Rout) at the end of the jump, and then proceeding with the usual GC beyond this point. We run this experiment for LI jumps at 100k, 1 million, 10 million and 50 million. Figures 5.13, 5.14 and 5.15 depict the ratio of the running time of GC with a Log Index to the running time of GC without a Log Index, for Q15, Q22 and Q23 respectively. To understand the trends in these plots, we need to look at the chunk boundaries for each strategy.

Geometric chunking:
[0, 10000, 30000, 70000, 150000, 310000, 630000, 1270000, 2550000, 5110000, 10230000, 20470000, 40950000, 81910000, 163830000]

GC with LI (100k):
[0, 100000, 110000, 130000, 170000, 250000, 410000, 730000, 1370000, 2650000, 5210000, 10330000, 20570000, 41050000, 82010000, 163930000]

GC with LI (1 million):
[0, 1000000, 1010000, 1030000, 1070000, 1150000, 1310000, 1630000, 2270000, 3550000, 6110000, 11230000, 21470000, 41950000, 82910000]

GC with LI (10 million):
[0, 10000000, 10010000, 10030000, 10070000, 10150000, 10310000, 10630000, 11270000, 12550000, 15110000, 20230000, 30470000, 50950000, 91910000]

GC with LI (50 million):
[0, 50000000, 50010000, 50030000, 50070000, 50150000, 50310000, 50630000, 51270000, 52550000, 55110000, 60230000, 70470000, 90950000]

Figure 5.13: Log Index with different initial jumps versus Geometric for Q15. X-axis shown in log scale. Y-axis shows Ratio = time taken by approach / Geometric.
Figure 5.14: Log Index with different initial jumps versus Geometric for Q22. X-axis shown in log scale. Y-axis shows Ratio = time taken by approach / Geometric.
Figure 5.15: Log Index with different initial jumps versus Geometric for Q23. X-axis shown in log scale. Y-axis shows Ratio = time taken by approach / Geometric.

In Figure 5.13, for the 100k jump at RDS 100k, the ratio is very small, as expected, because GC with LI-100k loads all the data up to 100k in one chunk and performs 2 comparisons (Rout \ Q(D100k) and Q(D100k) \ Rout), whereas GC with default parameters performs 4 chunk loads and 3 binary searches to find the RDS.
However, we observe that GC with default parameters catches up quickly for the later RDS values.

For GC with LI-1 million, a similar explanation holds for the small ratio at RDS 1 million. For the later RDS values, GC with the log index performs one additional chunk load; this results in ratios greater than 1, although the difference in absolute running times is small.

GC with LI-10 million has ratios below 1 for RDS values of 10, 30 and 50 million. At RDS 10 million, everything is loaded in one chunk, as opposed to the 10 chunk loads and 9 binary searches performed by GC. At RDS 30 million, GC with LI loads data up to 30,470,000, whereas the default GC loads a larger chunk, up to 40,950,000; GC with LI thus performs one fewer binary search, on a smaller chunk. A similar explanation holds at RDS 50 million: GC with LI loads data up to 50,950,000, whereas the default GC needs to load a larger chunk, up to 81,910,000. At RDS 70 million, GC with LI now needs to load one extra chunk, up to 91,910,000, and the linear comparisons here are also more expensive since they are performed on more data. This leads to the higher ratio at 70 million.

Note that in Figure 5.14 the ratios for the last two RDS values, 50 million and 70 million, are very similar for every strategy. Recall from the discussion in Section 5.3 that Rout for Q22 does not change beyond timestamp 32,386,674, so in essence the algorithm provides a witness of validity at the same state for both 50 million and 70 million.

To conclude, GC catches up very quickly to GC with a Log Index. Moreover, maintaining a Log Index is itself expensive; if we were to include the cost of maintaining it, GC with a Log Index would not perform much better than the default GC.

5.6 Scalability in terms of available memory

We study the behaviour of our approach under limited memory. We limit the memory provided to SQL Server by modifying its maximum available memory setting. This experiment simulates the practical scenario where the entire database does not fit in memory. We decrease the memory from 20GB down to 2GB. Figure 5.16 shows an increase in running time for queries Q15 and Q23 as we decrease the memory (as expected, because of the increase in I/O swaps). However, Q22 takes almost the same time down to 7GB, below which decreasing the memory increases the running time. This is because Q22 is a join between the Customer and Orders relations; since these tables are relatively small, they fit in memory even when the available memory is 7GB, after which the increase in I/O swaps is reflected in the increased running time.

To conclude, our approach is robust in dealing with large databases that do not fit in memory. This is particularly useful in a streaming database setting, where the entire database may not be available in memory.

Figure 5.16: Varying the available memory provided to SQL Server.
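One way to impose such a cap on SQL Server (a sketch of the configuration we assume, not necessarily the exact commands used) is through the max server memory option; for example, for the 5 GB setting:

-- Expose advanced configuration options
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
-- Cap SQL Server's memory usage at 5 GB (the value is in MB)
EXEC sp_configure 'max server memory (MB)', 5120
RECONFIGURE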
5.7 Varying Configuration

5.7.1 Impact of Base Chunk Size

We run our experiments with a default base chunk size of 10,000. In this experiment, we study the effect of varying the base chunk size at an RDS of 10 million, varying it from 1k to 1 million. A very small base chunk size such as 1k does not scale up very well; its chunk boundaries are

[0, 1000, 3000, 7000, 15000, 31000, 63000, 127000, 255000, 511000, 1023000, 2047000, 4095000, 8191000, 16383000, 32767000, 65535000, 131071000]

which takes 14 chunk loads to load the data up to 10 million, compared to 10 taken by GC with the default parameters. Moreover, GC with base chunk size 1k creates very large chunks towards the end, which result in an increased number of binary search steps. However, any reasonable base chunk size gives very similar results, as shown in Figure 5.17, and the behaviour of our approach does not change dramatically with the base chunk size. Hence, our approach is robust to changes in base chunk size.

Figure 5.17: Varying the base chunk size. X-axis is in log scale.

5.7.2 Tuning the geometric multiplier c

Empirically, we have observed that the geometric multiplier c = 2 performs best in terms of striking a good balance between the number of chunk loads and binary searches. However, changing the geometric multiplier does not drastically change the results, as seen from Figure 5.18. This shows that our approach is also robust to changes in c.

Figure 5.18: Varying c. X-axis showing RDS is in log scale.

5.7.3 View Maintenance

We take advantage of the view maintenance functionality provided by modern database systems and incrementally maintain the view for Q(Dr). For systems which do not provide view maintenance, Q(Dr) must be computed from scratch every time Rout \ Q and Q \ Rout need to be evaluated. Table 5.2 shows that there is a significant performance gain from using incremental view maintenance over computing the query from scratch. The numbers shown are for an RDS of 10 million. Percentage improvement is calculated as (time without view maintenance - time with view maintenance) / time without view maintenance * 100.

Table 5.2: Performance improvement using view maintenance.
  Query   Total time (ms),            Total time (ms),          Percentage
          without view maintenance    with view maintenance     improvement
  q15     344,818                     259,267                   24.81
  q22     65,825                      55,278                    16.02
  q23     316,567                     224,480                   29.09
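In SQL Server, such incremental maintenance can be obtained through indexed views: a schema-bound view whose result is materialized by a unique clustered index and kept up to date automatically as the underlying tables receive new chunks. A minimal sketch, using the Customer-Orders join from the valid query of Section 5.1.2 and hypothetical object names (indexed views require schema binding and two-part table names), is:

CREATE VIEW dbo.V_Q WITH SCHEMABINDING AS
-- Result of the candidate query over the currently loaded prefix of the log
SELECT C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
FROM dbo.CUSTOMER
JOIN dbo.ORDERS ON C_CUSTKEY = O_CUSTKEY
WHERE C_NATIONKEY = 3
GO
-- Materializes the view; SQL Server then maintains it incrementally
-- as each new chunk is inserted into CUSTOMER and ORDERS.
CREATE UNIQUE CLUSTERED INDEX IX_V_Q ON dbo.V_Q (O_ORDERKEY)

The two EXCEPT queries sketched at the end of Section 5.2 can then be evaluated directly against dbo.V_Q after each chunk load, instead of re-evaluating the join from scratch.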
5.7.4 Full query workload versus right query workload

In our experiments, each full query workload consists of the right query, one "incorrect" subset query and one "incorrect" superset query.

Superset queries

As discussed earlier, superset queries are faster to invalidate. Consider the following two scenarios:

• RDS lies in the first chunk: For a superset query, Rout \ Qsup(Dend) is empty and Qsup(Dend) \ Rout is non-empty at τ = end, where end is the last timestamp of the first chunk. According to our algorithm, we perform a binary search between τ = 1 and τ = end. Say at τ = mid both conditions become non-empty; we then give the witness of invalidity, i.e. witQ = 〈(Dmid, Dmid+1)〉. In the worst case, this takes 2 * log2(base chunk size) comparisons (since the base chunk size is the size of the first chunk).

• RDS lies beyond the first chunk: For a superset query, Rout \ Qsup(Dend) and Qsup(Dend) \ Rout are both non-empty at τ = end, where end is the end of the first chunk. Therefore we can give the witness of invalidity at τ = end itself, without performing any binary search comparisons. The expense for query invalidation here is minimal: one chunk load and 2 comparisons at the end of the chunk.

Generally, we encounter the second scenario, where we can quickly invalidate the superset queries at the end of the first chunk.

Subset queries

Subset queries take at least as many chunk loads as the right query. In the worst case, a subset query can take 2 * (log2(last) - log2(end)) comparisons to give a witness of invalidity, where last is the last timestamp of the log and end is the last timestamp of the chunk containing the RDS. In general, we have observed that it takes one extra chunk load beyond the right query to give a certificate of invalidity. This is because Rout \ Qsub(Dr) is non-empty (because Qsub is a subset query), while Qsub(Dr) \ Rout becomes non-empty only once r lies beyond the RDS, when Qsub has incorporated extra tuples from timestamps between the RDS and r. Therefore, subset queries take longer to receive a certificate of invalidity.

Since there is no way of knowing beforehand which category a query belongs to, we assume a mix of all three in our workload. We maintain three views, one for each query in the workload. Each chunk is loaded only once, and all the queries whose validity is yet to be determined are evaluated at the end of the chunk. As soon as a query receives a witness, we drop its view and go forward with the remaining queries.

Figures 5.19, 5.20 and 5.21 show the absolute running times for the right query and for the full query workload. Observe that the full query workload does not involve a tremendous amount of overhead for the two extra queries. Figure 5.22 depicts the percentage increase in running time for the full query workload over just the right query. Barring the first RDS at 10k, all others lie within a 33% increase in running time. For the first RDS, the startup cost of loading the chunk and the end-of-chunk comparisons outweigh the advantage gained by maintaining the different views; however, the absolute times here are very small.

Figure 5.19: Right Query versus Full Query Workload: Q15. X-axis showing RDS is in log scale.
Figure 5.20: Right Query versus Full Query Workload: Q22. X-axis showing RDS is in log scale.
Figure 5.21: Right Query versus Full Query Workload: Q23. X-axis showing RDS is in log scale.
Figure 5.22: Percentage increase = (Full Query Workload - Right Query Workload) / Right Query Workload * 100. X-axis showing RDS is in log scale.

Chapter 6

Related Work

If I have seen farther it is by standing on the shoulders of Giants. — Sir Isaac Newton (1855)

The related work can be broadly classified into two areas: Query Reverse Engineering and Reverse Query Processing.

6.1 Query Reverse Engineering (QRE)

This body of work focuses on obtaining a SQL query that generates a specified result set when evaluated on a given database state. There are three recent directions that fall under the umbrella of Query Reverse Engineering, but their goals and techniques have significant differences.

6.1.1 Deriving Instance Equivalent Queries (IEQs)

Two queries Q1 and Q2 are considered IEQs w.r.t. a database state Dr if their results are equal, i.e. Q1(Dr) = Q2(Dr). The authors' initial work, Query By Output (QBO), is a data-driven approach that focuses on generating IEQs from a known input query and its result set on a given database state [19]. They extend QBO to derive IEQs where the input query is unknown, and provide support for multiple database states on a continuous log. Specifically, given a result set Rout and a sequence of database states 〈D1, D2, ..., Dr〉, the goal is to determine the most recent database state r and a query Q such that Q(Dr) = Rout. This work [20] is the most relevant to ours, so we discuss it in greater detail. It aims to solve a more general problem, targeting query discovery rather than query validation, and considers SPJ queries as well as aggregates and unions. They model the different database states using a backward delta storage organization [17]. Given a sequence of database states 〈D1, D2, ..., Dr〉, the database stores the most recent state Dr together with backward deltas δr(r-1), ..., δ21.
A backward delta models a set of insert and delete operations such that Dr-1 is derived from Dr and δr(r-1). Since their work considers update operations involving both additions and deletions, they are unable to take advantage of the natural monotonicity of a data streaming environment. Our Backward Naive baseline strategy is inspired by their approach. From the family of queries that they cover, we focus on SPJ queries, and we show in Section 5.2 that their approach is unable to scale beyond a few database states. However, their discovery of IEQs can serve as an input to our candidate query set Q.

A subproblem of QBO is considered in the View Definitions Problem (VDP) [9]. This is a selection-condition discovery problem for a view V over a single relation R, where R and V share the same schema. Their discovery of Q amounts to discovering selection predicates on R that generate V, without any joins or projections. Our work is able to validate queries involving selection conditions, and hence VDP can serve to generate candidate queries for our work.

Another work that belongs to QRE and is seemingly relevant to ours is reverse engineering top-k database queries [15]. This work addresses the problem of reverse engineering top-k queries, given a result set Rout and a relation R. It also deals with a single relation and tries to find selection conditions that produce Rout. The authors present a probabilistic model that assesses the chance that a candidate query Q evaluates exactly to, or close to, Rout. We are particularly interested in their approach to query validation. They order the candidate queries by their expected suitability to evaluate to Rout and claim that this approach promises to find a valid query early. They also discuss smart query validation where, instead of evaluating each query sequentially in the order given by the ranking, they take advantage of information learnt from executing previous queries. Suppose a query Q1 was executed (based on the ranking) and produced results very similar to Rout but not an exact match. In such a scenario, they consider validating queries that are similar to Q1, rather than simply following the ranked list. The drawback of their approach is that it incorporates many false-positive predicates into the candidate queries. For evolving databases (because of updates, inserts or deletes in a data warehouse scenario), their approach further introduces false negatives. They claim to combat this using the previously mentioned smart validation. However, even in the smart validation strategy, they resort to evaluating the candidate queries on the entire database state, which is an expensive operation. Instead, our GC approach could work on top of their smart validation strategy and help discard invalid queries early in the log by failing fast.

6.1.2 Query from Examples (QFE)

This body of work was introduced in [12] and helps non-expert database users construct SQL queries through user feedback. The user initially provides a database state-result set pair (Dr, Rout) as input. QFE iteratively presents the user with database state-result set pairs which are close to the input pair, and the user is asked to determine whether each new pair is consistent with her desired query Q. A similar approach is taken in AIDE [10], where the desired query Q is developed based on the user's feedback on samples of database tuples.
Based on user feedback, the system generates a new set of database tuples. The query is presented to the user when she wishes to terminate this process after a few iterations. These approaches are similar to our work in that the query is unknown. However, these are all query generation approaches, and our work can be applied on top of them to validate the candidate queries they generate.

Abouzied et al. [2] discuss learning and verifying a special class of Boolean queries called qhorn queries. We are particularly interested in the verification part of their work. This is also a query-from-example approach, where the learning algorithms pose membership questions to the user and she classifies each data object as an answer or a non-answer. The input to the verification model is a candidate query Qc and the user's intended query Qi. The algorithm generates a set of membership questions for each query; if Qi is semantically different from Qc, then their answers will differ on at least one question and we can safely invalidate the query.

6.1.3 Targeted Query Generation

Bruno et al. [8] and Mishra et al. [14] study the problem of generating test queries that satisfy certain cardinality constraints on their subexpressions.

6.2 Reverse Query Processing

A slightly tangential body of work discusses Reverse Query Processing. The problem addressed here is generating databases that satisfy a set of constraints, given a set of queries Q and a corresponding result set Rout. Binnig et al. [5] introduced the problem: given a query Q ∈ Q and the corresponding result set Rout ∈ Rout, generate a database D such that Q(D) = Rout. QAGen and its variants consider the problem: given an input query Q and a set of cardinality constraints on the subexpressions of Q's evaluation plan P, generate a test database D such that P's execution on D satisfies those constraints [6, 13, 3]. Unlike the above works on database generation, our work focuses on finding a database state-query pair in a continuous log.

Chapter 7

Conclusion

In this work, we have proposed a framework to validate monotone SPJU queries on a continuous log. This framework can be used as a standalone application to assist in data mining and data analysis tasks, or in conjunction with the various Query Reverse Engineering (QRE) problems. From the previous work on QRE, we have observed that most approaches focus on query discovery, where query validation becomes a bottleneck. Our proposed approach, Geometric Chunking, addresses this by failing fast on invalid queries and quickly finding the RDS for valid queries. We have performed experiments on a 10 GB data set and shown that our approach is scalable and robust to changes in its parameters. It deals well with limited-memory conditions and can operate in scenarios where the available memory is one-fifth the size of the database. We also compare against naive baselines and static chunking, and show that the baselines fail to scale while no single static chunk size performs well across the range of RDS values. As future work, a challenging issue that requires further study is to validate queries involving non-monotonicity, aggregation operators and arithmetic expressions.

Bibliography

[1] S. Abiteboul. Foundations of Databases. Addison-Wesley, Reading, Mass., 1995. → pages 6

[2] A. Abouzied, D. Angluin, C. Papadimitriou, J. M. Hellerstein, and A. Silberschatz. Learning and verifying quantified boolean queries by example. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '13, pages 49-60, New York, NY, USA, 2013. ACM. → pages 53

[3] A. Arasu, R. Kaushik, and J. Li.
Data generation using declarative constraints. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 685-696, New York, NY, USA, 2011. ACM. → pages 54

[4] A. Baid, A. Balmin, H. Hwang, E. Nijkamp, J. Rao, B. Reinwald, A. Simitsis, Y. Sismanis, and F. van Ham. DBPubs: multidimensional exploration of database publications. PVLDB, 1(2):1456-1459, 2008. → pages 2

[5] C. Binnig, D. Kossmann, and E. Lo. Reverse query processing. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 506-515. IEEE, 2007. → pages 54

[6] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu. QAGen: Generating query-aware test databases. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 341-352, New York, NY, USA, 2007. ACM. → pages 54

[7] J. A. Blakeley, P.-A. Larson, and F. W. Tompa. Efficiently updating materialized views. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, SIGMOD '86, pages 61-71, New York, NY, USA, 1986. ACM. → pages 11, 22, 24

[8] N. Bruno, S. Chaudhuri, and D. Thomas. Generating queries with cardinality constraints for DBMS testing. In Transactions on Knowledge and Data Engineering. IEEE Computer Society, January 2006. → pages 54

[9] A. Das Sarma, A. Parameswaran, H. Garcia-Molina, and J. Widom. Synthesizing view definitions from data. In Proceedings of the 13th International Conference on Database Theory, ICDT '10, pages 89-103, New York, NY, USA, 2010. ACM. → pages 52

[10] K. Dimitriadou, O. Papaemmanouil, and Y. Diao. AIDE: an automated sample-based approach for interactive data exploration. CoRR, abs/1510.08897, 2015. → pages 53

[11] L. Golab, T. Johnson, and V. Shkapenyuk. Scalable scheduling of updates in streaming data warehouses. IEEE Transactions on Knowledge and Data Engineering, 24:1092-1105, 2011. → pages 5

[12] H. Li, C. Chan, and D. Maier. Query from examples: An iterative, data-driven approach to query construction. PVLDB, 8(13):2158-2169, 2015. → pages 53

[13] E. Lo, N. Cheng, and W.-K. Hon. Generating databases for query workloads. Proc. VLDB Endow., 3(1-2):848-859, Sept. 2010. → pages 54

[14] C. Mishra, N. Koudas, and C. Zuzarte. Generating targeted queries for database testing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 499-510, New York, NY, USA, 2008. ACM. → pages 54

[15] K. Panev, E. Milchevski, and S. Michel. Computing similar entity rankings via reverse engineering of top-k database queries. In Data Engineering Workshops (ICDEW), 2016 IEEE 32nd International Conference on, pages 181-188. IEEE, 2016. → pages 52

[16] R. Schumacher. An extended version of Faulhaber's formula. J. Integer Sequences, 19, 2016. → pages 22

[17] M. Stonebraker and L. A. Rowe. The design of POSTGRES. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, SIGMOD '86, pages 340-355, New York, NY, USA, 1986. ACM. ISBN 0-89791-191-1. → pages 23, 52

[18] TPC. TPC-H benchmarks. URL http://www.tpc.org/ (visited on 24/08/2017). → pages 25, 29

[19] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query by output. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 535-548, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-551-2. → pages 51

[20] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query reverse engineering. The VLDB Journal, 23(5):721-746, Oct. 2014. → pages 2, 10, 52

[21] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and D. Srivastava.
Reverse engineering complex join queries. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 809-820, New York, NY, USA, 2013. ACM. → pages 26