A Streaming Algorithms Approach to Approximating Hit Rate Curves

by

Zachary Drudi

B. Math, University of Waterloo, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

October 2014

© Zachary Drudi, 2014

Abstract

In this work, we study systems with two levels of memory: a fixed-size cache and a backing store, each of which contains blocks. In order to serve an IO request, the requested block must be in the cache. If the block is already in the cache when it is requested, the request is a cache hit. Otherwise it is a cache miss, and the block must be brought into the cache. If the cache is full, a block must be evicted from the cache to make room for the new block. A replacement policy determines which block to evict. In this work, we consider only the LRU policy. An LRU cache evicts the block which was least recently requested.

A trace is a sequence of blocks, representing a stream of IO requests. For a given trace, a hit rate curve maps cache sizes to the fraction of hits that such a cache would achieve on the trace. Hit rate curves have been used to design storage systems, partition memory among competing processes, detect phases in a trace, and dynamically adjust heap size in garbage-collected applications.

The first algorithm to compute the hit rate curve of a trace over a single pass was given by Mattson et al. in 1970. A long line of work has improved on this initial algorithm. The main contribution of our work is the presentation and formal analysis of two algorithms to approximate hit rate curves. Inspired by recent results in the streaming algorithms community on the distinct elements problem, we use memory-efficient probabilistic counters to estimate the number of distinct blocks in a subsequence of the trace, which allows us to approximate the hit rate curve using sublinear space. We also formally state some variants of the hit rate curve approximation problem which our algorithms solve, and derive lower bounds on the space complexity of these problems using tools from communication complexity.

Preface

This thesis is the formal analysis complement of the more systems-oriented work published as J. Wires, S. Ingram, N. J. A. Harvey, A. Warfield, and Z. Drudi, "Characterizing storage workloads with counter stacks," in the 11th USENIX Symposium on Operating Systems Design and Implementation. The ideas behind the algorithms described in Chapter 3 are common to both works, and were developed together by the above authors.

The analysis in Chapters 3, 4, and 5 is original, unpublished work. The implementation used for the experiments in Chapter 6 is the same as that analyzed in the published paper mentioned above.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
    2.1 Exact Algorithms
    2.2 Approximation Algorithms
3 Presentation and Analysis of Algorithms
    3.1 Preliminaries
        3.1.1 Algorithmic guarantees
    3.2 The Working Set Data Structure
    3.3 Deterministic Algorithms for Computing Hit Rate Curves
    3.4 A Streaming Algorithm for Approximating Hit Rate Curves
        3.4.1 Implementing Algorithm 3 using probabilistic counters
        3.4.2 Improved space by pruning redundant counters
4 A Unified Representation
    4.1 Distinct Elements and Probabilistic Counters
    4.2 Counter Packing
    4.3 Analysis
5 Lower Bounds
    5.1 Communication Complexity
    5.2 Gap Hamming Distance
    5.3 HRC Lower Bounds
6 Experimental Results
    6.1 Microsoft Research Traces
    6.2 Average Footprints and Phase Changes
7 Conclusion
Bibliography

List of Tables

Table 6.1  Average and maximum error for avgfp and cs.

List of Figures

Figure 6.1  Hit rate curves for MSR traces. Cache size in GBs against hit rate percentage.
Figure 6.2  Hit rate curves for cyclic trace. Cache size in GBs against hit rate percentage.

Acknowledgments

I would like to thank my supervisor, Nick Harvey, for his encouragement, optimism, and guidance. He is an inexhaustible source of ideas, and his enthusiasm helped me through many an unproductive lull.

I am indebted to Andrew Warfield for encouraging my interest in systems, and for taking me on as an intern at Coho Data, where I worked from February to August of 2014.

I would like to thank the remaining members of the team at Coho Data, Stephen Frowe Ingram and Jake Wires. Stephen was the friend at my side who kept me sane whenever I fell into a seemingly bottomless hole of circular dependencies, misconfigurations, and versioning problems. Jake Wires tolerated my absolute ignorance of the tools and techniques used in the systems world, and taught me many practical tricks.

Finally, I would like to thank my family for their long-distance but robust support.

Chapter 1
Introduction

Most computer systems use multiple levels of storage. CPU caches permit very efficient read and write operations, speeding the execution of processes, but are limited in size by their high cost. Main memory is slower, cheaper, and larger, but again too expensive to provide all the storage requirements of a system. Magnetic disks or flash drives provide very high capacity storage, but at significantly lower speeds.

Given this memory hierarchy, much work has been done to make the most efficient use of available resources.
To simplify the discussion, we assume there are only two levels of memory: a fast, small cache, and a larger backing store. Both the cache and the store contain data organized into fixed-size blocks. During execution, a process issues block requests. If a requested block is in the cache, we have a cache hit, and the process can continue execution. Otherwise, a cache miss occurs, and the requested block must be brought into the cache before the process can resume. If the cache is full, a block must be evicted from the cache into the store. A replacement policy determines how this block is chosen. To simplify matters, in the following we will consider just the trace, which is the sequence of block requests made by a process, and ignore the process itself.

Belady [4] did early, foundational work on comparing different replacement policies by simulation on request traces. The most widely used replacement policy is LRU, which stands for Least Recently Used. When a cache miss occurs, the LRU policy evicts the block whose last request occurred before the last request of any other block in the cache. Although implementing LRU involves dynamically reordering blocks on every request and is somewhat expensive, there are variants of LRU that are both simple and efficient to implement in practice, and these are widely used.

Given a replacement policy such as LRU and a trace, the hit rate curve is a function mapping cache sizes to the fraction of hits for that cache size.

Mattson et al. [23] introduced the notion of a stack algorithm, which characterizes a class of replacement policies (including LRU), and gave a simple algorithm that can compute the hit rate curve for a fixed stack algorithm in a single pass over the trace. The algorithm computes the stack distance of each request, and tallies them to produce the hit rate curve after the trace has been processed. The stack distance of a request to a block b is the number of distinct blocks that occurred in the trace between the current request to b and the last request to b. The stack distance of a request corresponds to the minimum size of an LRU cache that would produce a hit on the request. If there is no earlier request to b, the stack distance is defined to be infinite, as a request to a new block is necessarily a miss regardless of cache size. In the Mattson et al. algorithm, a linked list maintains the blocks that have been previously requested. To process a new request for a block b, the linked list is linearly searched for b. If b is found, its position in the linked list is the stack distance of the request. The block b is then removed from the linked list and reinserted at the head of the list. If b is not found in the list, then this is the first request for b, and the request is assigned a stack distance of infinity. After the trace is processed, the value of the hit rate curve for a given cache size C is simply the number of stack distances less than or equal to C, normalized by the total number of requests.
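A minimal Python sketch of this procedure may help make it concrete (a list stands in for the linked list; this is an illustration of the method described above, not the implementation evaluated in Chapter 6):

    import math
    from collections import Counter

    def mattson_hit_rate_curve(trace):
        """One-pass stack-distance computation, as described above.

        Returns a dict mapping each cache size to the fraction of requests
        with stack distance at most that size."""
        stack = []                 # most recently requested block at the head
        histogram = Counter()      # stack distance -> number of requests
        for block in trace:
            if block in stack:
                dist = stack.index(block) + 1   # position in list = stack distance
                stack.remove(block)
            else:
                dist = math.inf                 # first request: a miss at any size
            histogram[dist] += 1
            stack.insert(0, block)              # reinsert at the head

        m = len(trace)
        return {c: sum(v for d, v in histogram.items() if d <= c) / m
                for c in range(1, len(stack) + 1)}

    # Example: a repeated scan of three blocks hits only once the cache
    # holds all three: {1: 0.0, 2: 0.0, 3: 0.5}.
    print(mattson_hit_rate_curve(["a", "b", "c", "a", "b", "c"]))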
While simple, this algorithm is quite inefficient. If the trace has length m and the blocks come from a store of size n, each request takes O(n) time to process. A series of more efficient algorithms has been introduced by authors interested in CPU caches ([5], [26], [28], [1], [13], [27], [34], [15], [25]). The traces studied are produced by a processor executing a given program, and each request corresponds to a memory location. The cache is the CPU-level cache, while the backing store is main memory. For these traces, m is typically on the order of billions, even for short program executions of a minute or less, while n is a fraction of main memory. In this setting, online algorithms which process each request as it is made by the process are attractive because the trace doesn't need to be recorded. These algorithms must be very efficient or they will impose unacceptable overhead on program execution. Accordingly, past work has frequently focused on time complexity, while space complexity has received comparatively little attention.

In this work, we approach the problem motivated by storage-level traces. These traces correspond to requests made to the file system. In this setting, the backing store is the permanent storage medium of the system, which may be a magnetic disk or flash drive. The number of distinct blocks in the backing store, n, may exceed main memory, so we are very sensitive to space usage. Our contribution is the presentation and analysis of approximation algorithms that compute hit rate curves using sublinear space. The main ingredient in our approach is a line of work from the streaming algorithms community on algorithms to count the number of distinct elements in a stream using sublinear space ([16], [2], [3], [14], [17], [21]). Using these counting algorithms as a black box, we can estimate the stack distances of requests in the trace and produce an approximate hit rate curve. Unlike prior work on approximate hit rate curves, we derive precise error bounds for our algorithms. Furthermore, using results from communication complexity, we give lower bounds on the space complexity of algorithms solving several variants of the hit rate curve problem.

Chapter 2
Related Work

Researchers have done a substantial amount of work exploring improved versions of the Mattson algorithm first proposed in [23], as well as entirely new techniques for computing hit rate curves. Here we discuss some of these contributions. We first discuss exact algorithms, and then we describe some recent approximation algorithms.

2.1 Exact Algorithms

Bennett and Kruskal [5] improved on the original Mattson algorithm by replacing the linear list with a k-ary tree. The leaves of the tree represent requests, and store 1 or 0 depending on whether the request corresponds to the last request of the block. Interior nodes store the sum of their children. The last time each block was requested, with respect to a given point in the trace, is stored in a hash table, which is updated after processing each request. The stack distance of a request is computed by looking up the block to find the time it was last requested, and finding the sum of the corresponding subtree. The time complexity to process a request is thus O(log(m)). This general framework, using a hash table to store the last access time of each block and an auxiliary tree-based data structure to compute the stack distance, was used by many later authors ([26], [28], [1], [13], [25]).

Almási, Caşcaval, and Padua [1] proposed some variants of the Bennett and Kruskal algorithm, counting the number of "holes" (leaves containing 0) instead of the number of 1s in the leaves, and using an interval tree data structure (implemented using either an AVL or red-black tree) to store the locations of the holes. Their modifications were more efficient in practice, although they had the same asymptotic complexity. Sugumar and Abraham [28] used splay trees instead.

Niu et al. presented a parallel algorithm [25].
Given p processors, the trace is split into p chunks, with the kth chunk going to the kth processor. The computation is performed in a series of rounds. In each round, every processor uses a variant of the sequential Mattson algorithm to process its chunk of the trace, storing the stack distances in a local copy of the stack distance histogram. Requests to blocks which occur for the first time in a given chunk may have occurred in a previous chunk of the trace. To determine their true stack distances, these requests are added to a queue and passed to the previous processor at the end of the round. After at most O(p) rounds, all stack distances have been found, and the local histograms are aggregated to produce the complete histogram.

2.2 Approximation Algorithms

By dynamically resizing the tree, Ding and Zhong [13] approximate the hit rate curve by approximating the stack distance of each request with multiplicative error ε. For each block, their algorithms record the time range that the last access belongs to instead of the exact time. The tree structure itself only consumes O(log(n)/ε) space, but they still require a hash table mapping each block to the last time in the trace it was requested, requiring O(n) space. The trace can be preprocessed to include this information with each request, at the cost of O(m log n) time.

Zhong and Chang augment the algorithm of Ding and Zhong with sampling [34]. The trace is divided into alternating sampling and hibernating intervals. The hash table of last access times is always updated, but the stack distance of a given request is only recorded if the latest prior access occurred in a sampling interval. They also tweak the tree bookkeeping of Ding and Zhong's algorithm by merging all requests from a given hibernating interval into a single leaf node, recording only the number of distinct blocks requested in the interval.

Shen et al. use a probabilistic model to estimate the stack distance histogram [27]. Using a histogram of time distances (the number of requests between two successive requests to the same block) and the number of distinct blocks accessed in the trace, they estimate the stack distance histogram. Their experimental results show good accuracy, but they do not provide analytical error bounds.

Eklov and Hagersten combine trace sampling with approximation of stack distances from reuse distances [15]. Building on their work in [6], their method finds the average stack distance among all reuse windows of a fixed size by using the distribution of forward reuse distances.

Xiang et al. [33], building on earlier work [32], use an approach based on averaging. They define the footprint of a window of the trace to be the number of distinct blocks accessed during the window. For a given window size w, the average footprint is the average number of distinct blocks accessed in windows of size w. They calculate the average footprint for a logarithmic scale of window sizes by recording the reuse distances of requests. Inverting the average footprint function yields an approximation to the unnormalized hit rate curve. While their approach is fast and accurate, their memory requirements are linear in n, and it is unclear whether their averaging technique maintains the precise error bounds of an earlier, slower approach that measures all footprints [32]. A brute-force illustration of the footprint definition is sketched below.
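The following deliberately naive sketch computes average footprints by brute force; the algorithm of Xiang et al. obtains all window sizes in linear time via reuse distances, so this O(mw) version is for illustration only:

    def footprint(window):
        """Footprint of a trace window: the number of distinct blocks it touches."""
        return len(set(window))

    def average_footprint(trace, w):
        """Average footprint over all length-w windows of the trace."""
        windows = [trace[i:i + w] for i in range(len(trace) - w + 1)]
        return sum(footprint(win) for win in windows) / len(windows)

    # The map w -> average_footprint(trace, w) is increasing in w; inverting it
    # yields an approximation to the unnormalized hit rate curve.
    print(average_footprint([1, 2, 1, 3, 1, 2], 3))   # 2.5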
We begin by establishing some notation, and introducingthe working set data structure abstraction. In section 3.3, we present deterministicalgorithms for computing hit rate curves. In section 3.4 we modify these usingwork from the streaming algorithms community. In Theorem 2, our main result ofthis chapter, we show that the resulting algorithm computes the hit rate curve withadditive error ε , and in Corollary 6 we show it uses only O(poly(1/ε, log(nm)))space.3.1 PreliminariesTo state our results precisely, let us fix some notation. We will use n to representthe size of the store from which blocks are drawn, and m to represent the tracelength. We will identify blocks with integers from the set [n] = {i : 1≤ i≤ n}.The set of requested blocks between time t ′ and strictly before time t is:B(t ′, t) ={bi : i ∈ [m] and t′ ≤ i < t}.7At time t, the most recent request for block bt occurred at timeR(t) = max{ x : x < t and bx = bt } .We define R(t) = −∞ if bt was not requested before time t. The stack distance ofthe request at time t isD(t) =|B(R(t), t)| (if R(t)>−∞)∞ (otherwise)A cache of size k has a hit at time t if and only if D(t) ≤ k. The hit rate curve isthe function C : [n]→ [0,1] where C(k) is the hit rate for a cache of size k. ThusC(k) = |{ t ∈ [m] : D(t)≤ k }|/m.In this work we are concerned with computing the hit rate curve at ` uniformly-spaced points, where ` is a parameter. For simplicity, assume that n = `∆, n = `∆where ∆is an integer. The histogram of D is the function H : [`]→ N whereH(i) = |{ t ∈ [m] : (i−1)∆< D(t)≤ i∆ }|. (3.1)The fraction of requests that are hits with a cache of size x∆ is ∑xi=1 H(i)/m. Thehit rate curve at the desired ` uniformly-spaced points isC(x∆) =x∑i=1H(i)/m ∀x ∈ [`].3.1.1 Algorithmic guaranteesAll algorithms in Sections 3.4 and 4.2 produce a function Cˆ that satisfiesC((x−1)∆)− ε ≤ Cˆ(x∆) ≤ C(x∆)+ ε ∀x ∈ [`+1] (Weak-Guarantee)8with high probability. In fact, the algorithms of Sections 3.4.1 and 4.2 actuallysatisfy the guaranteeC((x−1)∆)− εx/` ≤ Cˆ(x∆) ≤ C(x∆)+ εx/` ∀x ∈ [`+1].(Strong-Guarantee)The algorithms of Sections 3.4.2 and 4.2 use space poly(`,1/ε, log(nm)).3.2 The Working Set Data StructureTo streamline our presentation, we introduce an abstract data type called a workingset data structure. In the following section, we will describe a single algorithm tocompute hit rate curves, given an implementation of a working set data structure.Then, we will present a series of implementations of working set data structures.The notion of a working set comes from Denning [11]. The working set of thetrace between times t ′ and t is simply B(t ′, t). As the stack distance D(t) is definedin terms of |B(R(t), t)|, if we knew |B(t ′, t)| for all t ′, t we could compute the stackdistance for every request in the trace. A working set data structure gives estimatesof |B(t ′, t)|, enabling a client to estimate stack distances and thus the hit rate curve.A working set data structure supports two operations, REGISTER(t,b), whichrecords that block b was requested at time t, and GETWORKINGSETSIZE(t), whichestimates the number of distinct blocks requested since time t.The simplest way to implement a working set data structure is to use a counter,which is an abstract data type that computes the number of distinct elements in adata stream. This data type supports two operations, INSERT and QUERY, whichreturns the number of distinct elements that were inserted. 
Pseudocode illustratingthis is shown in Algorithm 1.3.3 Deterministic Algorithms for Computing Hit RateCurvesWe are interested in computing the value of C at ` uniformly-spaced points, so webegin with Algorithm 2 which computes those values. It can be implemented inO(m logn) time and O(n) space using the hash table plus tree approach of Bennettand Kruskal [5].9Algorithm 1: An implementation of a working set data structure based onabstract counters.1 c← 12 Function Register(t,bt):3 c← dt/∆e4 if t ≡ 1 (mod ∆) then5 Create the new counter K [c]6 for i = 1, . . . ,c do7 K [i].INSERT(bt)8 Function GetWorkingSetSize(t):9 Return K [dt/∆e].QUERY()Algorithm 2: Algorithm for computing the hit rate curve at ` = n/∆uniformly-spaced points.1 Input: A sequence of requests (b1, . . . ,bm) ∈ [n]m2 Initialize the vector H ∈ N` with zeros3 for t = 1, . . . ,m do4 If D(t) is finite then increment H[dD(t)/∆e] by 15 B H[i] satisfies condition (3.1).6 Output the hit rate curve values C(x∆) = ∑xi=1 H[i]/m for x ∈ [`].Next we present Algorithm 3, which is an algorithm to approximate a hit ratecurve using a working set data structure. To facilitate compact implementations ofthat data structure, the algorithm only queries the working set size at specific timesτi = (i− 1)∆+ 1 for i = 1,2, . . .. This algorithm also allows the data structureto decline to return an estimate, in which case it returns NULL instead. Considerimplementing the working set data structure using exact counters, which computethe number of distinct elements exactly, e.g., using a hash table. Then line 9 inAlgorithm 3 will haveXi(t) = |B(τi, t)| ∀i, t. (3.2)Let us now compare the accuracy of Algorithms 2 and 3 when exact counters areused.10Algorithm 3: An algorithm for approximating the hit rate curve at `uniformly-spaced points, given an implementation W of a working set datastructure.1 Input: A sequence of requests (b1, . . . ,bm) ∈ [n]m2 Initialize the vector H ∈ N` with zeros3 B For convenience, let τi denote (i−1)∆+14 for t = 1, . . . ,m do5 B Receive request bt6 W .REGISTER(t,bt)7 Let c← dt/∆e8 for i = 1, . . . ,c do9 Let Xi(t +1)←W .GETWORKINGSETSIZE(τi)10 for i = 1, . . . ,c−1 do11 if Xi(t +1) 6= NULL and Xi+1(t +1) 6= NULL then12 Increment H[dXi(t)/∆e] by(Xi+1(t+1)−Xi+1(t))−(Xi(t+1)−Xi(t))13 Increment H[dXc(t)/∆e] by 1−(Xc(t+1)−Xc(t))14 Output the hit rate curve approximation given by C(x∆) = ∑xi=1 H[i]/m forx ∈ [`].Claim 1. Let C be the hit rate curve computed by Algorithm 2. Let Cˆ be the hitrate curve computed by Algorithm 3, using Algorithm 1 with exact counters toimplement W . ThenC((x−1)∆)≤ Cˆ(x∆) ≤ C(x∆) ∀x ∈ [`].Proof. Let H and Hˆ respectively denote the histograms computed by Algorithms2 and 3. Note that bt 6∈ B(t ′, t) for t ′ > R(t) but bt ∈ B(t ′, t) for t ′ ≤ R(t). Becauseof (3.2), we haveXi(t+1)−Xi(t) =1 (if R(t)< τi ≤ t)0 (if 1≤ τi ≤ R(t))It follows that the increment in line 12 equals 1 if τi ≤ R(t) < τi+1 and otherwiseit equals zero. Similarly, the increment in line 13 equals 1 if τc ≤ R(t). At most11one of these conditions can hold, so for each value of t, Algorithm 3 incrementsat most one entry of Hˆ. Specifically, if R(t) is finite then the algorithm incrementsHˆ[dXi∗(t)/∆e] where i∗ = dR(t)/∆e.When R(t) is finite, we have τi∗ ≤ R(t) < τi∗+1. Since Xi∗(t) = |B(τi∗ , t)| andD(t) = |B(R(t), t)|, we deriveXi∗+1(t)≤ D(t)≤ Xi∗(t). (3.3)We also haveXi∗(t)−Xi∗+1(t) = |B(τi∗ , t)|− |B(τi∗+1, t)|= |B(τi∗ , t)\B(τi∗+1, t)| ≤ |B(τi∗ ,τi∗+1)| ≤ ∆. 
Claim 1. Let C be the hit rate curve computed by Algorithm 2. Let Ĉ be the hit rate curve computed by Algorithm 3, using Algorithm 1 with exact counters to implement W. Then

    C((x−1)∆) ≤ Ĉ(x∆) ≤ C(x∆)  for all x ∈ [ℓ].

Proof. Let H and Ĥ respectively denote the histograms computed by Algorithms 2 and 3. Note that b_t ∉ B(t', t) for t' > R(t) but b_t ∈ B(t', t) for t' ≤ R(t). Because of (3.2), we have

    X_i(t+1) − X_i(t) = 1 if R(t) < τ_i ≤ t, and 0 if 1 ≤ τ_i ≤ R(t).

It follows that the increment in line 12 equals 1 if τ_i ≤ R(t) < τ_{i+1}, and otherwise it equals zero. Similarly, the increment in line 13 equals 1 if τ_c ≤ R(t). At most one of these conditions can hold, so for each value of t, Algorithm 3 increments at most one entry of Ĥ. Specifically, if R(t) is finite then the algorithm increments Ĥ[⌈X_{i*}(t)/∆⌉] where i* = ⌈R(t)/∆⌉.

When R(t) is finite, we have τ_{i*} ≤ R(t) < τ_{i*+1}. Since X_{i*}(t) = |B(τ_{i*}, t)| and D(t) = |B(R(t), t)|, we derive

    X_{i*+1}(t) ≤ D(t) ≤ X_{i*}(t).    (3.3)

We also have

    X_{i*}(t) − X_{i*+1}(t) = |B(τ_{i*}, t)| − |B(τ_{i*+1}, t)| = |B(τ_{i*}, t) \ B(τ_{i*+1}, t)| ≤ |B(τ_{i*}, τ_{i*+1})| ≤ ∆.    (3.4)

So by (3.3) and (3.4) we have

    ⌈X_{i*}(t)/∆⌉ − 1 ≤ ⌈D(t)/∆⌉ ≤ ⌈X_{i*}(t)/∆⌉.

Algorithm 2 increments H[⌈D(t)/∆⌉], whereas Algorithm 3 increments Ĥ[⌈X_{i*}(t)/∆⌉]. So

    Σ_{i=1}^{x} Ĥ[i]  (= Ĉ(x∆))  ≤  Σ_{i=1}^{x} H[i]  (= C(x∆))  ≤  Σ_{i=1}^{x+1} Ĥ[i]  (= Ĉ((x+1)∆)).

Rearranging this yields the desired inequality. □

3.4 A Streaming Algorithm for Approximating Hit Rate Curves

In this section we design an improved working set data structure using ideas from streaming algorithms. The working set data structure used in Claim 1 is not efficient because it uses exact counters. Our main idea is to use, as a black box, a streaming algorithm for estimating distinct elements, which we will call a probabilistic counter.

Each probabilistic counter has two parameters, n and α. The INSERT operation takes a value x ∈ [n], and the QUERY operation reports a value v satisfying

    |S| ≤ v ≤ (1+α)|S|.    (3.5)

An optimal probabilistic counter was developed by Kane et al. [21]. It ensures that (3.5) holds with high probability for poly(m) queries, and each instantiation uses only O((1/α² + log n) log m) bits of space. In practice, the HyperLogLog counter [17] is very simple and has excellent empirical performance.

We assume three simple consistency properties of a counter.

P1: Two consecutive calls to QUERY (without any intervening insertions) return the same value.

P2: Reinserting an item that was previously inserted does not change the value of QUERY.

P3: The values returned by QUERY do not decrease as more elements are inserted.
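Properties P1 and P2 hold automatically for the hash-based sketches we use, but a raw sketch's estimate can in principle fluctuate downward between queries. A thin wrapper that reports the running maximum enforces P3, and since the true count only grows, the wrapped value still satisfies (3.5) whenever each underlying query does (a sketch; `base` is any counter object with insert/query):

    class MonotoneCounter:
        """Wrap a probabilistic counter so that QUERY never decreases (P3)."""

        def __init__(self, base):
            self.base = base
            self.best = 0

        def insert(self, block):
            self.base.insert(block)

        def query(self):
            # The true count is nondecreasing, so the running maximum stays
            # within the same [|S|, (1+alpha)|S|] window as the raw estimates.
            self.best = max(self.best, self.base.query())
            return self.best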
3.4.1 Implementing Algorithm 3 using probabilistic counters

Now we consider the working set data structure of Algorithm 1 implemented using the optimal probabilistic counter of Kane et al. [21] with accuracy parameter α = ε∆/n = ε/ℓ. We will analyze Algorithm 3 with this working set data structure. The data structure will create ⌈m/∆⌉ counters, each of which uses s = O((1/α² + log n) log m) bits of space. So the total space usage is O(ms/∆) bits. In Section 3.4.2 we will modify the data structure to "prune" redundant counters, which reduces the space to O(ℓs/ε) bits.

The following theorem compares the accuracy guarantee of Algorithm 3 using Algorithm 1 with either exact or probabilistic counters.

Theorem 2. Let C and Ĉ respectively refer to the hit rate curves produced using exact and probabilistic counters. Then

    C((x−1)∆) − 2αx ≤ Ĉ(x∆) ≤ C(x∆) + 2αx  for all x ∈ [ℓ+1].    (3.6)

Furthermore, Ĉ satisfies the (Strong-Guarantee) condition of Section 3.1.1 with respect to the true hit rate curve.

Proof. Let H and X_i refer to the quantities using exact counters, and let Ĥ and X̂_i refer to the corresponding quantities using probabilistic counters. We require the following claim:

Claim 3. For any times a ≤ b and any index i, we have X_i(a) − X_{i+1}(a) ≥ X_i(b) − X_{i+1}(b).

Proof of claim. Recall that X_i(t) = |B(τ_i, t)|. As τ_i < τ_{i+1}, we get X_i(t) − X_{i+1}(t) = |B(τ_i, τ_{i+1}) \ B(τ_{i+1}, t)|. Thus

    X_i(a) − X_{i+1}(a) − (X_i(b) − X_{i+1}(b)) = |B(τ_i, τ_{i+1}) \ B(τ_{i+1}, a)| − |B(τ_i, τ_{i+1}) \ B(τ_{i+1}, b)| ≥ 0,

as B(τ_{i+1}, a) ⊂ B(τ_{i+1}, b). □

The histogram H and the hit rate curve C are computed by the increment operation of the ith iteration of the loop on line 10 of Algorithm 3 while processing the request at time t, for each valid pair i, t. For simplicity, we consider line 13 to be the final iteration of this loop. This increment operation involves only X_i and X_{i+1}. The same is true of Ĥ and Ĉ, using instead the pair X̂_i and X̂_{i+1}. So, to prove (3.6), we will show that the contribution from the pair X̂_i and X̂_{i+1} to Ĉ approximately equals the contribution from the pair X_i and X_{i+1} to C.

Contribution to C. Fix any x ∈ [ℓ] and recall that C(x∆) = Σ_{j=1}^{x} H[j]/m. By considering lines 4, 12, and 13 of Algorithm 3 we see that the pair X_i and X_{i+1} can only contribute to C(x∆) while t ≤ m and ⌈X_i(t)/∆⌉ ≤ x. So, let T_i be the time after the last contribution of X_i and X_{i+1} to C(x∆), i.e.,

    T_i = min({ t : X_i(t) > x∆ } ∪ {m+1}).

At all times t ≥ T_i, the pair X_i and X_{i+1} does not contribute to C(x).

For the time being, let us assume that τ_{i+1} ≤ m; that is, an (i+1)th counter is created during trace processing. The contribution to m·C(x∆) from the pair X_i and X_{i+1} is

    1 − X_i(t+1) + X_i(t)  for each time t ∈ {τ_i, …, τ_{i+1}−1}, and
    X_{i+1}(t+1) − X_{i+1}(t) − X_i(t+1) + X_i(t)  for each time t ∈ {τ_{i+1}, …, T_i−1}.

Summing up, the total contribution is

    Σ_{τ_i ≤ t < τ_{i+1}} (1 − X_i(t+1) + X_i(t)) + Σ_{τ_{i+1} ≤ t < T_i} (X_{i+1}(t+1) − X_{i+1}(t) − X_i(t+1) + X_i(t))
      = ∆ − X_i(τ_{i+1}) + X_i(τ_i) + X_i(τ_{i+1}) − X_{i+1}(τ_{i+1}) + X_{i+1}(T_i) − X_i(T_i)
      = ∆ + X_{i+1}(T_i) − X_i(T_i).    (3.7)

Contribution to Ĉ. Similarly, let T̂_i = min({ t : X̂_i(t) > x∆ } ∪ {m+1}). Then at all times t ≥ T̂_i, the pair X̂_i and X̂_{i+1} does not contribute to Ĉ(x). (This assertion uses property P3 of the counters.) Summing up as before, the total contribution of the pair X̂_i and X̂_{i+1} to m·Ĉ(x∆) is

    ∆ + X̂_{i+1}(T̂_i) − X̂_i(T̂_i).    (3.8)

Upper bound on contribution to Ĉ(x∆). The difference between the contribution of X̂_i and X̂_{i+1} to m·Ĉ(x∆) and the contribution of X_i and X_{i+1} to m·C(x∆) is the difference between (3.8) and (3.7), namely

    X̂_{i+1}(T̂_i) − X̂_i(T̂_i) − X_{i+1}(T_i) + X_i(T_i).    (3.9)

We now upper bound this quantity. First note that T̂_i ≤ T_i, by (3.5). Then Claim 3 shows that (3.9) is at most

    X̂_{i+1}(T̂_i) − X̂_i(T̂_i) − X_{i+1}(T̂_i) + X_i(T̂_i)
      ≤ αX_{i+1}(T̂_i)    (by (3.5))
      ≤ αX_i(T̂_i)    (by definition of X_i and X_{i+1})
      ≤ α(x∆+1)    (since T̂_i ≤ T_i and by definition of T_i).    (3.10)

Lower bound on contribution to Ĉ(x∆). For the lower bound, we must consider the contribution of X_i and X_{i+1} to C((x−1)∆). Define

    T'_i = min({ t : X_i(t) > (x−1)∆ } ∪ {m+1}).

Arguing as before, we find that the total contribution of this pair to m·C((x−1)∆) is

    ∆ + X_{i+1}(T'_i) − X_i(T'_i).    (3.11)

The difference between (3.8) and (3.11) is

    X̂_{i+1}(T̂_i) − X̂_i(T̂_i) − X_{i+1}(T'_i) + X_i(T'_i).    (3.12)

Claim 4. T'_i ≤ T̂_i.

Proof of claim. By definition of T'_i, we have X_i(T'_i) ≤ (x−1)∆ + 1. By definition of α, we have αn = ε∆ < ∆. So, by (3.5),

    X̂_i(T'_i) ≤ (1+α)X_i(T'_i) ≤ (1+α)((x−1)∆+1) ≤ (x−1)∆ + 1 + αn < x∆ + 1 ≤ X̂_i(T̂_i).

By P3, the claim is proven. □

Applying Claim 4, we may use Claim 3 to show that (3.12) is at least

    X̂_{i+1}(T̂_i) − X̂_i(T̂_i) − X_{i+1}(T̂_i) + X_i(T̂_i)
      ≥ −αX_i(T̂_i)    (by (3.5))
      ≥ −α(x∆+1)    (since T̂_i ≤ T_i and by definition of T_i).    (3.13)

The last counter. Here we consider the special case where m < τ_{i+1}. In this case, the contribution of X_i to C(x) is

    Σ_{τ_i ≤ t ≤ m} (1 − X_i(t+1) + X_i(t)) = m − τ_i + 1 − X_i(m+1).    (3.14)

Similarly, the contribution of X̂_i to Ĉ(x) is

    m − τ_i + 1 − X̂_i(m+1).    (3.15)

The bound of α(x∆+1) on the absolute value of the difference of (3.15) and (3.14) follows immediately from (3.5); indeed, we get the bound α∆. Note that T_i = T'_i = m+1, as X_i(t) ≤ ∆ for all t ∈ [τ_i, m]. Thus the contribution of X_i to C((x−1)∆) is equal to the expression in (3.14), and the lower bound follows as well.

Proof of (3.6). We now combine our previous observations to establish (3.6). Recall that c = ⌈m/∆⌉ is the total number of counters. Summing (3.10) over all i, we obtain that

    mĈ(x∆) ≤ mC(x∆) + αc(x∆+1) ≤ mC(x∆) + 2αmx.

This proves the second inequality of (3.6). The first inequality of (3.6) follows analogously from (3.13).
Combining(3.6), Claim 1 and the definition of α yieldsC∗((x−2)∆)−2εx/` ≤ Cˆ(x∆) ≤ C∗(x∆)+2εx/`.Applying this bound with ∆/2 instead of ∆, and ε/2 instead of ε establishes(Strong-Guarantee).3.4.2 Improved space by pruning redundant countersAs requests are processed, adjacent counters may converge to the same value.When this happens, it is not necessary to keep both counters. In this section we usethe working set data structure of Algorithm 1 with probabilistic counters, but wedelete redundant counters. The new algorithm is shown in Algorithm 4.Claim 5. The number of active counters at any point in time is O(`/ε).Proof. Due to lines 8 and 9, every i ∈ A satisfies either Xi−1(t)−Xi(t) > 2ε∆ orXi(t)−Xi+1(t)> 2ε∆. In either case, we must have|B(τi−1, t)|− |B(τi+1, t)|> ε∆ ∀i 6∈ {1,c} , (3.16)since, by (3.5) and the definition of α ,|B(τ j, t)| ≤ X j(t) ≤ (1+α)|B(τ j, t)| ≤ |B(τ j, t)|+ ε∆for every j ∈ A and every t. Summing (3.16) over i we obtainε∆(|A|−2)≤ ∑i∈A\{1,c}(|B(τi−1, t)|− |B(τi+1, t)|)≤ 2c−1∑i=1(|B(τi, t)|− |B(τi+1, t)|)≤ 2n.We conclude that |A| ≤ 2+2n/ε∆= O(`/ε).Corollary 6. The space complexity of Algorithm 3 implemented using Algorithm 4is O(`(`2/ε2 + logn) log(m)/ε)bits.18Algorithm 4: An implementation of a working set data structure incorporat-ing pruning. A is the set of active counters.1 A← /02 Function Register(t,bt):3 c← dt/∆e4 if t ≡ 1 (mod ∆) then5 Create the new counter K [c]6 A← A∪{c}7 for i ∈ A\{1,c} do8 if((i−1 6∈ A)∨ (Xi−1(t)−Xi(t)≤ 2ε∆))∧((i+1 6∈ A)∨ (Xi(t)−Xi+1(t)≤ 2ε∆))then9 Delete K [i] and set A← A\{i}10 for i ∈ A do11 K [i].INSERT(bt)12 Function GetWorkingSetSize(t):13 Let i = dt/∆e14 if i ∈ A then15 Return K [i].QUERY()16 else17 Return NULLProof. Using the optimal probabilistic counter [21] with the parameter α = ε/`each counter uses space s = O((`2/ε2 + logn) logm). The space requirement forthe histogram used by Algorithm 3 is O(` logn). By Claim 5, the total space usageis O((`/ε) · s), as required.The next theorem compares the accuracy of Algorithm 3 with two differentworking set data structures: either Algorithm 1 or Algorithm 4. In both cases weuse probabilistic counters.Theorem 7. Let C and Cˆ be respectively the hit rate curve produced from Al-gorithm 3 using Algorithm 1 or Algorithm 4 to implement the working set data19structure. Then|C(x∆)−Cˆ(x∆)| ≤ 4ε ∀x ∈ [`+1].Consequently, Cˆ satisfies the (Weak-Guarantee) condition on page 8 with respectto the true hit rate curve.Proof. Let H and Xi refer to the quantities computed using Algorithm 1. Let Hˆ andXˆi refer to the corresponding quantities computed using Algorithm 4. We will as-sume that both algorithms are furnished with the same random bits. Consequently,Xi(t) = Xˆi(t) whenever Xˆi(t) 6= NULL ∀i ∈ [c], t ∈ [m]. (3.17)As in the proof of Theorem 2, we fix x ∈ [`+ 1] and i ∈ [c] and compare thecontribution from the pair of counters Xi and Xi+1 to C(x∆) and the contributionfrom the pair of counters Xˆi and Xˆi+1 to Cˆ(x∆).Let Ti = min({ t : Xi(t)> x∆ }∪{m+1}). As in the proof of Theorem 2, thepair of counters Xi and Xi+1 cannot contribute to C(x∆) at any time t ≥ Ti.Let Tˆi be the first time t at which Xˆi(t)> x∆, {i, i+1}* A, or t ≥m+1. Then,by considering lines 11-13 of Algorithm 3, we see that the pair of counters Xˆi andXˆi+1 cannot contribute to Cˆ(x∆) at any time t ≥ Tˆi. In the case that Xˆi(t) > x∆this follows from property P3 of Xˆi, and in the case that {i, i+1}* A this followsbecause of line 17 of Algorithm 4.We first assume τi+1 ≤ m. 
Following the argument of (3.7) in the proof ofTheorem 2, the difference in contributions to m ·Cˆ(x∆) and m ·C(x∆) isXˆi+1(Tˆi)− Xˆi(Tˆi)−Xi+1(Ti)+Xi(Ti). (3.18)Case 1. Suppose that Tˆi is such that dXˆi(Tˆi)/∆e > x or Tˆi = m + 1. Then wenecessarily have Tˆi = Ti due to (3.17). In this case (3.18) is clearly zero.Case 2. Suppose that {i, i+1}* A in iteration Tˆi. In this case we might not haveXˆi(Tˆi)> x∆, but we do haveXˆi(Tˆi)− Xˆi+1(Tˆi) ≤ 2ε∆, (3.19)20by line 8 of Algorithm 4. Observe that Tˆi ≤ Ti.Let Yi(t) denote |B(τi, t)|. Since Yi(t)≤ n and α = ε∆/n, it follows from (3.5)that0 ≤ Xi(t)−Yi(t) ≤ ε∆ ∀i ∈ [c], t ∈ [m]0 ≤ Xˆi(t)−Yi(t) ≤ ε∆ ∀i ∈ [c], t ≤ Tˆi.(3.20)The first step is to upper bound (3.18).Xˆi+1(Tˆi)− Xˆi(Tˆi)−Xi+1(Ti)+Xi(Ti)≤ Yi+1(Tˆi)−Yi(Tˆi)−Yi+1(Ti)+Yi(Ti)+2ε∆ (by (3.20))≤ 2ε∆,by Claim 3, since Tˆi ≤ Ti. Next we lower bound (3.18).Xˆi+1(Tˆi)− Xˆi(Tˆi)−Xi+1(Ti)+Xi(Ti)≥ Xˆi+1(Tˆi)− Xˆi(Tˆi)−Yi+1(Ti)+Yi(Ti)− ε∆ (by (3.20))≥ Xˆi+1(Tˆi)− Xˆi(Tˆi)− ε∆ (by definition of Y )≥ −3ε∆by (3.19). It follows that (3.18) is at most 3ε∆ in absolute value.The last counter: Here we examine the special case where τi+1 > m. As τi+1 =i∆+ 1, at every time t ∈ [τi,m], c = i. Thus the for loop of 7 cannot prune the ithcounter. So Tˆi = Ti, and the difference between the contributions of counters iszero.Summing up the contribution to m ·C(x∆) and m ·Cˆ(x∆) from all pairs of coun-ters, we obtain that|m ·C(x∆)−m ·Cˆ(x∆)| ≤ c ·3ε∆ = dm/∆e ·3ε∆ ≤ 4εm.This proves the theorem.21Chapter 4A Unified RepresentationIn the past chapter, we used probabilistic counters as a black box. Here, by exam-ining the implementation of probabilistic counters, we will derive a new algorithmthat makes different space tradeoffs. In order to explain this algorithm, we willdescribe some work from the streaming algorithms community on the distinct ele-ments problem.4.1 Distinct Elements and Probabilistic CountersThe distinct elements problem, referred to as DISTINCT-ELEMENTS, is the prob-lem of estimating the number of distinct elements in a stream of tokens. There aretwo parameters: ε is the desired accuracy of the estimate, and δ is the failure prob-ability. Formally, an algorithm solves DISTINCT-ELEMENTS with parameters εand δ if given a stream S, the algorithm outputs an estimate d such thatPr((1− ε)|S| ≤ d ≤ (1+ ε)|S|)≥ 1−δ .Using O(|S|) space, it is possible to solve DISTINCT-ELEMENTS deterministi-cally with ε = δ = 0. However, it is provably impossible to use sublinear spacewith ε = 0 or δ = 0.Probabilistic counters are randomized algorithms which use only sublinearspace to solve DISTINCT-ELEMENTS. Many probabilistic counters [2, 3, 17, 21]rely on a {0,1}-matrix M that is updated while processing each item in the stream.22Each item b in the stream is hashed to a binary string σ , and then M is updatedbased on lsb(σ), the number of trailing zeros in σ . We will call such a matrix M abitmatrix.The simplest counter [2] uses a single hash function h, and the matrix M hasa single column. To process a new stream element b, the algorithm computesz = lsb(h(b)). For each i ≤ z, Mi is set to 1. After the stream is processed, thealgorithm outputs 2 j∗, where j∗ is the index of the greatest non-zero row. 
Thisalgorithm produces an O(1) estimate of |S| with failure probability ≤√2/3.Other algorithms refine this estimate by using additional columns and anotherhash function g, which determines which column to update.The estimate could be, for example, a function of the average of the lowestnon-zero value in each column [14, 17], or the number of non-zero cells below acertain row (Algorithm 3 in [3]).Many distinct elements algorithms have a fixed failure probability. To reducethe failure probability, a standard method called the median trick is used. Supposea probabilistic counter solves the distinct elements problem with parameters ε,δ ,where δ < 1/2. Now run k independent instantiations of the algorithm on thestream S, and output the median estimate. If the median is greater than (1+ ε)|S|,then the estimate of k/2 counters exceeded (1+ ε)|S|. By a Chernoff bound, thisoccurs with probability 2Ω(−k). Similarly, we obtain a 2Ω(−k) probability for theevent that the median is less than (1− ε)|S| (further details can be found in [10]and [18]). We will make use of this trick below.4.2 Counter PackingIf we imagine that Algorithm 1 uses such a counter, and that all counters use thesame hash functions h and g, then we see that there is a great deal of redundantstate. For example, at time step t we apply the hash functions to the block bt , thenperform the appropriate update on Mc, the bitmatrix corresponding to the mostrecently created counter. We also loop over all the older counters, and updatetheir bitmatrices as well. However, if Mci,z = 1, it follows that Mji′,z = 1 for every0≤ i′≤ i and 1≤ j≤ c, as older counters will have certainly undergone any updatesthat newer counters have. This observation leads to the following idea: instead of23storing the bitmatrices for all counters separately, we can store a single unifiedmatrix from which all bitmatrices can be computed.We will keep a single matrix Q to represent a sequence of counters, wherethe kth counter was started at time (k− 1)∆+ 1 in the trace. We will maintain Qsuch that at any time t during stream processing, Qi, j = r means that the bitmatrixof the rth counter has a 1 at position i, j, and for any r′ > r, the bitmatrix of ther′th counter is 0 at position i, j. By the observation made above, it follows thatfor any r′ < r, the r′th bitmatrix has a 1 in position i, j. To extract the bitmatrixcorresponding to the r′th counter, Mr′, we examine each entry of Q. Let Qi, j = r.If r < r′, then Mr′i, j = 0. Otherwise, Mr′i, j = 1. The pseudocode for the algorithm isgiven in Algorithm 5.Algorithm 5: An implementation of a working set data structure based ona unified counter representation. Parameterized by a randomized counteralgorithm A . The set Q has many independent copies of the hash functionsand the resulting table. We need only |Q|= O(log(1/δ )) with δ = m−3.Data: A collection Q of pairs (Q,H ), where Q is a matrix, H is a set ofhash functionsA randomized counter algorithm A1 c← 12 Function Register(t,bt):3 c← dt/∆e4 for (Q,H ) ∈Q do5 Update Q using bt according to A6 Function GetWorkingSetSize(t’):7 for Q ∈Q do8 Let r = dt ′/∆e9 Define the bitmatrix Mr by Mri, j ={1 if Qi, j− r ≥ 00 otherwise10 Feed Mr into counter algorithm A to obtain estimate RQ11 Return the median of estimates RQ244.3 AnalysisIn order to analyze Algorithm 5, we must specify a concrete randomized counter al-gorithmA . We will use Algorithm 2 from the paper of Bar-Yossef et al. [3]. In thisalgorithm, each matrix Q has log(n) rows, and k = O(1/α2) columns. 
Each collec-tion H consists of k t-wise independent hash functions, where t is O(log(1/α)).To update Q, for j ∈ [k], we set Qi, j = c if lsb(h j(bt))≥ i. Given the bitmatrix M,this randomized counter algorithm can produce its estimate.Claim 8. The space requirement of Algorithm 5 is O(`2 log(n) log(m) log(1/δ )/ε2).Proof. Each Q has O(1/α2) columns, logn rows, and each cell requires logmspace. Thus Q requires O(log(n) log(m)/α2)space. Each collection H requiresO(log2(1/α) logn)space, which is negligible. We have O(log(1/δ )) such pairs(Q,H ). Thus the total space requirement is O(α−2 log(n) log(m) log(1/δ )). Sub-stituting α = ε/` completes the proof.Theorem 9. Algorithm 3 using Algorithm 5 as its working set data structure sat-isfies (Strong-Guarantee).Proof. Let Xi(t) be the result of GETWORKINGSETSIZE(τi) at time t, which isan estimate of |B(τi, t)|. It suffices to show that Xi(t) ∈ [|B(τi, t)|,(1+α)|B(τi, t)|]with high probability, in which case the argument of Theorem 2 applies.For any i, t, by taking the median of O(log(1/δ )) estimates of A , we havePr[|Xi(t)−|B(τi, t)||>α|B(τi, t)|]≤ δ . At time t, the algorithm computes estimatesfor t/∆ counters, and thus O(m2) estimates are taken in total. By a union bound,the probability that any estimate is poor is ≤ δm2.There is one subtlety: Theorem 2 assumed counters with only one-sided er-ror, while A provides counters with two-sided error. If we take α ′ = α2+α as theaccuracy of A and divide all estimates by (1−α ′), we recover one-sided α ap-proximations.25Chapter 5Lower BoundsIn this section we prove lower bounds on the space needed by one-pass algorithmsto compute approximate hit rate curves. Formally, let HRCn,m,ε,` be the computa-tional problem in which, given an input sequence in [n]m, one must compute a func-tion Cˆ satisfying (Weak-Guarantee), where ∆= bn/`c. Similarly, let HRC′n,m,ε,` bethe analogous problem in which Cˆ must satisfy (Strong-Guarantee).While HRCn,m,ε,` is perhaps a more natural formulation of the hit rate curveproblem, Theorems 2 and 9 show that two of our algorithms actually solve HRC′n,m,ε,`.It is useful to analyze the fundamental complexity of both HRCn,m,ε,` and HRC′n,m,ε,`in order to understand the limits of our current algorithms and to design improvedones.5.1 Communication ComplexityIn order to prove our lower bounds, we will use communication complexity. Thebasic model of communication complexity involves two players, Alice and Bob,who are tasked with computing f (x,y), where x is given to Alice and y to Bob. Thechallenge is to devise a communication protocol such that Alice and Bob exchangethe least number of bits to compute f (x,y).A k-round protocol is a communication protocol in which k messages are sentbetween Alice and Bob. For example, in a 1-round protocol, Bob can computef (x,y) after receiving a single message from Alice. A randomized protocol fur-26nishes Alice and Bob with random bits. In a private coin protocol, Alice and Bobeach have access to their own random bits and cannot see the other’s random bits.In a public coin protocol, Alice and Bob have access to the same random bits. Fora detailed introduction to communication complexity, consult [22].In the following, we will construct k-round, public coin protocols.5.2 Gap Hamming DistanceOur lower bounds are based on reductions from the Gap Hamming Distance (GHD)problem. 
5.2 Gap Hamming Distance

Our lower bounds are based on reductions from the Gap Hamming Distance (GHD) problem. In GHD_{k,t,g}, Alice and Bob are respectively given vectors x, y ∈ {0,1}^k. They are required to determine whether the Hamming distance between x and y, denoted d(x,y), is ≤ t−g or > t+g, outputting 0 or 1 respectively.

The Gap Hamming Distance problem was first introduced by Indyk and Woodruff in [19]. Their goal was to obtain lower bounds on the space complexity of the Distinct Elements problem. They described a reduction from Gap Hamming Distance to Distinct Elements, and gave a lower bound on the communication complexity of Gap Hamming Distance. A subsequent line of work ([30], [20], [31], [7], [8]), culminating in the 2012 paper of Chakrabarti and Regev [9], settled the communication complexity of GHD_{n,n/2,√n} as Ω(n). The Chakrabarti and Regev paper [9] also contained the following generalization, which will be helpful:

Theorem 10 (Chakrabarti-Regev [9], Proposition 4.4). Any protocol that solves GHD_{k,k/2,g} with probability ≥ 2/3 communicates Ω(min{k, k²/g²}) bits.

Since GHD was originally introduced to obtain lower bounds for the distinct elements problem [19], and since hit rate curves fundamentally involve the number of distinct elements, it is not too surprising that reducing GHD to HRC and HRC' is useful.

5.3 HRC Lower Bounds

Theorem 11. The space complexity of HRC_{n,n,ε,ℓ} is Ω(min{n, 1/ε²}).

Proof. Let k = n/2 = m/2, and assume ℓ divides n. Given an instance of GHD_{k,k/2,2εk} with inputs x, y, as well as an algorithm which solves HRC_{n,n,ε,ℓ} using space s, consider the following protocol.

Alice constructs the set X = { j : x_j = 1 } ∪ { k+j : x_j = 0 }, then runs the algorithm for HRC_{n,n,ε,ℓ} on the stream of elements of X in arbitrary order. She passes the s bits of state to Bob. Bob constructs the set Y = { k+j : y_j = 1 } ∪ { j : y_j = 0 }, and using the state from Alice, continues the algorithm on the elements of Y. If Ĉ((ℓ+1)∆) ≤ 1/4, he outputs 0; otherwise he outputs 1.

By construction, d(x,y) = |X ∩ Y|, so 2k·C(ℓ∆) = d(x,y). (Recall that the hit rate curve is normalized by the length of the request sequence, which is m = 2k.) Since Ĉ satisfies (Weak-Guarantee), we have

    2k·C(ℓ∆) − 2kε ≤ 2k·Ĉ((ℓ+1)∆) ≤ 2k·C((ℓ+1)∆) + 2kε = 2k·C(ℓ∆) + 2kε.

Thus the protocol correctly distinguishes the cases d(x,y) ≤ k/2 − 2εk and d(x,y) > k/2 + 2εk. Applying Theorem 10, we conclude that s ∈ Ω(min{k, k²/(εk)²}) = Ω(min{n, 1/ε²}). □
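The identity d(x,y) = |X ∩ Y| underlying this protocol is easy to verify directly; a brute-force sanity check (for intuition only):

    import random

    def hamming_via_intersection(x, y):
        """Build Alice's set X and Bob's set Y as in the proof of Theorem 11;
        a full-size LRU cache run on the stream X then Y hits exactly on the
        second request to each block of X & Y, so |X & Y| = d(x, y)."""
        k = len(x)
        X = {j for j in range(k) if x[j]} | {k + j for j in range(k) if not x[j]}
        Y = {k + j for j in range(k) if y[j]} | {j for j in range(k) if not y[j]}
        return len(X & Y)

    x = [random.randint(0, 1) for _ in range(100)]
    y = [random.randint(0, 1) for _ in range(100)]
    assert hamming_via_intersection(x, y) == sum(a != b for a, b in zip(x, y))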
Since 2k ·C(`∆) = d(x,y) and Cˆsatisfies (Strong-Guarantee), we haved(x,y)−2kε/` ≤ 2k ·Cˆ(2∆) ≤ d(x,y)+2kε/`.Thus Bob can solve GHDk,k/2,2kε/` as before.By Theorem 10,(2`−1) ·maxjs j ≥2`−1∑j=1s j = Ω(min{k, `2/ε2}).So for some j, we have s j ∈Ω(min{n/`,`/ε2}).29Chapter 6Experimental ResultsBased on the ideas developed in Chapter 3, a prototype was implemented [29]. Wecompare this prototype against an implementation of the Mattson algorithm [23] ona collection of storage traces [24] released by Microsoft Research in Cambridge.Xiang et al., the authors of the average footprint paper [33], released an opensource implementation of their algorithm [12]. We include the results from aslightly modified version of this implementation in our figures, and discuss someof the strengths and weaknesses of their approach.6.1 Microsoft Research TracesThe MSR traces record the disk accesses of a collection of 13 different servers overa period of one week. Each trace is a list of records, where each record representsa disk access. Records have fields for the time stamp, server name, disk number,operation type (read or write), disk offset, size of data requested, and latency. Inorder to process these traces, we filtered out writes and expanded each remainingrecord into a list of distinct blocks touched by the request. We used a block size of4 KB.The smallest trace, wdev, has 52,489 distinct blocks and 725,194 requests,while the largest trace, prxy, has 523,879 distinct blocks and 358,267,307 re-quests. The proj trace touches the largest volume of data with 324,760,925 dis-tinct blocks, and contains 564,577,120 requests.30Traceshm mds prn proj prxy rsrch src1 src2 stg ts usr wdev webavgfp average 0.96 0.03 1.71 0.50 0.18 0.59 3.53 0.33 0.02 1.17 0.43 1.10 4.51max 4.43 0.57 6.97 4.29 27.91 9.04 44.14 26.46 0.15 15.52 15.56 14.17 35.68cs average 0.23 1.06 0.34 1.01 0.98 0.67 0.56 0.74 1.02 1.46 0.26 1.45 1.29max 9.71 4.68 8.46 2.54 37.50 16.45 14.55 26.51 1.70 25.41 13.36 22.33 13.91Table 6.1: Average and maximum error for avgfp and cs.In Figure 6.1, we plot the hit rate curves generated by three algorithms onthe MSR traces. The avgfp algorithm is an implementation of the average foot-print technique [33]. The authors released their implementation as open source ongithub [12]. The original implementation used a statically allocated array of size512 MB to map blocks to last access times during trace processing, an efficientapproach for CPU traces using relatively few distinct memory locations. Given thesize of the larger MSR traces, a statically allocated data structure is impossible touse on modern hardware for this purpose, so we altered avgfp to use a hash tableinstead. The cs algorithm is a protoype based on the ideas sketched in Chapter 3.Finally, mattson is a single-threaded implementation of the Mattson algorithmbundled with parda [25]. We used mattson to compute the true hit rate curves tocompare against the other algorithms.To compare the resource usage of the algorithms, we ran each on the traceobtained by merging all MSR traces and sorting by time stamp. The resultingtrace, called the master trace, is 22 GB, uncompressed. The avgfp algorithmprocessed master in 10 minutes using 22 GB of memory, mattson in 1 hourusing 92 GB, and cs in 12 minutes using 0.5 GB. 
These experiments were runon a Dell PowerEdge R720 with two six-core Intel Xeon processors and 96 GB ofRAM.We give quantitative errors for avgfp and cs on the MSR traces in Table 6.1.We measured the error of an algorithm by finding the deviation between its curveand the curve produced by mattson.Although Xiang et al. developed their technique with CPU traces in mind [33],their implementation does quite well on most of the MSR traces. Two exceptionsare src1, with 44.1 maximum and 3.5 average error, and web, with 35.7 maxi-mum and 4.5 average. In comparison, cs has 14.6 maximum and 0.6 average error310.0 0.5 1.0 1.5 2.0020406080100hm0 10 20 30 40 50 60 70 80mds0 10 20 30 40 50 60 70prn0 200 400 600 800 10001200020406080100proj0.0 0.5 1.0 1.5 2.0prxy0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7rsrch0 50 100 150 200020406080100src10 5 10 15 20 25 30 35src20 10 20 30 40 50 60 70 80stg0.0 0.1 0.2 0.3 0.4 0.5020406080100ts0 200 400 600 800 1000usr0.00 0.05 0.10 0.15 0.20wdev0 10 20 30 40 50 60 70020406080100webavgfpcsmattsonFigure 6.1: Hit rate curves for MSR traces. Cache size in GBs against hit ratepercentage.32on src1, and 13.9 maximum and 1.3 average error on web. While both algorithmsare less accurate at approximating volatile hit rate curves, avgfp exhibits highererror.As the staircase plot suggests, web is characterized by multiple working setsof markedly different sizes in different phases of the trace. Such traces can causeproblems for the average footprint technique. We investigate this further in the nextsection.6.2 Average Footprints and Phase ChangesFor each window length w, the average footprint technique computes the averageworking set size across all windows of length w in the trace. If the trace is markedby distinct phases in which working set size varies drastically, the average workingset size may skew the reported stack distances.To validate this suspicion, we generated an artificial trace, cyclic. cyclicis composed of two distinct phases. In the first phase, the first 104 blocks are readin order. This sequential scan is repeated 1000 times. The second phase consists of105 repeated scans of the first 100 blocks. Both phases consist of 107 requests, fora total of 2 ·107 requests to 105 distinct blocks. The hit rate curves for cyclic ascomputed by avgfp, cs, and mattson are plotted in 6.2.As suspected, the average footprint technique has trouble with this trace. Forcache sizes between 0.020 GB and 0.035 GB, avgfp reports a hit rate of nearly100%, while cs and mattson both give about 50%. The average and maximumerror for avgfp are 24.6 and 50.0, while for cs they are 0.5 and 41.3. Because thehit rate changes so dramatically for a small change in cache size, large maximumerrors are unavoidable for both algorithms (recall Theorem 2 bounds the error forbin x in terms of bins x−1 and x).This trace is highly artificial and cannot be interpreted as a realistic model forsystem behaviour. However, we believe that it highlights an important limitationin the average footprint approach. Traces with highly pronounced phases do exist,and will cause inaccuracies with the average footprint method, as evidenced byweb. We fabricated this trace in order to examine the worst case for the averagefootprint algorithm.330.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040020406080100 avgfpcsmattsonFigure 6.2: Hit rate curves for cyclic trace. 
Cache size in GBs against hitrate percentage.34Chapter 7ConclusionIn this work, we introduced new algorithms for approximating hit rate curves whichdiffer dramatically from existing approaches. We characterized the space usage ofour algorithms, and provided lower bounds on the space complexity of any algo-rithm which approximates hit rate curves with given error bounds. We also vali-dated an implementation based on our algorithms on the MSR traces, and comparedour results to past work.Our focus was entirely on space efficiency. We did not characterize the timecomplexity of our algorithms, nor did we provide lower bounds on time complex-ity. A careful analysis of our algorithms’ time usage may result in more efficientalgorithms, and is a possible direction for future work.Our space lower bounds are not satisfactory. In particular, we were not able toobtain any dependence on ` for the lower bound of HRCn,m,ε,` in Theorem 11. Wesuspect neither Theorem 11 nor Theorem 12 gives tight bounds for their respectiveproblems. We believe a deeper investigation of the space complexity of the hitrate curve problem, besides finding tighter bounds, could result in algorithms withgreater space efficiency as well.35Bibliography[1] G. S. Alma´si, C. Cas¸caval, and D. A. Padua. Calculating stack distancesefficiently. In Proceedings of the 2002 workshop on memory systemperformance (MSP ’02), pages 37–43, 2002. → pages 2, 4[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity ofapproximating the frequency moments. In Proceedings of the twenty-eighthannual ACM symposium on Theory of computing, pages 20–29. ACM, 1996.→ pages 3, 22, 23[3] Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan.Counting distinct elements in a data stream. In Randomization andApproximation Techniques in Computer Science, pages 1–10. Springer,2002. → pages 3, 22, 23, 25[4] L. A. Belady. A study of replacement algorithms for a virtual-storagecomputer. IBM Systems Journal, 5(2):78–101, 1966. → pages 1[5] B. T. Bennett and V. J. Kruskal. LRU stack processing. IBM Journal ofResearch and Development, 19(4):353–357, 1975. → pages 2, 4, 9[6] E. Berg and E. Hagersten. Statcache: a probabilistic approach to efficientand accurate data locality analysis. In Performance Analysis of Systems andSoftware, 2004 IEEE International Symposium on-ISPASS, pages 20–27.IEEE, 2004. → pages 6[7] J. Brody and A. Chakrabarti. A multi-round communication lower bound forgap hamming and some consequences. In Computational Complexity, 2009.CCC’09. 24th Annual IEEE Conference on, pages 358–368. IEEE, 2009. →pages 27[8] J. Brody, A. Chakrabarti, O. Regev, T. Vidick, and R. De Wolf. Bettergap-hamming lower bounds via better round elimination. In Approximation,36Randomization, and Combinatorial Optimization. Algorithms andTechniques, pages 476–489. Springer, 2010. → pages 27[9] A. Chakrabarti and O. Regev. An optimal lower bound on thecommunication complexity of gap-hamming-distance. SIAM Journal onComputing, 41(5):1299–1317, 2012. → pages 27[10] A. Chakrabarti et al. Cs49: Data stream algorithms lecture notes.http://www.cs.dartmouth.edu/∼ac/Teach/CS49-Fall11/Notes/lecnotes.pdf,2012. Retrieved 2014-07-17. → pages 23[11] P. J. Denning. The working set model for program behavior.Communications of the ACM, 11(5):323–333, 1968. → pages 9[12] C. Ding. Program locality analysis tool. https://github.com/dcompiler/loca,2014. Retrieved 2014-07-03. → pages 30, 31[13] C. Ding and Y. Zhong. Predicting whole-program locality through reusedistance analysis. 
[14] M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Algorithms — ESA 2003, pages 605–617. Springer, 2003. → pages 3, 23

[15] D. Eklov and E. Hagersten. StatStack: Efficient modeling of LRU caches. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pages 55–65. IEEE, 2010. → pages 2, 6

[16] P. Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985. → pages 3

[17] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings, 0(1), 2008. → pages 3, 13, 22, 23

[18] N. J. A. Harvey. CPSC 536N: Randomized algorithms lecture notes. http://www.cs.ubc.ca/~nickhar/W12/, 2012. Retrieved 2014-08-16. → pages 23

[19] P. Indyk and D. Woodruff. Tight lower bounds for the distinct elements problem. In Foundations of Computer Science (FOCS), 44th Annual IEEE Symposium on, pages 283–288. IEEE, 2003. → pages 27

[20] T. Jayram, R. Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. Theory of Computing, 4(1):129–135, 2008. → pages 27

[21] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52. ACM, 2010. → pages 3, 13, 19, 22

[22] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, Cambridge, 1997. → pages 27

[23] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78–117, 1970. → pages 2, 4, 30

[24] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. ACM Transactions on Storage (TOS), 4(3):10, 2008. → pages 30

[25] Q. Niu, J. Dinan, Q. Lu, and P. Sadayappan. PARDA: A fast parallel reuse distance analysis algorithm. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 1284–1294. IEEE, 2012. → pages 2, 4, 5, 31

[26] F. Olken. Efficient methods for calculating the success function of fixed-space replacement policies. Technical report, Lawrence Berkeley Lab., CA (USA), 1981. → pages 2, 4

[27] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In POPL, pages 55–61. ACM, 2007. → pages 2, 5

[28] R. A. Sugumar and S. G. Abraham. Multi-configuration simulation algorithms for the evaluation of computer architecture designs. Ann Arbor, MI, 1993. → pages 2, 4, 5

[29] J. Wires, S. Ingram, N. J. A. Harvey, A. Warfield, and Z. Drudi. Characterizing storage workloads with counter stacks. In OSDI. USENIX, 2014. To appear. → pages 30

[30] D. Woodruff. Optimal space lower bounds for all frequency moments. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 167–175. Society for Industrial and Applied Mathematics, 2004. → pages 27

[31] D. P. Woodruff. The average-case complexity of counting distinct elements. In Proceedings of the 12th International Conference on Database Theory, pages 284–295. ACM, 2009. → pages 27

[32] X. Xiang, B. Bao, T. Bai, C. Ding, and T. Chilimbi. All-window profiling and composable models of cache sharing. ACM SIGPLAN Notices, 46(8):91–102, 2011. → pages 6

[33] X. Xiang, B. Bao, C. Ding, and Y. Gao. Linear-time modeling of program working set in shared cache. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 350–360. IEEE, 2011. → pages 6, 30, 31
[34] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proceedings of the 7th International Symposium on Memory Management, pages 91–100. ACM, 2008. → pages 2, 5
