Iceberg-cube Computation with PC Cluster (2001)

Iceberg-cube Computation with PC Cluster

by Yu Yin
B.Sc., Jilin University, 1993

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
April 2001
© Yu Yin, 2001

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

Iceberg queries constitute one of the most important classes of queries for OLAP applications. This thesis investigates the use of low-cost PC clusters to parallelize the computation of iceberg queries. We concentrate on techniques for querying large, high-dimensional data sets. Our exploration of the algorithmic space considers trade-offs between parallelism, computation, memory and I/O. The main contribution of this thesis is the development and evaluation of various novel parallel algorithms for CUBE computation and online aggregation. These include the following: one, the CUBE algorithm RP, which is a straightforward parallel version of BUC [BR99]; two, the CUBE algorithm BPP, which attempts to reduce I/O by outputting results in a more efficient way; three, the CUBE algorithms ASL and AHT, which maintain the cells of a cuboid in a skip list and a hash table respectively, and are designed to put the utmost priority on load balancing; four, alternatively, the CUBE algorithm PT, which load-balances by using binary partitioning to divide the cube lattice as evenly as possible; and five, the online aggregation algorithm POL, based on ASL and a sampling technique, which gives an instant initial response and then progressively refines it. We present a thorough performance evaluation of all these algorithms over a variety of parameters, including the dimensionality and the sparseness of the cube, the selectivity of the constraints, the number of processors, and the size of the data set. The key to understanding the CUBE algorithms is that one algorithm does not fit all. We recommend a "recipe" which uses PT as the default algorithm, but may also deploy ASL or AHT in appropriate circumstances. The online aggregation algorithm, POL, is especially suitable for computing a high-dimensional query over a large data set with a cluster of machines connected by high-speed networks.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
2 Review
2.1 Iceberg Query
2.2 CUBE Operator
2.3 Iceberg-cube Computation
2.4 Sequential CUBE Algorithms
2.4.1 Top-down CUBE Algorithms
2.4.2 Bottom-Up CUBE Algorithm
3 Parallel Iceberg-cube Algorithms
3.1 Algorithm RP
3.2 Algorithm BPP
3.2.1 Task Definition and Processor Assignment
3.2.2 Breadth-first Writing
3.3 Algorithm ASL
3.3.1 Using Skip Lists
3.3.2 Affinity Assignment
3.4 Algorithm PT
3.5 Hash-based Algorithms
3.5.1 Hash Tree Based Algorithm
3.5.2 Hash Table Based Algorithm
4 Experimental Evaluation
4.1 Memory Occupation
4.2 Experimental Environment
4.3 Load Distribution
4.4 Varying the Number of Processors
4.5 Varying the Problem Size
4.6 Varying the Number of Dimensions
4.7 Varying the Minimum Support
4.8 Varying the Sparseness of the Dataset
4.9 Summary
4.9.1 Recipe Recommended
4.9.2 Further Improvement
5 Online Aggregation
5.1 Selective Materialization
5.2 Online Aggregate from a Raw Data Set
5.3 Parallel Online Aggregation
5.3.1 Data Partitioning and Skip List Partitioning
5.3.2 Task Definition and Scheduling
5.4 Experimental Evaluation
5.4.1 Varying the Number of Processors
5.4.2 Varying the Buffer Size
6 Conclusion
Bibliography

List of Tables

1.1 Key Features of the Algorithms
2.1 Example Relation R
5.1 Task Array for 4 Processors

List of Figures

2.1 Iceberg Query
2.2 CUBE Operation on Relation SALES [8]
2.3 Cube in Multi-dimensional Array Format [8]
2.4 Lattice and Processing Trees for CUBE Computation [4]
2.5 An Example of a 4-Dimensional Lattice for Algorithm PipeSort [2]
2.6 An Example of Plan and Pipelines for Algorithm PipeSort [2]
2.7 PipeHash on a Four-Attribute Group-by [2]
2.8 Examples for PartitionedCube and MemoryCube Algorithms [14]
2.9 A Skeleton of BUC
2.10 BUC Partitioning
3.1 A Skeleton of the Replicated Parallel BUC Algorithm
3.2 Task Assignment in Algorithm RP
3.3 Task Assignment in BPP
3.4 Depth-first Writing vs Breadth-first Writing
3.5 A Skeleton of the BPP Algorithm
3.6 I/O Comparison between BPP (Breadth-first Writing) and RP (Depth-first Writing) on 9 Dimensions, on a Dataset with 176,631 Tuples (10 MB Input, 86 MB Output)
3.7 Pictorial Description of Steps Involved in Performing an Insertion [22]
3.8 A Skeleton of ASL
3.9 Binary Division of the Processing Tree into Four Tasks
3.10 A Skeleton of PT
3.11 Frequent Itemsets and Strong Rules for a Bookstore Database [20]
3.12 Subset Operation on the Root of a Candidate Hash Tree [23]
3.13 A Skeleton of AHT
4.1 Load Balancing on 8 Processors
4.2 Scalability
4.3 Results for Varying the Dataset Size
4.4 Results for Varying the Number of Cube Dimensions
4.5 Results for Varying the Minimum Support
4.6 Results for Varying the Sparseness of the Dataset
4.7 Recipe for Selecting the Best Algorithm
5.1 Task Assignment in POL
5.2 A Skeleton of the POL Algorithm
5.3 POL's Scalability with the Number of Processors
5.4 Scalability with Buffer Size

Acknowledgements

I would like to express my gratitude to my supervisor, Dr. Alan Wagner, and Professor Dr.
Raymond Ng, for helping me in carrying out this research project and for reading the manuscript of my thesis and offering their valuable comments. I also would like to thank Kirsty Barclay for reading the manuscript of my thesis and providing me with her helpful comments. Y u Y I N The University of British Columbia April 2001 ix Chapter 1 Introduction As computing and the Internet advance, we see a massive increase in the raw data available to institutions, corporations, and individuals. For example, large numbers of radiological images have been generated in hospitals and immense product and customer databases have been accumulated [1]. Extracting meaningful patterns and rules from such large data sets is therefore becoming more and more important. In this context, On-line Analytical Processing (OLAP) has emerged as a powerful tool for data analysis. In decision support systems, O L A P enables analysts and managers to obtain insight into data. By interactively posing complex queries, they can extract different views of data. In many O L A P applications, aggregation queries constitute a large percent- age of the computation. Many of these queries are only concerned with finding aggregate values above some specified threshold. We call this kind of query "iceberg queries". Query results consisting of above-threshold aggregate values are typically small compared to the total input data (the iceberg). Through Structured Query Language (SQL) aggregate functions and the G R O U P B Y operator, O L A P applications can easily produce aggregates for one group-by, however, most applications need aggregates for a set of group-bys in order to gain more insight into the data. This necessitates generalization of G R O U P B Y 1 operator. The C U B E operator, defined by Gray et al [8], provides this generaliza- tion. It computes aggregation for every possible combination of a set of specified attributes. For instance, if C U B E operator is applied on 2 attributes, A and B, then the aggregates from G R O U P B Y on all, G R O U P B Y on A , G R O U P B Y on B and G R O U P B Y on A B will be returned together. The computation introduced by the C U B E operator is huge, because for d specified attributes, 2d G R O U P BYs are computed. Furthermore, In each cuboid, there are also numerous cells, or partitions, computed. In C U B E terminology, output for an n-attributes G R O U P B Y is called an n-dimensional cuboid, also called an n-dimentional group-by. When the C U B E operator is employed to answer a set of iceberg queries, we call it an "iceberg-cube". In this thesis, we investigate the algorithms for answering iceberg queries, es- pecially for iceberg-cube computation. Recently, several algorithms have been pro- posed, including the PipeSort and the PipeHash algorithms proposed by Agrawal et al. [2], the Overlap algorithm proposed by Naughton et al [21], the PartitionedCube algorithm proposed by K . Ross and D . Srivastava [14] and the Bottom-Up algorithm (BUC) proposed by Beyer and Ramakrishnan [4]. A l l these algorithms except B U C are general C U B E computation algorithms, in the regard that they do not specifi- cally target iceberg-cube computation. They proceed in a top-down fashion, that is, computing from more dimensional group-bys to less dimensional group-bys. Many of them try to utilize previous sorting in the top-down traversal. B U C provides an- other efficient solution, specifically for threshold-set iceberg queries. 
It proceeds in a bottom-up fashion, trying to prune tuples which do not satisfy threshold as early as possible to reduce computation. We will discuss these two kinds of algorithms in more detail in Chapter 2 . We based our work on the algorithms mentioned above, however, we were interested in providing parallel solutions. The previous C U B E algorithms mainly proposed for running on stand alone machines were developed to execute on a single processor, so-called sequential algorithms. In this thesis, we propose several par- 2 allel algorithms for answering iceberg queries, and promote the benefits of using distributed computing platforms to solve problems. Our underlying architecture is a dedicated cluster of P C s . W i t h elegant parallel algorithms, these machines have the potential to achieve the performance of massive parallel machines at a much lower cost. We focused our work on practical techniques that could be readily im- plemented on low cost P C clusters using open source, L i n u x and public domain versions of the M P I message passing standard. - To improve the response time of iceberg .queries, two different solutions are explored: precomputing and online querying. Precomputat ion is a common technique used by many O L A P applications. Usually, precomputation computes a C U B E operator, extract ing multiple aggregates and saving the results on disks. It supports instant response if the precomputed re- sults match a user's queries. Towards efficient iceberg-cube precomputation wi th P C clusters, this thesis explores different trade-offs between parallelism, computa- tion and I / O . Assuming input data sets fit in main memory on each machine of the cluster, we developed several novel, parallel algorithms for iceberg-cube computa- tion and give a comprehensive evaluation in this thesis. Here is a summery of the parallel algorithms: • A lgo r i thm R P (Replicated Parallel B U C ) , is a straightforward parallel version of B U C . It is simple and introduces litt le overhead above its sequential version. However, algori thm R P is poor in dis tr ibut ing tasks and balancing workload. In an attempt to achieve better load-balancing, algorithm B P P (Breadth-first writing, Partitioned, Parallel B U C ) , was developed. B P P differs from R P in two key ways. F i r s t , the dataset is range-partitioned and distributed other than replicated in R P ; second, the output of cuboids is done in a breadth- first fashion, as opposed to the depth-first wri t ing in R P and B U C . Table 1.1 summarizes the key features of the algorithms. • Though B P P is better than R P concerning load-balancing, this improvement 3 Algorithms Writing Load Relationship Data Strategy Balance of cuboids Decomposition R P depth-first weak bottom-up replicated B P P breadth-first weak bottom-up partitioned A S L breadth-first strong top-down replicated P T breadth-first strong hybrid replicated Table 1.1: K e y Features of the Algor i thms is l imited when the raw data set skews on some attributes. Th is is pr imari ly because the task granularity of R P and B P P is relatively large and uneven. To consider load balancing as the utmost priority, algori thm A S L (Affinity SkipList) is developed. In A S L each cuboid is treated as a task. A S L uses an affinity task scheduling strategy to seek the relationship among tasks assigned to the same processor and maximize sort sharing among them. Thus A S L resembles to the top-down algorithms. 
A S L is also unique in the regard that it maintains the cells of a cuboid in a different da ta structure, namely a skip list. • A lgor i thm P T (Partitioned Tree) is a hybrid algori thm, combining both the idea of pruning from B U C and affinity scheduling from A S L . It processes tasks of slightly coarser granularity. The idea is to use binary part i t ioning to divide the cuboids into tasks as evenly as possible, in order to make the load well-balanced. The computation in each task proceeds in bottom-up fashion, however, the task assignment is processed by affinity scheduling in a top-down fashion. • T w o other algorithms based on a hash tree and a hash table were also devel- oped. The implementation based on a hash tree used up memory too rapidly that it fails to process large data set. The hash table based algorithm was im- plemented much like A S L , in terms of task definition and scheduling. However, its performance is no better than A S L in most cases. 4 T w o questions natually arise at this point: one, which algorithm is the best; and two, do we really need to know about all these algorithms? In considering the first question, we present a thorough performance evaluation of all these algorithms on a variety of parameters. The parameters include the dimensionality, the sparse- ness of the group-bys, the selectivity of the constraints, the number of processors, and the sizes of the data sets. W i t h respect to the second question, a key finding of our evaluation is that when it comes to iceberg-cube computat ion with P C clusters, it is not a "one-algorithm-fits-all" si tuation. Based on our results, we recommend a "recipe" which uses P T as the default a lgori thm, but may also deploy A S L under specific circumstances. Pu t t i ng parallel iceberg-cube algori thmic development and evaluation aside temporarily, we next consider the concept of " truly online". Precomputat ion can answer users' queries instantly if the query pattern can be predicted. However, if the threshold set by online queries differs from what the precomputation assumed, precomputed cuboids can no longer be used to answer those queries. Therefore, those queries have to be computed online. We posit a scenario that the input raw data set no longer fits in main memory. Only wi th this precondition wil l the query computat ion be large enough to necessitate applying parallelism. In the online aggregation framework proposed and studied by Hellerstein, Haas and Wang, an online query algori thm based on A S L was developed. Using the sampling technique, a user's online query can be responded to instantly. A n d with more and more data processed, the answer becomes more and more refined and accurate. Integrating C U B E precomputation and online querying computat ion together, this thesis gives a relative complete solution for the special problem domain: iceberg query computat ion. The outline of the thesis is as follows. Chapter 2 reviews key concepts and the main sequential algorithms for iceberg-cube computat ion. Chapter 3 introduces the various parallel algorithms we developed. Chapter 4 presents a comprehensive 5 experimental evaluation of these algorithms, and concludes with a recipe for pick- ing the best algorithms under various circumstances. Chapter 5 discusses online processing. F ina l ly , a conclusion is given in Chapter 6. 6 Chapter 2 Review The background material necessary for understanding the parallel algorithms to be introduced in Chapter 3 is presented in this chapter. 
We first discuss iceberg query, then the C U B E operator. A special C U B E operator, iceberg-cube, is intro- duced seperately. The last part of this chapter, Section 2.4 presents some sequential algorithms for C U B E and iceberg-cube computat ion. 2.1 Iceberg Query A n iceberg query is much like a regular aggregate query, except that it eliminates aggregate values that fall below some specified threshold after it performs an ag- gregate function over an attr ibute or a set of attributes. The prototypical iceberg query considered in this thesis is as follows for a relation R(targetl, target2, ..., targetk, rest, aggregateField) and a threshold T. S E L E C T targetl, target2,..., targetk, SVM(aggregateField) F R O M R G R O U P B Y targetl, targets,targetk H A V I N G count (rest) > T If the above iceberg query is applied to the relation R in Table 2.1, with T 7 targetl target2 rest aggregate Field Item Location Customer Sales Sony 25" T V Seattle Joe 700 J V C 21" T V Vancouver Fred 400 Sony 25" T V Seattle sally TOO J V C 21" T V L A sally 400 Sony 25" T V Seattle bob 700 Panasonic Hi -F i V C R Vancouver torn 250 Table '2.1: Example relation R 0 \P Over huge data set Cut off output by setting threshold Iceberg Query The outpjt is just the small tip of the Icebeig 'SELECT A, B, C, COUNTT^ FROM R GROUP BY A, B, C, ^HAVING COUNT!*) >= 2 j Figure 2.1: Iceberg Query = 2 and k = 2, the result would be the tuple <Sony 25" T V , Seattle, 2100>. We notice that relation R and the number of unique target values are typically huge (the iceberg), and the answer, that is, the number of frequently occurring targets, is very small (the tip of the iceberg). This situation is pictured in Figure 2.1. A n iceberg query becomes especially important when the amount of input data is tremendous, since data analysts or managers can not possibly go through all the detailed information within a huge data set. Usually, they only note frequently occurring behaviors, which are typical ly more important than unusual occurrences. 8 In realistic data analysis, data analysts often execute multiple iceberg queries, which G R O U P B Y on different number of dimensions. For example, they may want to know more detailed information if the previous query returns too few results. Af- terward they might like to "dri l l -down" by G R O U P B Y on more attributes. O n the other hand, if the previous query gives back too detailed and too much information, they may like to "roll-up" by giving less G R O U P B Y attributes in the upcoming query. A Generated report containing results from all those queries can be formu- lated in standard S Q L , but its representation is inconvenient. A s well as dri l l -down and roll-up, some other frequently used queries including histogram and cross-tab are also difficult to represent in standard S Q L [19]. 2.2 C U B E Operator To exceed the l imi ta t ion posed by the standard S Q L , as mentioned in Section 2.1, the C U B E operator was introduced in [8] by J . Gray et al . It generalizes the standard G R O U P B Y operator to compute aggregates for every combination of G R O U P B Y attributes. For instance, consider the following relation S A L E S (Model , Year, Color , Sales), shown in the lefthand table in Figure 2.2. W h e n C U B E is on R wi th G R O U P B Y attributes M o d e l , Year and Color , aggregate on attr ibute Sales ( S U M in this case), the result returned wil l contain the sum of Sales for the entire relation (i.e. 
no G R O U P B Y ) , for each i tem: (Model) , (Year), (Color) , for each pair: (Model , Year) , (Model , Color ) , and (Year, Color ) , and finally for each (Model , Year, Co lo r ) . The result is shown in the righthand table in Figure 2.2. Figure 2.3 shows the C U B E in a multi-dimensional array format. In O L A P terminology, the G R O U P B Y attributes are called "dimensions", the attributes that are aggregated are called "measures", and one particular G R O U P B Y , (e.g., (Model , Year)) , in a C U B E computat ion is called a "cuboid" or simply a "group-by". Three types of aggregate functions are identified in [8]. Consider aggregating 9 SALES Model Year Color Sales Chevy 1990 red 5 Chevy 1990 white 87 Chevy 1990 blue 62 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 blue 49 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 blue 71 Ford 1990 red 64 Ford 1990 white 62 Ford 1990 blue 63 Ford 1991 red 52 Ford 1991 white 9 Ford 1991 blue 55 Ford 1992 red 27 Ford 1992 white 62 Ford 1992 blue 39 Relation SALES SELECT Model, Year, Color SUM(Sales) FROM SALES CUBE BY Model, Year, Color SALES Model Year Color Sales A L L A L L A L L 942 Chevy A L L A L L 510 Ford A L L A L L 432 A L L 1990 A L L ' 343 A L L 1991 A L L 314 A L L 1992 A L L 285 A L L A L L red 165 A L L A L L white 273 A L L A L L blue 339 Chevy 1990 A L L 154 Chevy 1991 A L L 199 Chevy 1992 A L L 157 Ford 1990 A L L 189 Ford 1991 A L L 116 Ford 1992 A L L 128 Chevy A L L red 91 Chevy A L L white 236 Chevy A L L blue 183 Ford A L L red 144 Ford A L L white 133 Ford A L L blue 156 A L L 1990 red 69 A L L 1990 white 149 A L L 1990 blue 125 A L L 1991 red 107 A L L 1991 white 104 A L L 1991 blue 104 A L L 1992 red 59 A L L 1992 white 116 A L L 1992 blue 110 All Tuples in Relation SALES C U B E of SALES on attributes Model, Year and Color, where aggregate attribute is Sales. Figure 2.2: C U B E Operation on Relat ion S A L E S [8] 10 Aggregate Sum Group By (with total) By Color R E D W H I T E B L U E Cross Tab C h e v y Ford ByCobr R E D Sum W H I T E B L U E By Mike The Data Cube and The Sub-Space Aggregates By Year By Make & Year ByCobr&Year Sum1 t^ByMake&Cobr By Cobr Figure 2.3: Cube in Multi-dimensional Array Format [8] a set of tuples T. Let {Si \ ii = 1. . . n} be any complete set of disjointed subsets of T such that U; $ = T and (% Si = {}. • An aggregate function F is distributive if there is a function G such that F(T) = G{{F(Si) \ i=l,..«}). S U M , M I N and M A X are distributive with G = F. C O U N T is distributive with G = SUM. • An aggregate function JF is algebraic if there is an M-tuple valued function G and a function H such that F(T) = H({G(Si) \ i = 1...»}.)., and M is constant regardless of \T\ and n. A l l distributive functions are algebraic, as are Average, standard deviation, M a x N , and MinN. For Average, G produces the sum and count, and H divides the result. • An aggregate function F is holistic if it is not algebraic. For example, Median and Rank are holistic. 1 1 2.3 Iceberg-cube Computation The basic C U B E problem is to compute all aggregates as efficiently as possible. Its chief difficulty is that the C U B E computation is exponential with the number of dimensions: for ci dimensions, 2d group-bys are computed. The size of each group- by (cuboid) depends upon the cardinalities of its dimensions, possibly the product of the G R O U P B Y attributes' cardinalities. 
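To make the size of the problem concrete, the short Python sketch below enumerates all 2^d group-bys of a small relation and applies an iceberg condition to each. This is only an illustration, not code from the thesis; the toy relation, the dimension names A, B, C, the measure X and the minimum support value are all invented for the example.

```python
from itertools import combinations
from collections import Counter

# Toy relation with dimensions A, B, C and a measure X (made-up values).
rows = [
    {"A": "a1", "B": "b1", "C": "c1", "X": 5},
    {"A": "a1", "B": "b1", "C": "c2", "X": 3},
    {"A": "a1", "B": "b2", "C": "c1", "X": 7},
    {"A": "a2", "B": "b1", "C": "c1", "X": 2},
]
dims, minsup = ["A", "B", "C"], 2

# One group-by for every subset of the dimensions: 2^d cuboids in total.
for d in range(len(dims), -1, -1):
    for group_by in combinations(dims, d):
        counts, sums = Counter(), Counter()
        for r in rows:
            cell = tuple(r[a] for a in group_by)   # the cell this tuple falls into
            counts[cell] += 1
            sums[cell] += r["X"]
        # Iceberg condition: keep only cells with at least `minsup` tuples.
        iceberg = {c: sums[c] for c in counts if counts[c] >= minsup}
        print(group_by or ("all",), iceberg)
```

Even for three dimensions the loop produces eight cuboids, and it scans the data once per cuboid; the sequential and parallel algorithms discussed in this thesis exist precisely to share work across these 2^d passes and to prune cells that cannot meet the threshold.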
When the product of the cardinalities for a group-by is large relative to the number of the cells (partitions) that actually appear in the cuboid, we say the group-by is "sparse". When the number of sparse group-bys is large relative to the number of total number of group-bys, we say the. C U B E is sparse. As is well-recognized, given the large result size of the entire C U B E , especially on sparse data set, it is important to identify subsets of interest. Deriving from this background, the concept of an "iceberg-cube" was intro- duced in [4, 12]. the iceberg-cube was described as a variant of the C U B E problem, which al- lows us to selectively compute cells that satisfy a user-specified aggregate condition. It is essentially a C U B E for iceberg queries. For example, an iceberg-cube is easily expressed in SQL with the C U B E B Y clause: S E L E C T A , B, C, SUM(X) F R O M R where N is a count condition, called "min- C U B E B Y A, B, C H A V I N G COUNT(*) > N imum support" of a cell, or "minsup" for short. In this thesis, we only discuss this count condition; other aggregate conditions can be handled as well [4]. 2.4 Sequential C U B E Algorithms Al l C U B E algorithms uses a lattice view for discussion. Figure 2.4(a) depicts a sample lattice where A , B, C and D are dimensions. Nodes in the lattice represent 12 group-bys (cuboids). The group-bys are labeled according to their G R O U P B Y attributes. The edges in the lattice show potential computing paths. A l l of the C U B E algorithms in fact convert this lattice into a directed processing tree. Each node in a processing tree therefore has no more than one parent, because it is computed only once from its parent or from the raw data set. C U B E algorithms are classified into two categories according to their com- putat ion fashion. Algor i thms which follow paths from the raw da ta towards the to ta l aggregate value are called "top-down" approaches. Algor i thms which compute paths in the reverse direction are called "bottom-up" approaches. For the exam- ple shown in Figure 2.4(a), a top-down approach computes from A B C D , to A B C , to A B and eventually to A ; a bottom-up approach goes in the opposite direction. Figure 2.4(b) gives a sample processing tree of top-down algori thm. The processing tree of bottom-up algorithm is il lustrated in Figure 2.4(c). In the following, we wil l discuss some significant sequential C U B E algorithms proposed. C U B E algorithms can be viewed as having two stages: the planning stage and the execution stage. In the planning stage, the algorithms decide how to convert the lattice into a processing tree; in the execution stage, the algorithm computes cuboids. 2.4.1 Top-down C U B E algorithms C U B E algorithms always try to discover and take advantage of commonali ty between a node and its parent in the lattice view. For many top-down algorithms, they recognize that group-bys with common attributes can share, sorts, or partial sorts, and utilize those sharings. Taking the processing tree shown in Figure 2.4(b) as an example, A D represents the cuboid G R O U P B Y on A and D . If the data set has been sorted with respect to A and D in order to compute A D , then for computing cuboid A , the data set does not have to be re-sorted. We can simply accumulate the sums for each of the values in A . 
Apparently, cuboid A and A D share sort on 13 ABCD AB AC AD BC BD CD (a) 4-Dimension Lattice ABCD (b) Sample Processing Tree of Top-Down Algorithms 5 A B C D (c) Processing Tree of Bottom-Up Algorithms Figure 2.4: Lat t ice and Processing Trees for C U B E Computa t ion [4] 14 attribute A . Besides sort sharing, there are some other commonalities which were ex- ploited by top-down algorithms. Some of these, specified as opt imizat ion techniques, are listed by Sarawagi [2]: • Smallest-parent: Th i s aims at computing a group-by from the smallest previ- ously computed group-by. For example, we can compute group-by A B from group-by A B C and A B D . However, among the two potential parents, only the one wi th smallest size wil l be selected, because computing from the small parent wi l l lead to lower cost. • Cache-results: Th is technique tries to compute a group-by when its parent is st i l l in memory, hence,.reducing disk I / O . • Amortize scans: Th i s technique also aims at reducing disk I / O by amort izing disk reads by computing as many group-bys as possible together in memory. For instance, during scanning group-by A B C D , we can compute group-bys A B C , A C D , A B D and B C D at the same time. • Share-sorts: Sort-based algorithms use this technique to share sorting cost among multiple group-bys. • Share-partitions: Th is is specific to the hash-based algori thm. W h e n a hash table can not fit in the memory, data wil l be partitioned into chunks which do fit in memory. Once a chunk is read in, multiple group-bys wil l be computed in order to share the part i t ioning costs. In the following, we wil l discuss several sequential top-down algorithms. PipeSort, PipeHash and Overlap PipeSort and PipeHash algorithms are among the first algorithms for efficient C U B E computat ion. They were proposed by Sarawagi et al . in [2]. Bo th assume the cost 15 BC AB AC BD AD CD 5 15 5 15 4 14 5 15 5 15 1020 ABC ABD ACD BCD 10 30 15 40 5 20 45 130 ABCD 50 160 Figure 2.5: A n Example of 4-DimensionaI Lat t ice for A lgor i thm PipeSort [2] of each node in a lattice proportional to the product of the cardinalities of G R O U P B Y attributes and try to compute each cuboid from a parent having the smallest cost. However, The data structures of the two algorithms are different: PipeSort uses array and sorting is done prior to aggregation; PipeHash uses hash tables. Furthermore, PipeSort considers share-sorts opt imizat ion, t rying to minimize the number of sorts, whereas PipeHash focuses on share-partitions opt imizat ion. PipeSort distinguishes between two different costs attached to each node X in the lattice view: cost A ( X ) and cost S ( X ) . A ( X ) is induced when one child of X is created through aggregating without any sort on X . Ac tua l ly only one child of X can be computed with cost A ( X ) . For instance, for cuboid A B C D , only its child, cuboid A B C , can be computed without any sort on A B C D . For other children, if they are computed from A B C D , cost S ( A B C D ) is induced becasue resort on A B C D is necessary. In this way, the sorting cost is counted by PipeSort . Assuredly, cost S (X) is always greater than or equal to A ( X ) . In the planning stage, a processing tree with a minimum total cost, taking both A ( X ) and S ( X ) into account, is computed in a level-by-level manner, where level N contains all A^-dimensinal cuboids. 
W h e n computing on a level, the algo- r i thm determines what edges between the nodes in this level and the next level in 16 all t B A D 1 4 16 f T \ BA 1 AC 1 DB 4 14 5 15 + _ - - ' ~ ~ L-- "1".' BAD ACD DBC 15 40 5 20 45 130 CBAD 50 160 i raw data (a) Minimum Cost Sort Plan (b) Pipelines Executed Figure 2.6: A n Example of P l an and Pipelines for Algor i thm PipeSort [2] the lattice should be left in the final min imum cost tree. Since each edge has a cost attached, either A ( X ) or S ( X ) , the problem is converted into finding the minimum cost matching in a bipartite graph. Given the lattice shown in Figure 2.5, the final minimum cost plan becomes that shown in Figure 2.6(a). The pair of numbers un- derneath each group-by in the figure denote the A ( X ) and S (X) costs. The detailed plan computat ion is elaborated in [2]. After a plan is created, in the execution stage each path is computed in a pipeline manner. Figure 2.6(b) shows the pipelines' execution for the generated plan in Figure 2.6(a). The head of each pipeline implies a re-sort, from its parent in the processing tree. Like PipeSort , PipeHash aims at computing cuboids from their smallest par- ents. Since PipeHash takes hash tables as its da ta structure, no sorting is required. Therefore, each node in the lattice has only one cost, which is similar to A ( X ) in PipeSort . In the planning stage of P ipeHash a minimum spanning t r ee (MST) is computed based on the singular cost of each node. Figure 2.7(a) gives an example 17 of an M S T . Besides smallest-parent optimization, PipeHash also explores share-partitions and amortized-scans optimizations. It computes as many cuboids as possible if their parents are in memory. If the main memory is big enough to hold hash tables for all cuboids, PipeHash can finish the cube computation in one data scan without any sorting. If no enough memory is available, PipeHash partitions data on some selected attribute, then processes each partition independently. Although data partitioning solve the memory problem, the partitioning at- tribute limits computations to only include group-bys with that particular attribute. For example, from the M S T in Figure 2.7(a), we compute a C U B E on dimensions A , B , C and D. If we partition on A , then the partitions are only used to produce cuboids containing dimension A , including A B C D , A B C , A B D , A B , A C , A D and A . Other cuboids will be computed afterward from cuboids with attribute A . Ide- ally, they can fit in memory and no further partitioning is necessary. This makes M S T divide into subtrees, as shown in figure 2.7(b) and (c). By processing as large a subtree(or a set of subtrees) of the M S T as possible in memory, computing all nodes in it (or them) simultaneously, PipeHash favors optimizations cache-results and amortize-scans. Sarawagi compared PipeSort and PipeHash in [2]. PipeHash suffers two apparent problems, requiring re-hash for every group-by and requiring a significant amount of memory. This makes it can only outperform PipeSort as the data is dense. However, in this thesis, the problem domain is iceberg-cube computation in which data is supposed to be highly sparse; therefore, hash-based algorithms are not our major concern. However, we did implement some hash-based algorithms and they will be discussed in Chapter 3 . Overlap, proposed in [21], as well as PipeSort, considers sorting cost, but it deals with it in a different way. 
It tries to overlap as much sorting as possible by computing group-by from a parent with the maximum sort-order overlap. The 18 A B C D 2 A / 8 4 ^ 5 V AB AC BC AD CD BD 10 A If 2 0 ' ^ 20 y 20 • /d 12 20 30 JC ABD ACD BC ^ ^ 9 0 *7 5 0 - ^ 40 ABCD A IOO i Raw Data (a) Minimum Spanning Tree A \ i A B C ^ AB^D A C D A B C D i Raw Data (b) Subtree: Parti- tioned on A n Ltl * D >5V AB \ B C A A B C B C D Y L \ B C D (c) Remaining Subtrees Figure 2.7: P ipeHash on a Four At t r ibu te Group-by [2] 19 algori thm recognizes that if a group-by shares a prefix of G R O U P B Y attributes with its parent, then the parent consists of a number of partit ions, one for each value of the prefix. For example, since cuboid A B C and cuboid A C share a G R O U P B Y prefix A , the A B C group-by has | A | partitions that can be sorted independently on C to produce the A C sort order, where \A\ is the number of values for attr ibute A . Overlap always selects a parent for a cuboid which shares the longest G R O U P B Y prefix with that cuboid. Then the size of part i t ion is minimized. If several potential parents of a group-by share the same length of prefix wi th it , and then the smallest one wi l l be picked as the final parent. Overlap chooses a sort order for the root of the processing tree, then all subsequent sorts are some suffix of this order. The planning stage wi l l build a tree like that shown in Figure 2.4(b). Once this processing tree is formed, Overlap tries to fit as many partit ions in memory as possible. If a part i t ion of a group-by can fit in the main memory, then a subtree of the processing tree rooted by that group-by wil l be computed in a pipeline manner when the part i t ion is scanned in . Th is is expected to save much I / O costs for wri t ing intermediate results. The experiments show that Overlap performs consistently better than PipeSort and P ipeHash . However, [14] argues that Overlap on sparse C U B E S st i l l produces a large amount of I / O by sorting intermediate results. PartitionedCube and MemoryCube When the above C U B E algorithms are applied to sparse data sets, their performance becomes poor. Group-bys for sparse data sets are more likely to be large; buffering intermediate group-bys in memory requires too much memory. If the main memory is l imited, then intermediate group-bys will be written out and read into memory multiple times, which increases I / O dramatically. Moreover, predictation of the size of group-bys becomes very difficult, because the real size of a group-by may not be proportional to the product of cardinalities of the G R O U P B Y attributes. This 20 makes the cost of computat ion in PipeSort and P ipeHash no longer feasible. M o r e recently, Ross and Srivastava proposed an efficient top-down algorithm designed for large, high-dimensional and sparse C U B E s [14]. Thei r algorithm con- sists of two parts: Par t i t ionedCube and M e m o r y C u b e . Par t i t ionedCube partit ions the data on some attr ibute into memory-sized units and M e m o r y C u b e computes the C U B E on each in-memory part i t ion. Par t i t ioning in Par t i t ionedCube is very similar to P ipeHash . The algorithm chooses an attr ibute to part i t ion input into fragments. Then all cuboids containing that attribute wi l l be computed on each fragment separately. For example, if C U B E is to be computed on attributes { A , B , C , D } , we might part i t ion the input relation on attr ibute A , and get three partit ions. 
Then , we compute cuboids A B C D , A B C , A B D , A C D , A B , A C , A D and A for each part i t ion. B y taking the union of corre- sponding partial cuboids computed from each part i t ion, we get finally the complete cuboids. Then cuboid A B C D can be taken as the input to compute another cuboid. Par t i t ionedCube is called recursively if the fragments or cuboid A B C D is st i l l too big to fit in the memory; in that case, the data wi l l be further partitioned on other attributes. Figure 2.8(a) gives an i l lustration of this example. Once the input of Par t i t ionedCube fits in the memory, then M e m o r y C u b e can be applied. M e m o r y C u b e is a sort-based algori thm, which is its main difference from PipeHash . Like PipeSort , however, M e m o r y C u b e algorithm takes advantage of the Pipel in ing technique. It tries to minimize the number of pipelines and hence, the number of sorts. Its Paths algorithm (not to be discussed in detailed here), guarantees that the number of pipelines(paths) it generates wi l l be the minimum number of paths to cover all the nodes in the lattice. Figure 2.8(b) shows the paths for 4-dimension C U B E computat ion. There are six pipelines in total built from the input data. The cuboids in boxes are the heads of the pipelines. Sort ing is required to create the head of the pipeline, which is shown as dash lines in Figure 2.8(b), however, no sorting is needed in the pipelines. 21 ' AB BD i V C D | / \ A B C AD^B \B^B' \ \ I / ' ! • / A B C D R (Partitioned by A) T - i \ i Cuboid(BCD)(In memroy, B projected out; Cube(ABCD) (Partitioned by B, A projected out) (a) Partitioning all B AB B C BD 7T~ CD D A D A B C D C A D DAB _ — K B C D in-memory partition (b) Paths Found by MemoryCube Figure 2.8: Examples for Par t i t ionedCube and M e m o r y C u b e Algor i thms [14] 22 Since Par t i t ionedCube only considers pipelines in the M e m o r y C u b e , this algorithm tries to reduce the amount of I / O for intermediate results, and thus enhance the performance for sparse C U B E computat ion. Array-Based Algorithms W h e n using the array-based algorithms, as one proposed in [13], da ta sets are stored in a multi-dimension array, where each coordinate matches a C U B E attr ibute. A tuple's location in the array is determined by its value in each dimension. The algo- r i thm requires no tuple comparison, only array indexing. Unfortunately, if the data is sparse, the algorithms become infeasible, as the array becomes huge. Therefore, we find array-based algorithms are too l imited to warrant further discussion here. 2.4.2 Bot tom-Up C U B E Algori thm Our background search revealed only one bottom-up algori thm. It was introduced in [4] by K . Beyer and R . Ramakrishnan, and called the B o t t o m U p C u b e , B U C for short. It especially targets iceberg-cube computat ion. Sett ing thresholds in iceberg queries always cuts off a lot of cells in general cuboids. For the data set used in [4], and which was also used in our experiments, as many as 20% of the group-bys consisted entirely of cells wi th support as one. For the iceberg queries with minsup higher than 1, those group-bys do not need to be computed at a l l . Th is makes sense to consider a way to prune as early as possible in C U B E computat ion. Unfortunately, when we traverse a lattice in a top-down fashion, we can not prune cells which have insufficient support in any cuboid, until the last step. 
For example, suppose the threshold is set by specifying H A V I N G Count(*) >= 2 in iceberg-cube (the minsup is 2). Before we compute cuboid A B C from cuboid A B C D , we can not prune the cells with support as in 1, for example, a l b l c l d l ( m i n s u p : l ) and a l b l c l d 2 ( m i n s u p : l ) . Th is is because they contribute to the cells in A B C , whose supports are bigger than 1, for example, a l b l c l ( m i n s u p 23 is 2). However if we compute from cuboid A B C to cuboid A B C D in a bottom-up fashion, pruning is possible. Al though cuboid A B C D can not be directly computed from cuboid A B C , we can make sure that tuples which do not contribute to cells in cuboid A B C wi l l not contribute to cells in A B C D . We could therefore prune out those tuples in the raw data earlier, before the computation for A B C D proceeds. Thus, in B U C , a bottom-up approach is adopted. The idea is to combine the I / O efficiency of the Par t i t ionedCube algori thm, with minimum support pruning. The processing tree of B U C is il lustrated in Figure 2.4(c). The numbers in Figure 2.9 indicate the order in which B U C visits the group-bys. A skeleton of B U C is shown in Figure 2.9; we use the notation TA{ to de- note the set of all nodes in the subtree rooted at A t . For the example given in Figure 2.4(c), TB = {B, BC, BD, BCD}. Prefix in line 9 in Figure 2.9 indicates the current processed cuboid's G R O U P B Y dimensions. Take the B U C processing tree in Figure 2.9 as an example: B U C starts with cuboid all, and then cuboid with G R O U P B Y attribute A . For each value Vj in A , the data set is parti t ioned. Then for those partit ions with higher support than minsup, B U C is called recursively in a depth-first manner to process other dimensional group- bys (in lines 14-16). For example, for part i t ion Aui , in the first further recursion, B U C proceeds part i t ioning on attr ibute B , producing finer partitions AuiBui to partitions Av\Bvm. Afterward, B U C is recursively called on those finer partitions to produce some cells in cuboids A B C , A B C D and A B D . When all recursions for part i t ion Avi return, B U C proceeds in the same way on other parti t ions for AVJ. When all partitions based on A finish, B U C continues on attributes B , C and D in the same way. Figure 2.10 shows how B U C part i t ioning proceeds. The arrows shows the part i t ioning order. The gray area depicts those partitions pruned out based on the constraints(minsup in this case). Al though B U C can exploit pruning, it can not optimize by share-sort or 24 1. A lgor i thm B U C - M a i n 2. I N P U T : Dataset R wi th dimensions {At, A2,.. .Am}, the min imum support Spt. 3. O U T P U T : Qualified cells in the 2 m cuboids of the cube. 4. P L A N : 5. Star t ing from the bot tom, output the aggregate on " a l l " , and then a depth first traversal of the lattice, induced by {Ay, A2, • • -Am}. 6. f o r e a c h dimension A 2 (?' from 1 to m) d o B\JC(R,rAnSpt,{}) 7. C U B E C O M P U T A T I O N : 8. p r o c e d u r e B\JC(R,TA,, Spt, prefix) 9. . prefix = prefix U{.4;} 10. f o r e a c h combination of values Vj of the attributes in prefix d o 11. parti t ion R to obtain Rj 12. if (the number of tuples in Rj is > Spt) 13. aggregate Rj, and write out the aggregation to cuboid with cube dimensions indicated by prefix 14. f o r e a c h dimension Ak, k from i + 1 to m d o 15. call B\JC(Rj,TAK, Spt, prefix) 16. e n d f o r 17. 
e n d f o r Figure 2.9: A Skeleton of B U C 25 Partition on A Partition on AB Partition on ABC Partition on ABCD b2 b3 b4 b5 Figure 2.10: B U C Par t i t ioning 26 smallest parent techniques. Paper [4] compares B U C with Par t i t ionedCube. It claimes that B U C per- forms better than Par t i t ionedCube. The pruning significantly reduces running t ime when the min imum support is above 1. Even with minsup as 1, that is, full C U B E is computed, B U C st i l l outperforms it. 27 Chapter 3 Parallel Iceberg-cube Algorithms The key to success for an online system is the abil i ty to respond to queries in a t imely fashion. The compute and data intensive nature of iceberg-cube queries necessitates a high performance machine. In the past, this required expensive platforms, such as symmetric multiprocessor machines. In recent years, however, a very economical alternative has emerged: a cluster of low-cost commodi ty processors and disks. P C - clusters provide several advantages over expensive multiprocessor machines. F i r s t , in terms of raw performance, processor speeds are similar to and often exceed those of multiprocessor architectures. Recent advancements in system area networks, such as Myr ine t , standards like V I A , and 100Mbit or Gigabi t Ethernet have significantly improved communication bandwidth and latency. Second, although I / O and the use of commodity disks are weaknesses in these systems, as we show, parallelism can easily be exploited. T h i r d , the affordability of PC-clusters makes them attractive for small to medium sized companies and they have been the dominant parallel platform used for many applications [5], including association rule mining [18]. In the remainder of this thesis, we wil l discuss various novel algorithms we have developed for parallelizing iceberg-cube computat ion. Our focus is on practical 28 techniques that can be readily implemented on low cost P C clusters using open source, Linux and public domain versions of the M P I message passing standard. As our results apply to low cost clusters, the question arises of how much our results may generalize to higher cost systems. In Section 4, we examine how the various algorithms would speed up in* the presence of more nodes/processors in the cluster. Thus, if the key difference between a low cost and a high cost cluster is only the number of nodes, then our results will be applicable. However, if the key difference is on the underlying communication layer, then our results may not be applicable. A l l of the algorithms to be presented use the basic framework of having a planning stage and an execution stage. In the case of parallel algorithms, the planning stage involves (i) breaking down the entire processing into smaller units called tasks, and (ii) assigning tasks to processors, where now the objective is to minimize the running time of the processor that takes the longest time to complete its tasks. To simplify our presentation, we do not include the aggregation for the node "all" as one of the tasks. This special node can be easily handled. Furthermore, it is assumed that the initial dataset is either replicated at each of the processors or partitioned. The output, that is, the cells of cuboids, remains distributed where processors output to their local disks. In this section, we introduce the algorithms. As shown in Table 1.1, the algorithmic space that we explore involves the following issues: • the first issue is how to write out the cuboids. 
Because B U C is bottom-up, the writing of cuboids is done in a depth-first fashion. As will be shown la,ter, this is not optimized from the point of view of writing performance. This leads us to develop an alternative breadth-first writing strategy; • the second issue is the classical issue of load balancing. This issue is intimately tied to the definition of what a task is. Different algorithms essentially work with different notions of a task. In general, when the tasks are too coarse- grained, load balancing is not satisfactory. However, if the tasks are too fine- 29 grained, a lot of overhead is incurred; • when it comes to iceberg-cube computat ion, an important issue is the strategy for traversing the cube lattice. A s discussed earlier, top-down traversal may exploit share-sort, whereas bottom-up traversal exploits pruning based on the constraints. Our algorithms consider these possibilities; in fact, one of the algorithms combines the two strategies in an interesting way; • as usual, for parallel computat ion, we explore whether data part i t ioning is effective. 3.1 Algorithm R P Recall from Figure 2.4(c) that the processing tree of B U C consists of independent subtrees rooted at each of the dimensions. Thus, in the algorithm called Replicated Parallel B U C , R P for short, each of these subtrees becomes a task. In other words, for a cube query involving m attributes, there are m tasks. Processor assignment is s imply done in a round-robin fashion. W i t h this simple assignment strategy, if there are more processors than tasks, some processors will be idle. The data set is replicated on all machines in a cluster. Each processor reads from its own copy of the dataset, and outputs the cuboids to its local disk. The skeleton of R P is showed in Figure 3.1. Figure 3.2 gives an example of computing a 4-dimensional C U B E on a cluster of 4 P C s . In total , 4 tasks are created: subtrees rooted by A , B , C and D respectively. Each machine compute one task. 3.2 Algorithm B P P While R P is easy to implement, it appears to be vulnerable in at least two of its aspects. F i r s t , the definition of a task may be too simplistic in R P . The division of 30 1. A lgor i thm R P 2. I N P U T : Dataset R wi th dimensions { A i , A2, • • • Am} and minimum support Spt; 3. O U T P U T : The 2 m cuboids of the data cube. 4. P L A N : •5. Task definition: identical to B U C , i.e., subtrees rooted at A , 6. Processor assignment: assign a processor, in round robin fashion, to each subtree rooted at dimension A ; (i from 1 to ra) 7. C U B E C O M P U T A T I O N (for a processor): 8. paral lel do For each subtree rooted at dimension A ; assigned to the processor 9. call B\JC(R, TAI , Spt, {}) (with output writ ten on local disks) 10. end do Figure 3.1: A Skeleton of the Replicated Paral lel B U C Algo r i t hm Raw Data Replicated Figure 3.2: Task Assignment in A lgor i thm R P 31 the cube lattice into subtrees is coarse-grained. One consequence is that some tasks are much bigger than others. For example, the subtree rooted at A , TA, is much larger than that rooted at C , Tc- Thus, load balancing is poor. Second, B U C is not optimized in wri t ing performance. To address these problems, we developed the algori thm called Breadth-first wri t ing Part i t ioned Paral lel B U C , or B P P for short. 3.2.1 T a s k D e f i n i t i o n a n d P r o c e s s o r A s s i g n m e n t To achieve better load balancing, B P P tries to get finer-grained tasks by range part i t ioning on each attr ibute. 
Th is is motivated by Ross and Srivastava's design of the Par t i t ioned-Cube, which attemps to part i t ion data into chunks which fit in memory [14]. B P P partit ions data in the following way: • for a given attr ibute A,-, the dataset R is range-partitioned into n chunks (i.e., • •., Ri(n))i where n is the number of processors. Processor Pj keeps its copy R{(j) on its local disk; • note that because there are m attributes, the above range part i t ioning is done for each attribute. Thus , processor Pj keeps m chunks on its local disk ( R\(j), • • •, i ? m ( j j ) . A n y of these chunks may have some tuples in common; • range part i t ioning itself for the m attributes can be conducted in parallel, wi th processor assignment done in a round-robin fashion. For instance, processor i may part i t ion attr ibute Ai, then A ; + n , and so on. Notice that as far as B P P execution is concerned, range part i t ioning is basically a pre-processing step. If there are m cube attributes, then there wil l be a total mxn chunks. Each chunk corresponds to one task. The processor who has the chunk in the local is responsible for processing i t . If processor Pj process chunk R^j), where R{(j) is produced by range part i t ioning on attribute i , Pj computes the (partial) cuboids in the subtree rooted at A ; . These cuboids are partial because Pj only deals with 32 PO range-partition R on A -BML- , , Ral_ , Ra2_ _Ra3 J P1 range-partition R on B RbO Rbl Rb2 P2 range-partition R on C Egg E B L ) I t I I P3 range-partition R on D -J<1D Ethernet P O PI P2 P 3 Figure 3.3: Task Assignment in B P P the part of the data it controls, in this case, R${j\- The cuboids are completed by merging the output of al l n processors. Figure 3.3 illustrates task allocation and process in B P P . Each of the 4 pro- cessors in the cluster takes on the responsibility of range part i t ioning the raw data set R on one dimension and distr ibuting the resulting partitions across the pro- cessors. Since there are 4 cube dimensions in total , after data part i t ioning each processor gets 4 chunks. D a t a chunks in the same color on the same row are parti- tioned on the same attr ibute and have no overlap. However, data chunks located in the same processor are partit ioned on different attributes and may have overlap. A processor takes chunk R^j) to compute subtree %, for example, P I would use Rsi) to compute subtree 7c- B y part i t ioning data across processors, B P P achieves better load balancing than R P . If data can be evenly distributed among processors, then the load may be very well balanced in a homogeneous environment. 33 Cuboid A Cuboid A B Cuboid A B C a l - ( D f l > (10)<2^ a l b l - (2)<3> /'""V a l b l c l (3)<7> -,. > a l b l c 2 (4)<8> <f S a l b l c 3 (5)<9> ^ a l b 2 c l (7)<10> alb2c2 (8)<11>^ ; \ l b 2 c 3 (9)<12> { Figure 3.4: Depth-first Wr i t i ng vs Breadth-first Wr i t i ng 3.2.2 Breadth-first Writ ing B U C computes in a bottom-up manner, and the cells of the cuboids are writ ten out in a depth-first fashion. In the situation shown in Figure 3.4, there are three attributes A , B and C , where the values of A are a\, a 2 , and so on, values of B are b\ and 62 1 values of C are c i , C2 and C3. A s shown in Figure 2 .9 , the tuples of a i are aggregated in line 14 (assuming that the support threshold is met), and the result is output. The recursive call in line 15 then leads the processing to the cell a\b\, then to the cell a\b\C\, then to a\b\C2, and so on. 
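The visiting order just described, and the breadth-first alternative that gives this section its name, can be sketched in a few lines of Python. This is a toy illustration over the attribute values of Figure 3.4, not the thesis's code: the depth-first generator interleaves cells of cuboids A, AB and ABC, while the breadth-first generator finishes one whole cuboid before starting the next.

```python
# Attribute values as in Figure 3.4.
A, B, C = ["a1", "a2"], ["b1", "b2"], ["c1", "c2", "c3"]

def depth_first():
    # BUC-style: recurse into a cell's refinements before moving to its sibling,
    # so consecutive outputs jump between cuboids A, AB and ABC.
    for a in A:
        yield (a,)                     # a cell of cuboid A
        for b in B:
            yield (a, b)               # a cell of cuboid AB
            for c in C:
                yield (a, b, c)        # a cell of cuboid ABC

def breadth_first():
    # BPP-style: write all of cuboid A, then all of AB, then all of ABC.
    for a in A:
        yield (a,)
    for a in A:
        for b in B:
            yield (a, b)
    for a in A:
        for b in B:
            for c in C:
                yield (a, b, c)

print(list(depth_first())[:5])   # ('a1',), ('a1','b1'), ('a1','b1','c1'), ('a1','b1','c2'), ...
print(list(breadth_first())[:5]) # ('a1',), ('a2',), ('a1','b1'), ('a1','b2'), ('a2','b1')
```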
In Figure 3.4, the number in round brackets beside each node denotes the order in which the cell is processed and the output for depth-first wri t ing; and the black solid lines denote the wri t ing sequence. Note that these cells belong to different cuboids. For example, the cell 0,1 belongs to cuboid A , the cell a\b\ to cuboid AB, and the cells a i & i C i and a\b\C2 belong to A B C . Clear ly in depth-first wri t ing, the wr i t ing to the cuboids is scattered. This incurs a high I / O overhead. We could possibly use buffering to alleviate the 34 1. A l g o r i t h m B P P 2. I N P U T : Dataset R wi th dimensions {A\, A2, • • -A.m} and min imum support Spt 3. O U T P U T : The 2 m cuboids of the da ta cube 4. P L A N : 5. Task definition: (partial) cuboids of subtrees rooted at A{ 6. Processor assignment: as described in Section 3.2.1 7. C U B E C O M P U T A T I O N (for the processor Pj): 8. parallel do 9. for each A,- (i from 1 to TO) do 10. call B P P - B U C ( J R i ( j ) , TAi,Spt, {}) (with output wri t ten on local disks) 11. end for 12. end do 13. Subroutine B P P - B U C ( i ? , TA,:, Spt, prefix) 14. prefix = prefix U { A ; } 15. sort R according to the attributes ordered in prefix 16. R' = R 17. for each combination of values of the attributes in prefix do 18. if (the number of tuples for that combination > Spt) 19. aggregate on those tuples, and write out the aggregation 20. else remove all those tuples from R' 21. end for 22. for each dimension Ak, k from i + 1 to m do 23. call BPP-B\JC{R', TAk, Spt, prefix) 24. end for Figure 3.5: A Skeleton of the B P P Algor i thm scattered wri t ing to the disk. However, this requires a large amount of buffering space, thereby reducing the amount of memory available for the actual computat ion. Furthermore, many cuboids may need to be maintained in the buffers at the same time, causing extra management overhead. In B P P , this problem is solved by breadth-first wri t ing. Returning to the example in Figure 3.4, B P P completes the wri t ing of a cuboid before moving on to the next one. For example, the cells a\ and a2, which make up cuboid A , are first computed and written out. Then all the cells in cuboid A B are outputted, and 35 10 Cost Comparision between RP and BPP Processors Figure 3.6: I /O comparison between B P P (Breadth-first writing) and RP(Depth-first writing) on 9 dimensions on a dataset with 176,631 tuples, input size is 10Mbyte and output size is 86Mbyte. so on. In Figure 3.4, the number in angled brackets beside each node denotes the order in which the cell is processed for breadth-first writing, while the red dash lines depict its writing sequence. Figure 3.5 gives a skeleton of B P P . As described in Section 3.2, the pre- processing step of range partitioning the dataset assigns to each processor Pj of the appropriate tasks. In the main subroutine B P P - B U C , breadth-first writing is implemented by first sorting the input dataset on the "prefix" attributes in line 15 in the skeleton. In our example, if the prefix is A , meaning that the dataset has already been sorted on A , then line 15 sorts the dataset further on the next attribute B . The loop starting 36 from line 17 then completes breadth-first wri t ing by computing and output t ing the aggregation of all the cells in the cuboid A B . Because some cells may not meet the support threshold, there is the extra complicat ion in B P P - B U C of the need to begin pruning as early as possible. Th is is the purpose of lines 16 and 20. 
Note that as opposed to what is presented in line 16 for simplicity, in our implementation we do not actually create a separate copy of the data. Instead, an index is used to record the start ing and ending positions in the sorted dataset to indicate that all those tuples should be skipped for subsequent calls to B P P - B U C . Breadth-first I / O is a significant improvement over the scattering I / O used in B U C . For the baseline configuration to be described in Section 4 , the total I / O time R P took to write the cuboids was more than 5 times greater than the total I / O time for B P P . Figure 3.6 gives the I / O comparison between R P (depth-first writing) and B P P (breadth-first wri t ing) . 3.3 Algorithm ASL Although B P P gives a solution for load balancing, this solution is sti l l not satisfac- tory under some circumstances. The potential downfall of B P P comes from the fact that the amount of work each processor does is dependent on the ini t ia l part i t ioning of the data. However, the size of the task depends on the degree of skewness in the data set and the order in which the dimensions are sorted and partit ioned. If the skewness is significant, the tasks may vary greatly in size, thereby reducing load balancing. For example, for an at tr ibute named Gender, only two possible values, Female and Male , can be assigned to it. Range part i t ioning then can produce only 2 chunks. Even if we have more than 2 processors, only two of them wil l get applied to chunks; the others wil l be relatively lightly loaded. This motivates the development of another algori thm, called Affinity Skip Lis t , or A S L for short. A S L tries to create tasks that are as small as the cube 37 Search Path 25 26 NIL Original list, 17 to be inserted 12 17 19 - H H 21 25 26 NIL List after insertion Figure 3.7: P ic tor ia l Descript ion of Steps Involved in Performing an Insertion [22] lattice allows: each node in the lattice makes a task. Th i s allows efficient use of the processors, quite independent of the the skewness and dimensionality of the data set. In the following, two key features of A S L are presented: the data structure used, and the processor assignment. 3.3.1 Using Skip lists A skip list is a data structure proposed by W . Pugh [22]. It is much like a linked list wi th addit ional pointers. Figure 3.7 is an example of a skip list. The lowest levels of nodes make a linked list, the higher levels of nodes are used for efficient search and insert operations. A s showed in Figure 3.7, searching or insertion always starts from the highest level of the head node. If the next link emitted from that level points to a node that contains an element bigger than the element which is to be inserted or searched, we drop one level from the starting node, otherwise, we follow the link to the next node, and try the next step from there. Figure 3.7 shows how an element with key 16 is added into a skip list. The number of levels a new inserted element should have is determined randomly, but not allowed to exceed a threshold set by users. The benefits of using a skip list are threefold. F i r s t , A S L exhibits good aver- age case behavior for insertion and searching, quite similar to that of a balanced tree, 38 yet the implementation details are much simpler. Second, each node in the struc- ture requires very li t t le storage overhead. T h i r d , skip list incrementally increases as more elements are added, and the sort order of the list is always guaranteed. 
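For concreteness, here is a compact sketch of the insert-or-update operation on a skip list (our own simplification of Pugh's structure, with the 16-level cap mentioned above; it is not the thesis implementation):

```python
import random

MAX_LEVEL = 16                      # the cap on forward links mentioned above

class Node:
    def __init__(self, key, level, count=0):
        self.key = key              # e.g. the tuple of group-by values identifying a cell
        self.count = count          # support count; other aggregates are updated alike
        self.forward = [None] * level

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 1

    def add(self, key, count=1):
        """Insert the cell `key`, or just update its count if it already exists."""
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in range(self.level - 1, -1, -1):       # search from the top level down
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        hit = node.forward[0]
        if hit and hit.key == key:                      # cell already present: aggregate
            hit.count += count
            return
        lvl = 1                                         # pick a random level for the node
        while lvl < MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        self.level = max(self.level, lvl)
        new = Node(key, lvl, count)
        for i in range(lvl):                            # splice it into each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

cells = SkipList()
for row in [("a1", "b1"), ("a1", "b2"), ("a1", "b1")]:
    cells.add(row)                  # ("a1", "b1") ends up with count 2; list stays sorted
```

Each insertion lands in sorted position, so the set of cells can be built incrementally, one tuple at a time, and is always ready to be traversed in order.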
Th i s is very important , because before sorting the da ta set need not be entirely loaded. A S L uses skip lists to maintain the cells in cuboids. Whi le it scans the raw data set, A S L builds skip lists incrementally. If there, is a node in the skip list representing the new read-in tuple, then the aggregates and support counts of that node are updated; otherwise a new node wil l be inserted into the skip list. In theory, if there are k cuboids and if there is enough memory, A S L can maintain all k skip lists simultaneously for one scan of the data set. B u t for the data sets used in our experiments, this opt imizat ion brings minimal gain, so we did not explore that here. 3.3.2 Affinity Assignment Now, let's consider the task assignment policy of A S L . In order to achieve good load balancing, A S L defines tasks with very fine granularity. It takes each cuboid as a task and assigns it to processors dynamically. Dur ing task scheduling, A S L adopts a top- down approach to traversing the cube lattice. It always tries to assign uncomplete high dimensional cuboids to processors, while taking affinity into account. Once a processor finishes one task, it is assigned a new one which has some kind of affinity with the previously one. For example, i f a processor has just created the skip list for the task A B C D , then it makes sense for the processor to be assigned the task of computing the cuboid for A B C . The previous skip list for A B C D can simply be reused to produce the results for A B C . In the following, we refer to this situation as "prefix affinity". In another si tuation, if a processor has just created the skip list for A B C D , this skip list is st i l l useful if the processor is next assigned the task of computing the cuboid B C D , because now we need only take the counts of each cell in A B C D , and add them to the counts of the appropriate cells in the skip list for B C D . Then 39 Algor i t hm A S L 1. I N P U T : Dataset R cube dimensions { A i , . . . , Am}; min imum support Spt 2. O U T P U T : The 2m cuboids of the data cube 3. P L A N : 4. Task definition: a cuboid in the cube lattice 5. Processor assignment: a processor is assigned the next task based on prefix or subset affinity, as described in Section 3.3.2 6. C U B E C O M P U T A T I O N (for a processor): 7. parallel do 8. let the task be with dimensions A , - , . . . , Aj 9. if Ai,..., Aj is the prefix of the previous task or the first task 10. let C denote the skip list from that task 11. call prefix-reuse(C, Spt, Ai,..., Aj); 12. else if { A j , . . . , Aj} is a subset of the set of dimensions of the previous task, or the set of dimensions of the first task 13. let C denote the skip list from that task 14. call subset-create(C, Spt, Ai,..., Aj) 15. else call subset-create(R, Spt, Ai,..., Aj) 16. end do 17. Subroutine prefix-reuse(C, Spt, Ai,..., Aj) 18. Aggregate C based on A , - , . . . , Aj 19. Wri te out the cells if the support threshold is met 20. Subroutine subset-create(C, Spt, Ai,..., Aj) 21. initialize skip list L 22. for each cell (tuple), in C do 23. find the right cell in L (created if necessary) 24. update the aggregate and the support counts accordingly 25. end for 26. Traverse L , and write out the cells if the support threshold is met Figure 3.8: A Skeleton of A S L 40 groupings done for the skip list for A B C D are not wasted. For example, suppose in A B C D , a cell corresponds to the grouping of aib\C\di. 
For the iv tuples in the original dataset that belong to this cell, the current aggregate and support counts can readily be used to update the corresponding counts for the cell b\C\di for BCD. There is no need to re-read the w tuples and aggregate again. In the following, we refer to this situation as "subset affinity". Figure 3.8 shows a skeleton of A l g o r i t h m A S L . To implement prefix or sub- set affinity, a processor is designated the job of being the "manager" responsible for dynamical ly assigning the next task to a "worker" processor. Specifically, the manager does the following: • first tries to exploit prefix affinity, because if that is possible, the worker pro- cessor then has no need to create a new skip list for the current t ask /cuboid . The previous skip list can be aggregated in a simple way to produce the re- sult for the current task. This is executed by the subroutine prefix-reuse in Figure 3.8; • then tries to exploit subset affinity, if prefix affinity is not applicable. Instead of scanning the dataset, the worker processor can use the previous skip list to create the skip list for the current task. This is executed by the subroutine subset-create in Figure 3.8; • assigns to the worker a remaining cuboid with the largest number of dimen- sions, if neither prefix nor subset affinity can be arranged. In this case, a new skip list is created from scratch. Clearly, the last situation ought to be avoided as often as possible. In our implemen- tation of A S L , each worker processor maintains the first skip list it created. Because A S L is top-down, the first skip list corresponds to a cuboid with a large number of dimensions. Th is maximizes the chance of prefix and subset affinity. The affinity scheduling is very helpful for sort-sharing, especially when the 41 number of available processors is small . B u t as more processors are available, the affinity relationship between tasks assigned to the same processor tends to be weak. For example, if we have 2 processors, we may very possibly assign both A B C D and A B C to one machine; however, if we have 16 machines, this possibility becomes light, since we don't want machines to lie idle just to maintain strong affinity. In such a case, one processor may compute A B C D and another may compute A B C , then both would need to sort on A B C . Duplicated sortings then occur. Since A S L ' s task scheduling is dynamic, depending on how soon each proces- sor finishes its task, the lattice traversal sequence can not be determined in advance. Different runnings very likely result in different traversal sequences. Th i s makes A S L quite different from other top-down algorithms, such as PipeSort or P ipeHash . 3.4 Algorithm PT B y design, A S L does a very good job of load balancing. However, A S L may be vulnerable in two areas. F i rs t , the granularity of the tasks may be too fine - to an extent that too much overhead is incurred. Th is is particularly true where prefix or subset affinity cannot be well exploited, and thus not much sort sharing is applica- ble. Second, A S L ' s top-down lattice traversal cannot prune those cells which lack min imum support from skip lists. A s A S L executes, whether a cell has minimum support or not cannot be determined until the data set has been scanned entirely. 
Furthermore, at the end of the scan, even if there is a cell below the minimum support, this cell sti l l cannot be pruned, because its support may contribute to the support of another cell in subsequent cuboid processing. In an effort to combine the advantages of pruning in a bottom-up algorithm on one hand, with load balancing and sort-sharing of top-down lattice traversal on the other, we developed the algorithm called Part i t ioned Tree, ( P T ) . Recall that in R P and B P P , tasks are at the granularity level of subtrees rooted at a certain dimension, for example, TA{- In A S L , tasks are merely nodes 42 A B C D , ABC ABD / ACD / BCD AB ' / A C AD / BC BD / CD D Bottom-Up Cubiod Computation Taskl Task2 / Task3 N Task4 Top-Down Affinity Scheduling Figure 3.9: B ina ry Div i s ion of the Processing Tree into Four Tasks in the cube lattice. To strike a balance between the two definitions of tasks, P T works with tasks that are created by a recursive binary division of a tree into two subtrees, each having an equal number of nodes. B ina ry division is achieved by simply cutt ing the farthest left edge emitted from the root in a B U C processing tree or subtree in recursive callings. For instance, the B U C processing tree shown in Figure 2.4(c) can be divided into two parts: TA and Taa — TA- A further binary division on TA creates the two subtrees: TAB and TA — TAB- Similarly, a further division on Taii — TA creates these two subtrees: TB and Taa — TA — TB- Figure 3.9 shows the four subtrees. Each of these four subtree makes a task. Obviously, each time binary division is applied, two subtrees of equal size are produced. Through binary division, we finally achieve tasks of same size and appropriate granularity. Combin ing dynamic scheduling and binary division nicely solves the load balancing problem in P T . Like A S L , P T also exploits affinity scheduling. Dur ing processor assignment, the manager tries to assign a task to a worker processor that can take advantage of prefix affinity based on the root of the subtree. Note that in this case, subset affinity is not applicable. F rom this standpoint, P T is top-down. B u t interestingly, because 43 1. A lgo r i t hm P T 2. I N P U T : Dataset R cube dimensions {A\,..., Am}; min imum support Spt 3. O U T P U T : The 2 m cuboids of the data cube 4. P L A N : 5. Task definition: a subtree created by repeated binary part i t ioning 6. Processor assignment: a processor is assigned the next task based on prefix affinity on the root of the subtree 7. C U B E C O M P U T A T I O N (for a processor): 8. parallel do 9. let the task be a subtree T 10. sort R on the root of T (exploiting prefix affinity if possible) 11. call B P P - B U C ( J R , T, Spt, {}) 12. end do Figure 3.10:' A Skeleton of P T each task is a subtree, the nodes within the subtree can be traversed/computed in a bottom-up fashion. In fact, P T calls B P P - B U C , which offers breadth-first wri t ing, to complete the processing. In Figure 3.9, the roots of each subtree, enclosed in boxes, actually make up a small tree. The scheduling just happens on this small tree, similar to A S L . Once a processor gets a task, that is, a subtree, it computes it in a bottom-up manner, much like computing an R P ' s task. In this way, we seamlessly combine top-down and bottom-up methods, getting the benefits of both pruning and sort-sharing. Figure 3.10 shows a skeleton of P T . 
The step that requires elaboration is line 9, namely the exact definition of T • In general, as shown in Figure 3.9, there are two types of subtrees handled by P T . The first type is a "full" subtree, which means that all the branches of the subtree are included. For example, TAB is a full subtree. The second type is a "chopped" subtree, which means that some branches are not included. The subtrees TA — TAB and T„.u — TA — TB are examples. In line 11, depending on which type of subtree is passed on to B P P - B U C , B P P - B U C may execute in a slightly different way. Specifically, for the loop shown on line 22 in Figure 3.5, if a full subtree is given, no change is needed. Otherwise, the loop needs 44 to skip over the chopped branches. Since P T treats each final subtree resulting from binary division as a task, in an extreme case binary division wil l eventually create a task as each node in the cube lattice, as in A S L . Since task granularity in A S L might be too fine, in P T a parameter is used to determine when binary division stops, thus defining how fine tasks can be. The parameter is set as the ratio of the number of tasks to the number of available processors. The higher the ratio, the better the load balancing but the less pruning can be explored in each. task. Determining the parameter enables a tradeoff between load balancing and pruning. In Figure 3.9, the dotted line between " B o t t o m - U p C u b o i d Computa t ion" and "Top-Down Affinity Scheduling" depicts this tradeoff. M o v i n g up the line means letter load balancing; moving down the line means more pruning. P T wisely leaves this decision up to applications. For the experimental results presented later, we used the parameter "32n" to stop the division, once there are so many tasks (where n is the number of processors). 3.5 Hash-based Algorithms We also implemented two hash-based C U B E algorithms. In the following, we wil l • briefly discuss them. 3.5.1 Hash Tree Based Algori thm This algorithm was developed after B P P proved to ha,ve poor load balancing. Since B P P ' s performance is greatly affected by data skewness, which we could not change, it appears there was no way to improve it. However, considering most Associat ion Rules M i n i n g ( A R M ) algorithms proceed in a bottom-up fashion, also taking ad- vantage of pruning, we then thought about applying the techniques of parallel A R M to C U B E computat ion. The prototypical application of A R M is a "market-basket analysis", where the items represent products, and the records in a database represent point-of-sales 45 Items Database Frequent itemsets (min_sup = 50%) Jane Austen A Agatha Christie C Sir Arthur Conan Doyle D Mark Twain T P.G. Wodehouse W Transactioi Items 1 A C T W 2 C D W 3 A C T W 4 A C D W 5 A C D T W 6 C D T Association rules (min_conf = 100%) A » - C (4/4) A C - A » - W (4/4) AT " A * -CW(4 /4 ) AT— D * - D (4/4) A W - T * -C(4 /4 ) D W - W * - C (5/5) TW - - W (4/4) - C (3/3) - W (3/3) - C (4/4) -C (3/3) - A (3/3) TW AT TW A C T Support Itemsets 100% (6) C 83% (5) W, cw 67% (4) A, D, T, AC, AW CD, CT, ACW 50% (3) AT, DW, TW, ACT, ATW, CDW, CTW, ACTW -»-C(3/3) -» -CW(3/3) " * - A C (3/3) - * - W (3/3) ATW * - C (3/3) CTW * - A (3/3) Figure 3.11: Frequent Itemsets and Strong rules for a Bookstore Database [20] data at large grocery stores or department stores. Each record contains several items. The objective of A R M is to generate all rules with specified confidence and support. 
A n example rule might be, "90% of customers buying product {A,B,C} also buy product {D,E} .", where the confidence of the rule is 90%. In A R M terminoloy, A,B,C,D and E in this rule are called "items"; {A,B,C} and {TJ, E} are called "itemset". Later, we wil l use "k-itemset" to denote an itemset containing k items. A s well as C U B E , A R M aslo uses "support" to indicate how frequent an itemset occurs as a subset in transactions. Users can specify a "minimum support" (minsup) and "minimum confidence" (minconf) in their queries. M o s t A R M algorithms involve the following steps: 1. F i n d all frequent itemsets satisfying some specified minimum support . 2. Generate strong rules having minimum confidence from the frequent itemsets. The first step of A R M is much like a iceberg-cube problem if we imagine items are attributes with only one value. Then , generating all frequent itemsets means generating all different dimensional group-bys above a specified threshold(minsup). Consider the example bookstore-sales database shown in Figure 3.11. There 46 ABCEF A+_\ BCDE B +. J CEF C+ EF Candidate Hash Tree Figure 3.12: Subset Operat ion on the Root of a Candidate Hash Tree [23] are five different items (names of authors the bookstore carries), I = {A, C, D, T, W}. The database comprises six customers who bought books by these authors. F i g - ure 3.11 shows all the frequent itemsets contained in at least three customer trans- actions, that is, minsup = 50%. The figure also shows the set of all association rules with minconf = 100%). The A p r i o r i algorithm by Rakesh Agrawal and colleagues [20] has emerged as one of the best A R M algorithms, and also serves as the base algorithm for most par- allel algorithms. Apr io r i uses a complete, bottom-up search, iteratively enumerating from frequent 2-itemsets to higher dimensional frequent itemsets. The algorithm has three main steps: 1. Generate candidates of length k from the frequent (A;-l)-itemsets, by a self-join on F f c _ i . For example, for F 2 = {AC, AT, AW,CD,CT,CW,DW,TW}, we get C3 = {ACT, ACW, ATW, CDT, CDW, CTW). 2. Prune any candidate that has a.t least one infrequent subset. For example, CDT will be pruned because DT is not frequent. 3. Scan all transactions to obtain candidate supports. 47 These three steps are called iteratively from k=2 until no more new frequent itemset can be generated. A p r i o r i stores the candidates in a hash tree for fast support counting. In a hash tree, itemsets are stored in the leaves; internal nodes contain hash tables (hashed by items) which direct the search for a candidate. The hash tree structure of A p r i o r i is very efficient for candidate searching and insertion. Once a transaction is read in , all of its subsets can be quickly computed and inserted into the hash tree if they are not there already. Figure 3.12 gives an example of subset operation on the root of a candidate hash tree. Obviously, the bottom-up idea behind both A p r i o r and B U C is the same, ex- cept B U C searches the tree in a depth-first order while Apr io r searches in a breadth- first order. F rom this observation, we developped a C U B E algori thm wi th a similar hash tree structure as in A p r i o r , and exploit the breadth-first searching in C U B E computat ion exactly as in A p r i o i r . We kept the major structure of the A p r i o r i algo- r i thm and made only litt le modification to accommodate C U B E computat ion. 
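For reference, the candidate-generation and counting steps just listed can be sketched as follows (a schematic Python version under our own naming; the real Apriori stores candidates in the hash tree described above rather than a plain set):

```python
from itertools import combinations

def gen_candidates(freq_prev):
    """Steps 1 and 2: self-join F(k-1), then prune candidates with an infrequent subset."""
    prev = set(freq_prev)
    k = len(freq_prev[0]) + 1
    joined = set()
    for a in freq_prev:
        for b in freq_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:   # join itemsets sharing a (k-2)-prefix
                joined.add(a + (b[-1],))
    return [c for c in joined
            if all(s in prev for s in combinations(c, k - 1))]

def count_support(candidates, transactions):
    """Step 3: one scan over the transactions to count every surviving candidate."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    return counts

f2 = [("A", "C"), ("A", "T"), ("A", "W"), ("C", "D"),
      ("C", "T"), ("C", "W"), ("D", "W"), ("T", "W")]
print(sorted(gen_candidates(f2)))
# ('C', 'D', 'T') is pruned because ('D', 'T') is not frequent, as in the example above
```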
For example, since C U B E doesn't assume only a value for each attr ibute (item in A R M ) , we built a global index table which counts all values of all attributes as items. For a small da ta set, this algori thm is feasible. However, its performace was proved unsatisfactory. Breadth-first searching creates too many candidates to be maintained in the hash tree. Th i s is mainly because the global index table contains too many items, exactly the sum of the cardinalities of all C U B E attributes. Th is creates a large amount of candidates. If the C U B E is sparse, the situation is even worse. Al though we can count on pruning to eliminate many candidates, the hash tree is st i l l a huge burden before pruning, and quickly consumes all available memory. Unfortunately, we had to admit this attempt failed. Since the performance of this hash tree based algorithm lags far behind other algorithms, we omit it from the following discussion. 48 3.5.2 Hash Table Based Algori thm After we finished the implementation of A S L , we tried to use the hash table as an alternative data structure for A S L , to see whether better preformance could be achieved. Then the Affinity Hash Table based algori thm was developed, A H T for short. ; A s with P ipeHash , A H T uses hash tables to maintain cells of nodes in a lattice, group-bys. However, A H T avoids creating a hash table for each cuboid. Once subset affinity becomes applicable, it reuses the hash table created for the previous task. Specifically, A H T builds an index which makes it possible to collapse the previous hash table whenever subset affinity is found. For this purpose, each C U B E attribute is assigned several bits which, when combined, form the complete index of buckets in a hash table. For example, for a 3-dimensional C U B E with attributes A , B and C , we give A three bits, B two bits and C one bit. Then the hash tables index has 6 bits (in binary) and the size of the hash table wi l l be 26. Whenever a tuple (cell) is read in , its location in the hashtable is determined by its values for the C U B E attributes. In this example, for its index, the first three bits are decided by the value for A , the next two bits are decided by the value for B , and the last bit is decided by the value for C . The number of bits assigned to each attribute depends both on the cardinality of that attribute and on how many tuples are in the raw dataset . Originally, the bits assigned to an attribute X is log (card(X)), where card(X) is the cardinality of X . This implies the length of a hash table would be the product of the cardinalities of all attributes. However, if the data set is sparse, this product would be much larger than the size of the data set. In this case, the bits assigned to each attribute would shrink appropriately, in order to define a smaller index. A smaller index, however, may introduce collisions. Here we simply tradeoff memory occupation with run time. This tradeoff would introduce severe bucket collision when many cells need to be maintained by the hash table. It degrades A H T ' s performance severely, especially 49 A l g o r i t h m A H T 1. I N P U T : Dataset R cube dimensions { A i , . . . , Am}; min imum support Spt 2. O U T P U T : The 2 m cuboids of the data cube 3. P L A N : 4. Task definition: a cuboid in the cube lattice 5. Processor assignment: a processor is1 assigned the next task based on subset affinity 6. C U B E C O M P U T A T I O N (for a processor): 7. parallel do 8. let the task be with dimensions A , - , . 
. . , Aj 9. if { A j , . . . , Aj} is a subset of the set of dimensions of the previous task, or the set of dimensions of the first task 10. let C denote the hash table from that task 11. call subset-collapse(C, Spt, Ai,..., Aj) 12. else call subset-newHashTable(i?, Spt, Ai,..., Aj) 13. end do 14. Subroutine subset-collapse(C, Spt, Ai,..., Aj) 15. Collapse C based on A , - , . . . ,.Aj 16. Wri te out the cells if the support threshold is met 17. Subroutine subset-newHashTable(C, Spt, Ai,..., Aj) 18. initialize a hash table i? 19. for each tuple in C do 20. find the right cell in H (created if necessary) 21. update the aggregate and the support counts accordingly 22. end for 23. Traverse H , and write out the cells if the support threshold is met Figure 3.13: A Skeleton of A H T 50 when problem size increases or a high dimensional C U B E need to be computed. We wil l discuss this further in Chapter 4. A s A S L , A H T also takes each group-by as a task. A H T ' s task scheduling is almost the same as A S L , except A H T does not prcocess prefix affinity differently from general subset affinity. If a new task's G R O U P B Y attributes make a subset of those of the previous task, then the hash table already built contains al l cells needed for the new task. So, we wil l create no new hash table but shrink the existing one by collapsing some buckets. Further to the example mentioned above, if we've built the hash table for cuboid A B C , we now get a new task for cuboid A C . The buckets x x x 00 x, x x x 01 x, x x x 11 x and xxx 10 x are collapsed into x x x 00 x, wi th the aggregate and the support upgraded at x x x 00 x. Those attributes missing from the new task (but found in the previous one) determine how many and what buckets wi l l be collapsed. In this example, C is the missing attr ibute. Since two bits (the forth and the fifth in the index) are assigned to C , then four buckets wi l l be collapsed into one bucket. Since the hash table does not maintain cells in any particular sorting order, no sorting is needed in A H T . If a sorted cuboid is required by users, the sorting wi l l be done online when users give their queries. We call this post-sorting. The skeleton of A H T is shown in Figure 3.13. 51 Chapter 4 Experimental Evaluation In this Chapter , we give a performance comparison of five algorithms: R P , B P P , A S L , P T and A H T . The hash tree based algorithm is not included in this testing nor in the following discussion, because its performance lags far behind the others. In order to give, a fair evaluation, we investigate the algori thms' memory occupation first before explaining the testing environment, and then give our test results. 4.1 Memory Occupation In four of the algorithms: R P , A S L , P T and A H T , the raw data set is replicated among processors. Conversely, B P P partitions the raw data set and distributes the partit ions among processors. Let 's first discuss data replication based algorithms. In the simplest algori thm, R P , each processor loads the whole replicated data set, i? , into its main memory as a large array for later computat ion, according to the task assigned to it. R P therefore only needs a space the size of R, in the main memory for each processor. Another data, replication algori thm, P T , is also an array based algori thm. Like R P , its memory footprint is not much larger than R for each processor. A H T uses hash tables as its data structure only to maintain cuboids. 
Since the cells in a cuboid can be fewer than the tuples in the data set, a hash table may be much smaller than the data array used in an array-based algorithm. Besides the cells, AHT also needs to keep the index table for the hash table in memory. The index table is fixed-size in AHT; in other words, the number of buckets in the hash table is fixed. This number greatly affects AHT's performance. In order to make the experimental evaluation reliable, we set the number of buckets in the hash table to the number of tuples in R. Therefore, AHT's memory footprint is not much more than R. In an extreme case, such as where a cuboid contains all tuples in the raw data set, each processor running AHT needs space in its main memory for |R| cells, plus the |R| indices for a hash table.

The memory footprint of ASL is the biggest of all the algorithms. It uses skip lists as its data structure. The memory overhead for each node of a skip list is mainly decided by the maximum number of forward links a node is allowed to have. In our algorithm, we allow no more than 16 forward links per node. Therefore, a node's memory footprint is no more than twice the size of an element of an array in array-based algorithms such as RP. Like AHT, ASL does not load the entire data set into memory, but only maintains cuboids as skip lists. Thus, a skip list may be smaller than a data array. Even in an extreme case, where a cuboid contains the whole data set, its skip list would be no more than twice the size of R. In addition to the current working skip list, each processor maintains a "root" skip list in its main memory to maximize sort sharing among local tasks. Even in that extreme case, then, ASL's memory footprint will be no more than four times the size of R, for the two skip lists held in the memory of each processor.

The data-partitioning algorithm, BPP, is the most memory-efficient. Since each processor only works on local chunks, its memory footprint is the maximum size of its local chunks. Even in an extreme case, where only one chunk gets produced when range partitioning on an attribute, the memory footprint would be no more than R.

4.2 Experimental Environment

The experiments were carried out on a heterogeneous PC cluster, consisting of eight 500MHz PIII processors with 256M of main memory and eight 266MHz PII processors with 128M of main memory. Each machine has a 30Gbyte hard disk and is connected to a 100Mbit/sec Ethernet network. The CUBE computations were performed on a weather data set containing weather conditions reported by various weather stations on land. The data set is the same as that used by Ross and Srivastava [14], and Beyer and Ramakrishnan [4]. It has 20 dimensions, and is very skewed on some of those dimensions. For example, partitioning the data on the 11th dimension produces one partition which is 40 times larger than the smallest one.

In order to compare the effects of varying the different parameters of the problem, we used a fixed setting and then varied each parameter individually. The fixed setting, or baseline configuration, for testing the algorithms was the following:

• the eight 500MHz processors;

• 176,631 tuples (all from real data);

• 9 dimensions chosen arbitrarily (but with the product of the cardinalities roughly equal to 10^13);

• minimum support set at two.

For the dynamic scheduling algorithms ASL, PT and AHT, we overlapped the manager and one worker on one processor.
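As a purely illustrative sketch of this arrangement (the names, the thread-based setup and the shared task queue are our own simplifications, not the thesis implementation), the manager's node can run a worker loop alongside its scheduling duties so that its processor is never idle:

```python
import queue, threading

tasks = queue.Queue()
for cuboid in [("A", "B", "C"), ("A", "B"), ("A", "C"), ("B", "C"), ("A",), ("B",), ("C",), ()]:
    tasks.put(cuboid)                     # higher-dimensional cuboids are handed out first

def worker(name):
    # Pull cuboid tasks until the manager's queue is empty.
    while True:
        try:
            cuboid = tasks.get_nowait()
        except queue.Empty:
            return
        print(f"{name} computes cuboid {cuboid}")   # placeholder for the real computation

# The manager's node also runs a worker thread, so its processor stays busy
# between scheduling decisions.
colocated = threading.Thread(target=worker, args=("manager-node",))
colocated.start()
worker("other-node")                      # in the real system this runs on another machine
colocated.join()
```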
Overlapping the manager and a worker in this way maximized the usage of the processor on which the manager resided, leading to a reasonable performance evaluation.

In the experiments, we investigated how the algorithms perform under different circumstances. We are concerned with the following issues in CUBE computation:

• load balancing, tested by comparing the loads on each processor;

• scalability with processors, tested by varying the number of processors;

• scalability with problem size, tested by varying the problem size;

• scalability with dimensions, tested by varying the number of dimensions;

• pruning effects, tested by varying the minimum support;

• accommodation of sparse CUBE computation, tested by varying the sparseness of the data set.

In the following figures, "wall clock" time means the maximum time taken by any processor. It includes both CPU and I/O cost.

4.3 Load Distribution

Figure 4.1 shows the load distribution among processors when testing on the baseline configuration. ASL, AHT and PT have quite an even load distribution, while the loads distributed to each processor by RP and BPP vary greatly.

For RP, the uneven load distribution is due to its static task assignment. Although the number of tasks is approximately equal, the amount of computation and I/O per task differs significantly. For BPP, the dataset is partitioned statically across all nodes. Because the data is very skewed on some of the dimensions, the computation is not well balanced.

ASL, AHT and PT decrease the granularity of the tasks to a single cuboid in ASL and AHT, and to a small subtree in PT. The finer granularity leads to better load balancing, and the use of demand scheduling makes it easier to maintain balance even when the dataset is very skewed.

Figure 4.1: Load Balancing on 8 Processors (load on each parallel computing node)

Figure 4.2: Scalability (speedup comparison)

4.4 Varying the Number of Processors

Figure 4.2 shows the performance of the algorithms when running on different numbers of processors. Performance is largely determined by each algorithm's load balancing ability; generally, the better the load balance, the better the performance.

RP's performance is the worst, no matter how many processors are used. Besides poor load balancing, RP's depth-first writing strategy exacerbates its poor performance.

BPP does well when running on only 2 processors, where the data partitioning is done quite evenly. However, as more processors are added to the computing environment, the data partitioning becomes uneven. Uneven tasks with coarse granularity quickly upset load balancing, and BPP is outperformed by ASL once four processors are available.

The performance of ASL is poor when run on only two processors. This is largely due to the overhead of creating and maintaining skip lists. When the number of processors increases, ASL gains from good load balancing and scales very well. AHT's performance is similar to ASL's, because their task definition and scheduling are almost the same. PT shows the best performance overall, due to both good load balancing and pruning.

ASL, AHT and PT use affinity scheduling to take advantage of sort sharing to reduce computation.
A s we mentioned in section 3.3.2, the affinity relationship among local tasks on one processor tends to weaken as the number of processors increases. It is interesting to note that the speedup from eight processors to sixteen processors is negligible, relatively. 4.5 Varying the Problem Size Figure 4.3 shows that with increasing problem size, P T and A S L do significantly better than other algorithms. Bo th P T and A S L appear to grow sublinearly as the number of tuples increases. Th is is due to two factors. F i r s t , there is an overhead when creating the 2 9 cuboids, which is independent of the amount of data. Second, doubling the number of tuples does not change the cardinality of the dimensions (except for the date field) and does not imply twice the amount of I / O , since more aggregation may take place. It is possible to use more processors to solve a fixed problem faster or to solve a larger problem in the same amount of t ime. The results in Figure 4.3 show- that P T and A S L scale well with problem size and indicate that these algorithms could be used, given sufficient memory and disk space, to solve larger problems on larger cluster machines. 58 Varying Size of Dataset 11001 1 1 1 0 200 400 600 800 1000 1200 Number of Tuples in Dataset (*K) Figure 4.3: Results for varying the dataset size 59 Varying the Number of Cube Dimensions 50001 1 1 1 1 1 1 r Number of Cube Dimensions Figure 4.4: Results for varying the Number of Cube Dimensions Unl ike other algorithms, A H T scales unpredictably with problem size. O n one hand, this is because collision within a bucket tends to happen more often, as more and more cells are maintained by hash tables. Th is damages A H T ' s per- formance severely. On the other hand, the data distr ibution in the raw data set dramatical ly affects how many collisions may occur. This leads to inconsistent scal- ability in A H T . 4.6 V a r y i n g the N u m b e r of D imens ions Figure 4.4 shows the effect of increasing the number of dimensions on each algo- r i thm. The wall clock time increases dramatical ly as the number of dimensions' increases, because the number of cuboids grows exponentially with dimension size. 60 For example, the 13-dimensional C U B E has 8,192 cuboids. The scalability of A H T with C U B E dimensions is the worst of all the al- gorithms. In fact, in our testing, when the number of C U B E dimensions is set as 13, the hash table size was fixed as ten times the size of the input data set, that is ten times larger than that in the baseline configuration. Even then, A H T ' s perfor- mance is very poor. There are two main reasons contr ibut ing to this effect. F i r s t , as high dimensional C U B E needs to be computed, a large number of cells need to be maintained in the hash table. Th is introduces a great amount of collisions within in buckets during insertion and searching operations. Second, since the size of the hash table is fixed, the index bits assigned to each C U B E attribute are far from adequate to appropriately collapse the hash table when subset affinity is applicable. If the data set is skewed on some C U B E attributes, the hash function behaves even poorer. The relative performance for the other four algorithms remains the same except for A S L , where for thirteen dimensions it stops being better than B P P . A S L is affected more than other algorithms because of its comparison operation. 
The comparison operation used to search for and insert cells into the skip list becomes more costly as the length of the key increases. The length of the key grows linearly with the number of dimensions. This is a significant source of overhead for ASL.

Figure 4.4 also shows that when the number of dimensions is small, RP, ASL, AHT and PT all give similar performance. Because the size of the output is small for a small number of dimensions, the simple RP algorithm can keep up with the others.

4.7 Varying the Minimum Support

Figure 4.5 shows the effect of increasing the minimum support. As the minimum support increases, there is more pruning and, as a result, less I/O. The total output size for the algorithms given in Figure 4.5 starts at 469Mbyte for a support of one, 86Mbyte for a support of two, 27Mbyte for a support of four, and about 11Mbyte for a support of eight. After eight, very little additional pruning occurs. Except between one and two, the output size does not appear to have much effect on overall performance. This is surprising, since we expected PT to do better as the support increased, because more pruning should have led to less computation. The relative flatness of the curve for PT is largely due to the order of the dimensions chosen. For the baseline configuration, the pruning occurs more towards the leaves, where it does not save as much in computation time.

Figure 4.5: Results for varying the minimum support

Notice that ASL and AHT cannot prune during computation; their better performance with higher minimum support is due only to lower I/O cost, not to pruning.

4.8 Varying the Sparseness of the Dataset

Figure 4.6 shows the effect of the sparseness of the data set on the four algorithms. We consider a data set to be sparse when the number of tuples is small relative to the product of the number of distinct attribute values for each dimension in the CUBE. Since the number of tuples in the baseline configuration is fixed, we can vary the sparseness of the data set by choosing smaller-cardinality dimensions over larger-cardinality ones. The three data sets chosen for Figure 4.6 consisted of the nine dimensions with the smallest cardinalities, the nine dimensions with the largest cardinalities, and one in between. Note that even for the smallest of the three, there are still millions of possible total cells in the cube.

As shown in Figure 4.6, AHT is apparently more affected by sparseness than the other algorithms. The more CUBE dimensions, the more collisions happen, which badly hampers AHT's performance. If few collisions occur, as when dimensionality is low, AHT outperforms all others.

Figure 4.6: Results for varying the sparseness of the dataset

Situations                   PT    ASL   RP    BPP   AHT
dense cubes                        ✓                 ✓
small dimensionality (< 5)   ✓     ✓     ✓     ✓     ✓
high dimensionality          ✓
less memory occupation                         ✓
otherwise                    ✓
online support                     ✓

Figure 4.7: Recipe for selecting the best algorithm

AHT and ASL perform well on dense datasets and are more adversely affected by sparseness than the others. ASL performs well for dense datasets because each cuboid contains relatively few cells, which makes searching or inserting into a skip list relatively fast. The BUC-based algorithms have little opportunity to take advantage of density.
In fact, the denser the dataset, the less pruning can be done. A s a result, while traversing the lattice, the B U C - b a s e d algorithms need to sort almost the entire dataset for many of the cuboids. B P P does particularly poorly for cube dimensions with small cardinalities because B P P cannot part i t ion the data very evenly, which leads to serious load imbalance. A S L does worse than the B U C - b a s e d algorithms when the product of the cardinalities is high, partly because of the amount of pruning that occurs for the BUC-based algorithms, and partly because A S L has to maintain larger skip lists. 4.9 S u m m a r y 4.9.1 Recipe Recommended The experimental results shown thus far explores the different parameters affecting overall performance. After careful examination, we recommend the "recipe" shown in Figure 4.7 for selecting the best algori thm in various situations. It is clear that A H T and A S L dominate all other algorithms when the cube is dense, or when the total number of cells in the data cube is not too high (e.g., < 10 8 ) . However, A H T is more adversely affected by sparseness and dimensionality. For data cubes with a small number of dimensions (e.g., < 5), almost all algorithms 65 behave similarly. In this case, R P may have a slight edge in that it is the simplest algori thm to implement. For all other situations, except when the data cube has a large number of dimensions, P T , A H T and A S L are relatively close in performance, wi th P T typically a constant factor faster than A H T and A S L . For cubes of high dimensionality, there is significant difference among the three, and P T should be used. The last entry in Figure 4.7 concerns online support. Th is is the topic of the next section. 4.9.2 Further Improvement There is st i l l room for improvement in some of the algorithms. W i t h the affinity scheduling, the current prefix and subset affinity can be expanded to cooperate with the sorting overlap idea behind the Overlap algori thm, mentioned in Chapter 2. Therefore, even if we can not assign a task to a processor wi th C U B E dimensions perfectly prefixing the previous task, we can try to assign a task with the longest possible prefix of the previous task. This may improve the performance of A S L . For A H T , we can attemp more sophisticated hash function instead of the naive M O D hash function currently use. A better hash function may relieve A H T ' s struggling performance when faced with sparse and high dimensional C U B E com- putat ion. 66 Chapter 5 Online Aggregation Recall that the C U B E computat ion is just a precomputation designed to instantly respond to online iceberg queries. However, sometimes a user's query can not be answered by the precomputed C U B E . W h e n the min imum support for the online query is lower than that for the precomputation, it is no longer possible to compute a query, essentially a cuboid, from a precomputed cuboid. This problem can be solved in two ways. F i r s t , we can choose a small mini- mum support for the precomputation, therefore, most of the queries can be answered by aggregating from a precomputed cuboid. Second, we can simply aggregate from the raw data set to answer an unpredictable query online. In the following sections, we discuss issues concerning these two separate methods. 5.1 Selective Mater ia l i za t ion C U B E with low constraints usually produces a large body of result for which the computation may take a long time and also may not be sa.ved to disk entirely. 
To solve this problem, it is natural to consider selecting only one set of cuboids to materialize instead of all the available cuboids. Al though our experiments show that in many cases, our parallel algorithms can do well in computing the entire 67 iceberg-cube query from scratch (e.g., < 100 seconds), for t ruly online processing, selective material ization can st i l l help significantly. A s an exercise, we compared two different plans for answering online queries using A S L . The first plan is to simply re-compute the query based on the specified minimum support. If the min imum support was two, as in Figure 4.5, A S L would take approximately sixty seconds to complete the entire C U B E . The second plan consists of a precomputation stage and an online stage. In the precomputation stage, A S L computes only the leaves of the traversal tree using the smallest min imum support (i.e., 1). In the online stage, A S L uses top-down aggregation and returns those cells satisfying the new specified support. In this second stage, A S L can make returns almost immediately; and interestingly, even for the precomputation, it only took fifty seconds for the same example. (The value of fifty seconds was obtained from our addit ional experiment, not directly from Figure 4.5. The values in Figure 4.5 include the total time for the nodes in the tree, not just the leaves.) This suggests that even simple selective materialization can help. It is a topic of future work to develop more intelligent materialization strategies. 5.2 On l ine Aggrega te f r o m a R a w D a t a Set Besides selective materialization, in this thesis, we also consider computing online aggregates from a raw data set. Thus, we manage to provide a comprehensive solution for the iceberg query problem. Hellerstein, Haas and Wang proposed an online aggregation framework [11], in which a sampling technique is applied for instant response and further progressive refinement. We took this framework for our online aggregate algorithm to allow a user to observe the progress of a query and dynamically direct or redirect the computat ion. In the case of an iceberg query, the user would see a rough ini t ia l cuboid which would become more accurate as more tuples are processed. 68 Like A S L , we took a skip list as the fundamental data structure, making it possible to construct a cuboid by incrementally inserting tuples into the skip list. Each tuple can therefore be handled independently. In terms of incremently building a cuboid, the hash tables used in A H T provides a good alternative. However, since its performance is too sensitive to dimensionality and data sparseness, as viewed in Chapter 4, a hash table does not make a good da ta structure for the online aggregate algori thm. The array based algorithms, R P , B P P and P T , are also difficult to be extended to handling online issues, mainly because an array does not efficiently support incrementally insertion. Once query results from new data are computed, they then have to be merged with the results from the old data. Merg ing operations introduce addit ional overhead and do not support parallelism well. In fact, the online advantages of A S L over other algorithms was one of the main motivations for its development. In the following section, we present our Parallel OnLine aggregation algorithm ( P O L ) . 5.3 Parallel Online Aggregation 5.3.1 Data Partitioning and Skip List Partitioning Online aggregation implies only one group-by need be computed. 
Usually, comput- ing one group-by is not time consuming. The computation is much smaller than computing C U B E . To necessitate parallel computat ion, we assume the raw data set is huge, shown in two aspects. F i r s t , a raw data set is range-partitioned across processors without any sorting. If there are n processors in a cluster, n partit ions, Ri to Rn, are produced; processor j gets Rj. Second, neither a processor can load its local da ta parti t ion entirely into its main memory. A processor has to proceed the computat ion step by step; at each step, one block of data from its local da ta part i t ion is loaded and computed. The data block is in fact a sample taken from the unprocessed part of the processor's local da ta part i t ion. 69 Located on Pi Located on P2 Located on P3 Located on P4 Passed to Pi (Chunkn) (Chunky) (Chunky) (Chunky) Passed to P2 (Chunk2\) (Chunk'n) (Chunk23) (Chunk 24) Passed to P 3 (Chunksi) (Chun (132) {Chunk33) (Chunky) Passed to P4 (Chunk^i) (Chunky) (C'hunk 43) (Chunk44) Table 5.1: Task A r r a y for 4 Processors In order to utilize all available machines in a cluster, P O L range-partitions a skip list to n partitions as well, where n is the number of processors. Each processor, therefore, maintains only one skip list par t i t ion. P O L determines boundaries of the skip list partitions assigned to different processors at the beginning of its computa- tion through sampling. Afterward, a processor is only responsible for searching or inserting cells into its skip list part i t ion as delimited by boundaries. A s a processor scans its local da ta part i t ion, since it is unsorted, the processor finds tuples which should be inserted into skip list partitions maintained by other processors. In such a case, the processor then passes the tuples to other processors appropriately. If there are n processors in the cluster, one processor might pass (n- l ) / n of its local data to other processors. The overhead from data communication is then introduced. 5.3.2 Task Definition and Scheduling A s mentioned above, P O L proceeds with computation step by step. W i t h i n a step, each processor computes a block of data, and data commutat ion takes place among processors when necessary. P O L guarantees that one block of data is loaded only once. Only after all processors complete computing on the tuples in this block, does the loading processor discard the block and move to the next step. Therefore, pro- cessors proceed with their computat ion synchronously, and synchronizations happen amongst processors between every two steps. Tasks are defined in P O L for each step, that is, between synchronizations. 70 Synchronization ' Tasks for P1, from One Step £ h u n k j l O i u n k l i r fJiunk3i Chunk4jJ I Tasks for P2, from ( j Chunk22 Chunk32 Chunk42 3! | Tasks for P3, from C h u n k l . C 3 Chunk23 C h u n k 3 3 ^ B Chunk43 I Tasks for P4, from Cg^^ Figure 5.1: Tasks Assignment in P O L Suppose at one step, after the processor P 8 loads in a data block from its local data part i t ion Rt, it groups the tuples in the block into n chunks, Chunky to Chunkni, according to the parti t ion boundaries set for the skip list partit ions, where n is the number of processors. Note that Chunk ji indicates a chunk, which although located in processor Pt, wi l l be passed to processor Pj to maintain F j ' s skip list parti t ion. Therefore, for Pu all but one chunk(Chunka) are passed over network and checked by other processors. 
For a cluster wi th 4 processors, the task array created for one step is shown in Table 5.1. Since there are n processors and each processor has n chunks, n x n chunks are produced in total . These chunks correspond to n X n tasks, indicated as t&Bk{Chunkji)(both i and j from 1 to n). Task(Chunkji) is the computation based on chunk Chnnkji- Notice that at each step, tasks have to be redefined. Tasks in different steps are separately scheduled. Like some of C U B E algorithms, in P O L , a manager responsible for task 71 scheduling, and many workers responsible for computing(computing aggregations in this case). The manager ini t ia l ly assigns a number of tasks to each processor. How- ever, once a processor finishes its assigned tasks, it can then help other processors finish their tasks. Originally, processor Pi, are assigned task(Chunkij) (j is from 1 to n). For ex- ample, wi th four processors, P 2 is required to finish t a sk (C7 i tmA:2i ) , task(C7imiA:22), task(Chunk23) and task(C hunk24). For some of these tasks, P 2 has to fetch appro- priate chunks located on other processors. P2 then needs Chunk2\ on Pi to finish tdisk{Chunk2\), Chunk23 on P 3 to finish task(Chunk23), Chunk24 on P4 to finish task(Chunk24) and local C'hunk22 to finish task(C/i?mA:22). The sequence for proces- sor Pi to compute its assigned tasks is this: from task(C/i?mA;j,-) to task(Chunkin), it then wraps back from task(C 'hunkn) to task(Chunki^_1)). Th is sequence maxi- mizes the possibility of each processor working on data located on different proces- sors at one time, thus reducing the possibility of a burst of da ta requests happening on a particular processor. Figure 5.1 illustrates the original task assignment in P O L for a computing environment consisting of 4 processors. To balance the load, a processor is allowed to offload wait ing tasks from busy processors after it has finished its own assigned tasks. The manager tries to assign to it those untouched tasks that the processor keep the input data chunk in local. The processor then compute a new skip list for the task. Once it finishes this task, or gets a data request from the processor responsible for the task, it passes the skip list it has already built on to that processor. Then , that processor merges the skip list wi th its local skip list part i t ion. Apparently, this method of task scheduling does not introduce additional data communication overhead. To provide a constant update of query results, a set t imer periodically gives response back to the user. Whenever the timer expires, the manager collects results from all workers, displays the results on screen, and resets the timer. If the user wishes to discontinue the computat ion, he or she can interrupt it at any point. 72 1. A lgor i thm P O L 2. I N P U T : Range-parti t ioned data sets, each located on one processor (processor Pi has part i t ion Rl;) G R O U P B Y dimensions { A i , A2,.. -Am} and min imum support Spt 3. O U T P U T : The iceberg query results 4. O N L I N E A G G R E G A T E : 5. The manager takes a sample, and determines the boundaries of skip list partit ions assigned to each. processor 6. parallel do 7. while (not all data has been processed) 8. if (worker processors Pi) 9. loads in one block the samples from its local part i ton which have not been processed 10. it then groups the samples into n chunks, Chunkn, to Chunkni 11. calls on\'me-s\a,ve(Chunkii,... ,Chunkni, Spt, Ay,..., Am) 12. if(the manager) 13. defines n X n tasks, each for one chunk on workers 14. 
schedules the tasks, as described in Section 5.3.2 15. synchronize 16. end while 17. end do 18. Subroutine onl ine-slave(C/i?m/ci j ' , . . . , Chunkni, Spt, A\,..., Am) 19. gets a task from the manger; if there is no uncompleted task at the manager, return 20. if the task is Task(Chunkij) 21. asks processor Pj for the chunk Chunky 22. updates the local skip list based on Chunkij 23. if the task is Task(C7itmfc7-,-) 24. computes a new skip list from the chunk Chunkji 25. sends the skip list to Pj, then Pj merges it into its local skip list pari t i ton. 26. during processing, if any request comes from another process asking for a data chunk, sends it to the processor 27. during processing, if any request comes from the manager for current result, estimates current minimum support, collects result and send them to the manager 28. during processing, if any request comes from the manager for stopping the computation, return Figure 5.2: A Skeleton of the P O L Algo r i t hm 73 A skeleton of algorithm P O L is shown in Figure 5.2. 5.4 Exerimental Evaluation The testing environment for P O L is similar to that for the C U B E algorithms, except that we based our experiments on a larger weather dataset, which contains 1,000,000 tuples. Although the data set is larger, it has the same number of dimensions as the smaller one used for testing the C U B E algorithms. We focused on the following issues in P O L during the experiments: • scalability with the number of processors; • scalability with the buffer size on each processor. 5.4.1 Varying the Number of Processors Figure 5.3 shows the performance of P O L with different numbers of processors. In testing, a 12-dimensional iceberg query was answered online. The minimum support was set as 2 and the buffer on each processor was set to contain 8000 tuples at each step. The computation created a huge skip list with 924,585 nodes. The performance of P O L was tested on three clusters of machines: • Clusterl consists of eight 500MHz PIIL processors with 256M of main memory connected by an Ethernet network; • Cluster2 consists of eight 266MHz PII processors with 128M of main memory connected by an Ethernet network; ' • Cluster3 consists of eight 266MHz PII processors with 128M of main memory connected by a higher speed network, Myrinet, which is approximately three times faster than the Ethernet used in the first two clusters. Data communication among worker processors is the main factor affecting POL's performance. If the data distribution is uniform, for each processor nearly 74 150 Varying the Number of Processors —100 E t- O 50 ciusten —e- Cluster2 —*- Cluster3 — H 4 5 6 Number of Processors Figure 5.3: POL ' s Scalability with the Number of Processors (n — l)/n of data needed are located on other processors, where n is the number of processors. Apparently, the higher n is, the more data needs to be transfered over the network. However, adding more machines decreases the computations carried out at each processor because the work load is shared. Therefore, whether we can achiever better overall performance with more processors or not remains uncertain. It largely depends on how much the computation decreases or the communcation increases on each processor, and which is the dominating factor. Generally, more time spent on computation versus the less spent on communication, the better performance can be achieved. Computing high dimensional queries always implies more computation be- cause a large skip list needs to be maintained. 
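To see why the outcome is uncertain, a back-of-envelope model helps; the constants and the shared-network assumption below are ours, not measurements from the thesis:

```python
def pol_time_estimate(n, tuples, t_insert, t_net, shared_network=True):
    """Rough per-run cost model for POL (our own simplification, not from the thesis).
    Every tuple is inserted into exactly one skip-list partition, and roughly an
    (n-1)/n fraction of all tuples has to cross the network to reach its owner."""
    compute = tuples / n * t_insert              # insertion work is spread over n nodes
    remote = tuples * (n - 1) / n                # tuples that leave the node that read them
    if shared_network:                           # one shared Ethernet segment
        communicate = remote * t_net
    else:                                        # independent point-to-point links
        communicate = remote / n * t_net
    return compute + communicate

for n in (2, 4, 8, 16):
    print(n, round(pol_time_estimate(n, 1_000_000, t_insert=2e-6, t_net=1e-6), 3))
```

Under such a model, the communication term approaches a fixed cost as n grows while the insertion term keeps shrinking, so the speedup flattens unless the network is fast relative to the per-tuple computation.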
Figure 5.3 shows that the speedup achieved on Cluster2 and Cluster3 is better than that on Cluster1, mainly because on the clusters of slower machines the computation takes up a larger share of the total run time than on the cluster of fast machines.

Concerning load balancing, dynamic offloading of tasks from busy processors can balance uneven load resulting from data being unevenly distributed among the processors. However, if both the skip list partitioning and the data distribution are uneven, the load may still be poorly balanced. Fortunately, this adverse situation did not arise in our tests.

5.4.2 Varying the Buffer Size

The buffer size limits the amount of data processed at each step. The larger the buffer, the fewer steps POL needs, and thus the fewer synchronizations and sampling passes occur between steps. Synchronization and sampling both introduce overhead. Therefore, as shown in Figure 5.4, performance improves as the buffer size increases.

Figure 5.4: Scalability with Buffer Size
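The relationship between the buffer size and the amount of per-step overhead can also be made explicit. The short sketch below is a hypothetical illustration (the function name and the assumption of a perfectly even range partition are ours): with the setting used above, 1,000,000 tuples spread over 8 workers and a buffer of 8,000 tuples, each worker needs about 16 steps, and hence incurs 16 rounds of synchronization and sampling.

    import math

    def pol_step_count(partition_tuples, buffer_tuples):
        """Steps needed to stream one worker's partition through the
        per-step buffer; every step ends in a synchronization."""
        return math.ceil(partition_tuples / buffer_tuples)

    # 1,000,000 tuples evenly range-partitioned over 8 workers,
    # with an 8,000-tuple buffer per step:
    print(pol_step_count(1_000_000 // 8, 8_000))   # -> 16

Roughly speaking, doubling the buffer halves the number of steps, which is consistent with the improvement shown in Figure 5.4 as the buffer grows.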
Chapter 6

Conclusion

In this thesis we discussed a collection of novel parallel algorithms we developed for the online and offline creation of the CUBE to support iceberg queries. We evaluated the CUBE algorithms RP, BPP, PT, ASL and AHT across a variety of parameters to determine the situations in which each is best used. RP has the advantage of being simple to implement; however, except for cubes with low dimensionality, RP is outperformed by the other algorithms. BPP is also outperformed, but it reveals that breadth-first writing is a useful optimization. As an extension of BPP, PT is the algorithm of choice in most situations. There are, however, two exceptional situations where ASL and AHT are recommended: both are more efficient for dense cubes, and ASL in particular supports sampling and progressive refinement. For online aggregation, we tested our algorithm POL for aggregating online over a large data set. Experiments revealed that POL behaves well on a cluster of machines connected by high speed networks, and is valuable for answering high-dimensional online queries that require more time to complete.

In future work, we would investigate how the lessons we have learned regarding parallel iceberg query computation can be applied to other tasks in OLAP computation and data mining. These include (constrained) frequent set queries [24], and OLAP computation that takes correlations between attributes into account.

Bibliography

[1] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, New York, 1997.

[2] R. Agrawal, S. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 VLDB, pp. 506-521.

[3] E. Baralis, S. Paraboschi and E. Teniente. Materialized view selection in a multidimensional database. In Proc. 1997 VLDB, pp. 98-112.

[4] K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In Proc. 1999 ACM SIGMOD, pp. 359-370.

[5] M. Eberl, W. Karl, C. Trinitis and A. Blaszczyk. Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications. In Proc. 6th European Parallel Virtual Machine/Message Passing Interface Conference, LNCS vol. 1697, pp. 493-498, 1999.

[6] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani and J. Ullman. Computing iceberg queries efficiently. In Proc. 1998 VLDB, pp. 299-310.

[7] S. Goil and A. Choudhary. High Performance OLAP and Data Mining on Parallel Computers. In The Journal of Data Mining and Knowledge Discovery, 1(4), pp. 391-418, 1997.

[8] J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. In Proc. 1996 ICDE, pp. 152-159.

[9] H. Gupta, V. Harinarayan, A. Rajaraman and J. Ullman. Index selection for OLAP. In Proc. 1997 ICDE, pp. 208-219.

[10] V. Harinarayan, A. Rajaraman and J. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM SIGMOD, pp. 205-216.

[11] J. Hellerstein, P. Haas and H. Wang. Online Aggregation. In Proc. 1997 ACM SIGMOD, pp. 171-182.

[12] M. Kamber, J. Han and J. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 1997 KDD, pp. 207-210.

[13] Y. Zhao, P. Deshpande and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM SIGMOD, pp. 159-170.

[14] K. Ross and D. Srivastava. Fast Computation of Sparse Datacubes. In Proc. 1997 VLDB, pp. 116-125.

[15] S. Sarawagi. Explaining differences in multidimensional aggregates. In Proc. 1999 VLDB, pp. 42-53.

[16] A. Shukla, P. Deshpande and J. Naughton. Materialized view selection for multidimensional datasets. In Proc. 1998 VLDB, pp. 488-499.

[17] A. Srivastava, E. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithm. In The Journal of Data Mining and Knowledge Discovery, 3(3), pp. 237-262, 1999.

[18] M. Tamura and M. Kitsuregawa. Dynamic Load Balancing for Parallel Association Rule Mining on Heterogeneous PC Cluster Systems. In Proc. 1999 VLDB, pp. 162-173.

[19] S. Momen-Pour. Parallel Computation of Data Cubes. MSc thesis, University of British Columbia, Computer Science Dept., 1999.

[20] M. Zaki. Parallel and distributed association mining: a survey. In IEEE Concurrency, 7(4), pp. 14-25, 1999.

[21] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan and S. Sarawagi. On the Computation of Multidimensional Aggregates. In Proc. 1996 VLDB, pp. 506-521.

[22] W. Pugh. Skip Lists: A Probabilistic Alternative to Balanced Trees. In Communications of the ACM, 1990.

[23] E.-H. (Sam) Han, G. Karypis and V. Kumar. Scalable Parallel Data Mining for Association Rules. In Proc. 1997 ACM SIGMOD, Tucson, AZ, USA, May 1997.

[24] R. T. Ng, L. V. S. Lakshmanan, J. Han and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proc. 1998 ACM SIGMOD, pp. 13-24.

