Iceberg-cube Computation with PC Cluster

by

Yu Yin

B.Sc., Jilin University, 1993

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia

April 2001

© Yu Yin, 2001

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science

The University of British Columbia
Vancouver, Canada

Abstract

Iceberg queries constitute one of the most important classes of queries for OLAP applications. This thesis investigates the use of low-cost PC clusters to parallelize the computation of iceberg queries. We concentrate on techniques for querying large, high-dimensional data sets. Our exploration of the algorithmic space considers tradeoffs between parallelism, computation, memory and I/O.

The main contribution of this thesis is the development and evaluation of several novel parallel algorithms for CUBE computation and online aggregation. These include the following: one, the CUBE algorithm RP, which is a straightforward parallel version of BUC [BR99]; two, the CUBE algorithm BPP, which attempts to reduce I/O by outputting results in a more efficient way; three, the CUBE algorithms ASL and AHT, which maintain the cells of a cuboid in a skip list and a hash table respectively, and which put the utmost priority on load balancing; four, the CUBE algorithm PT, which instead balances load by using binary partitioning to divide the cube lattice as evenly as possible; and five, the online aggregation algorithm POL, based on ASL and sampling techniques, which returns an instant response and then progressively refines it.

We present a thorough performance evaluation of all these algorithms across a variety of parameters, including the dimensionality and the sparseness of the cube, the selectivity of the constraints, the number of processors, and the size of the data set. The key finding is that for CUBE computation, one algorithm does not fit all. We recommend a "recipe" which uses PT as the default algorithm, but which may also deploy ASL or AHT in appropriate circumstances. The online aggregation algorithm, POL, is especially suitable for computing a high-dimensional query over a large data set with a cluster of machines connected by high-speed networks.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction

2 Review
  2.1 Iceberg Query
  2.2 CUBE Operator
  2.3 Iceberg-cube Computation
  2.4 Sequential CUBE Algorithms
    2.4.1 Top-down CUBE Algorithms
    2.4.2 Bottom-Up CUBE Algorithm

3 Parallel Iceberg-cube Algorithms
  3.1 Algorithm RP
  3.2 Algorithm BPP
    3.2.1 Task Definition and Processor Assignment
    3.2.2 Breadth-first Writing
  3.3 Algorithm ASL
    3.3.1 Using Skip Lists
    3.3.2 Affinity Assignment
  3.4 Algorithm PT
  3.5 Hash-based Algorithms
    3.5.1 Hash Tree Based Algorithm
    3.5.2 Hash Table Based Algorithm

4 Experimental Evaluation
  4.1 Memory Occupation
  4.2 Experimental Environment
  4.3 Load Distribution
  4.4 Varying the Number of Processors
  4.5 Varying the Problem Size
  4.6 Varying the Number of Dimensions
  4.7 Varying the Minimum Support
  4.8 Varying the Sparseness of the Dataset
  4.9 Summary
    4.9.1 Recipe Recommended
    4.9.2 Further Improvement

5 Online Aggregation
  5.1 Selective Materialization
  5.2 Online Aggregation from a Raw Data Set
  5.3 Parallel Online Aggregation
    5.3.1 Data Partitioning and Skip List Partitioning
    5.3.2 Task Definition and Scheduling
  5.4 Experimental Evaluation
    5.4.1 Varying the Number of Processors
    5.4.2 Varying the Buffer Size

6 Conclusion

Bibliography

List of Tables

1.1 Key Features of the Algorithms
2.1 Example Relation R
5.1 Task Array for 4 Processors

List of Figures

2.1 Iceberg Query
2.2 CUBE Operation on Relation SALES [8]
2.3 Cube in Multi-dimensional Array Format [8]
2.4 Lattice and Processing Trees for CUBE Computation [4]
2.5 An Example of 4-Dimensional Lattice for Algorithm PipeSort [2]
2.6 An Example of Plan and Pipelines for Algorithm PipeSort [2]
2.7 PipeHash on a Four Attribute Group-by [2]
2.8 Examples for PartitionedCube and MemoryCube Algorithms [14]
2.9 A Skeleton of BUC
2.10 BUC Partitioning
3.1 A Skeleton of the Replicated Parallel BUC Algorithm
3.2 Task Assignment in Algorithm RP
3.3 Task Assignment in BPP
3.4 Depth-first Writing vs Breadth-first Writing
3.5 A Skeleton of the BPP Algorithm
3.6 I/O Comparison between BPP (Breadth-first Writing) and RP (Depth-first Writing)
3.7 Pictorial Description of Steps Involved in Performing an Insertion [22]
3.8 A Skeleton of ASL
3.9 Binary Division of the Processing Tree into Four Tasks
3.10 A Skeleton of PT
3.11 Frequent Itemsets and Strong Rules for a Bookstore Database [20]
3.12 Subset Operation on the Root of a Candidate Hash Tree [23]
3.13 A Skeleton of AHT
4.1 Load Balancing on 8 Processors
4.2 Scalability
4.3 Results for Varying the Dataset Size
4.4 Results for Varying the Number of Cube Dimensions
4.5 Results for Varying the Minimum Support
4.6 Results for Varying the Sparseness of the Dataset
4.7 Recipe for Selecting the Best Algorithm
5.1 Task Assignment in POL
5.2 A Skeleton of the POL Algorithm
5.3 POL's Scalability with the Number of Processors
5.4 Scalability with Buffer Size

Acknowledgements

I would like to express my gratitude to my supervisor, Dr. Alan Wagner, and to Professor Raymond Ng, for helping me carry out this research project, and for reading the manuscript of my thesis and offering their valuable comments. I also would like to thank Kirsty Barclay for reading the manuscript of my thesis and providing me with her helpful comments.

Yu Yin

The University of British Columbia
April 2001

Chapter 1

Introduction

As computing and the Internet advance, we see a massive increase in the raw data available to institutions, corporations, and individuals. For example, large numbers of radiological images have been generated in hospitals, and immense product and customer databases have been accumulated [1]. Extracting meaningful patterns and rules from such large data sets is therefore becoming more and more important. In this context, On-line Analytical Processing (OLAP) has emerged as a powerful tool for data analysis. In decision support systems, OLAP enables analysts and managers to obtain insight into data. By interactively posing complex queries, they can extract different views of the data.

In many OLAP applications, aggregation queries constitute a large percentage of the computation. Many of these queries are concerned only with finding aggregate values above some specified threshold. We call this kind of query an "iceberg query". Query results consisting of above-threshold aggregate values are typically small compared to the total input data (the iceberg).

Through Structured Query Language (SQL) aggregate functions and the GROUP BY operator, OLAP applications can easily produce aggregates for one group-by; however, most applications need aggregates for a set of group-bys in order to gain more insight into the data. This necessitates a generalization of the GROUP BY operator. The CUBE operator, defined by Gray et al. [8], provides this generalization. It computes aggregates for every possible combination of a set of specified attributes. For instance, if the CUBE operator is applied on two attributes, A and B, then the aggregates from GROUP BY on all, GROUP BY on A, GROUP BY on B, and GROUP BY on AB are returned together. The computation introduced by the CUBE operator is huge, because for d specified attributes, 2^d GROUP BYs are computed. Furthermore, within each cuboid there are also numerous cells, or partitions, to compute. In CUBE terminology, the output of an n-attribute GROUP BY is called an n-dimensional cuboid, also called an n-dimensional group-by. When the CUBE operator is employed to answer a set of iceberg queries, we call it an "iceberg-cube".

In this thesis, we investigate algorithms for answering iceberg queries, especially for iceberg-cube computation. Recently, several algorithms have been proposed, including the PipeSort and PipeHash algorithms proposed by Agrawal et al.
[2], the Overlap algorithm proposed by Naughton et al. [21], the PartitionedCube algorithm proposed by K. Ross and D. Srivastava [14], and the Bottom-Up Cube algorithm (BUC) proposed by Beyer and Ramakrishnan [4]. All of these algorithms except BUC are general CUBE computation algorithms, in that they do not specifically target iceberg-cube computation. They proceed in a top-down fashion, that is, computing from higher-dimensional group-bys to lower-dimensional group-bys, and many of them try to reuse earlier sorts during the top-down traversal. BUC provides another efficient solution, specifically for iceberg queries with thresholds. It proceeds in a bottom-up fashion, trying to prune tuples which do not satisfy the threshold as early as possible, in order to reduce computation. We discuss these two kinds of algorithms in more detail in Chapter 2.

We based our work on the algorithms mentioned above; however, we were interested in providing parallel solutions. The previous CUBE algorithms were mainly designed to run on stand-alone machines and to execute on a single processor — so-called sequential algorithms. In this thesis, we propose several parallel algorithms for answering iceberg queries, and promote the benefits of using distributed computing platforms to solve such problems. Our underlying architecture is a dedicated cluster of PCs. With elegant parallel algorithms, these machines have the potential to achieve the performance of massively parallel machines at a much lower cost. We focused our work on practical techniques that can be readily implemented on low-cost PC clusters using open source software: Linux and public domain versions of the MPI message passing standard.

To improve the response time of iceberg queries, two different solutions are explored: precomputation and online querying. Precomputation is a common technique used by many OLAP applications. Usually, precomputation evaluates a CUBE operator, extracting multiple aggregates and saving the results on disk. It supports instant responses if the precomputed results match a user's queries. Towards efficient iceberg-cube precomputation with PC clusters, this thesis explores different trade-offs between parallelism, computation and I/O. Assuming the input data set fits in main memory on each machine of the cluster, we developed several novel parallel algorithms for iceberg-cube computation, and we give a comprehensive evaluation of them in this thesis. Here is a summary of the parallel algorithms:

• Algorithm RP (Replicated Parallel BUC) is a straightforward parallel version of BUC. It is simple and introduces little overhead beyond its sequential version. However, RP is poor at distributing tasks and balancing the workload. In an attempt to achieve better load balancing, algorithm BPP (Breadth-first writing, Partitioned, Parallel BUC) was developed. BPP differs from RP in two key ways. First, the dataset is range-partitioned and distributed, rather than replicated as in RP; second, the output of cuboids is done in a breadth-first fashion, as opposed to the depth-first writing in RP and BUC. Table 1.1 summarizes the key features of the algorithms.
Table 1.1: Key Features of the Algorithms

  Algorithm | Writing Strategy | Load Balance | Relationship of Cuboids | Data Decomposition
  RP        | depth-first      | weak         | bottom-up               | replicated
  BPP       | breadth-first    | weak         | bottom-up               | partitioned
  ASL       | breadth-first    | strong       | top-down                | replicated
  PT        | breadth-first    | strong       | hybrid                  | replicated

• Though BPP improves on RP with respect to load balancing, the improvement is limited when the raw data set is skewed on some attributes. This is primarily because the task granularity of RP and BPP is relatively large and uneven. To make load balancing the utmost priority, algorithm ASL (Affinity SkipList) was developed. In ASL, each cuboid is treated as a task. ASL uses an affinity task scheduling strategy to seek out relationships among the tasks assigned to the same processor and to maximize sort sharing among them. In this respect ASL resembles the top-down algorithms. ASL is also unique in that it maintains the cells of a cuboid in a different data structure, namely a skip list.

• Algorithm PT (Partitioned Tree) is a hybrid algorithm, combining both the idea of pruning from BUC and affinity scheduling from ASL. It processes tasks of slightly coarser granularity. The idea is to use binary partitioning to divide the cuboids into tasks as evenly as possible, in order to keep the load well balanced. The computation within each task proceeds in a bottom-up fashion; however, the task assignment is handled by affinity scheduling in a top-down fashion.

• Two other algorithms, based on a hash tree and a hash table respectively, were also developed. The implementation based on a hash tree used up memory so rapidly that it failed to process large data sets. The hash table based algorithm was implemented much like ASL in terms of task definition and scheduling; however, its performance is no better than ASL in most cases.
In the online  aggregation framework proposed and studied by Hellerstein, Haas and W a n g , an online query a l g o r i t h m based on A S L was developed. U s i n g the sampling technique, a user's online query can be responded to instantly. A n d with more and more d a t a processed, the answer becomes more and more refined and accurate. Integrating C U B E precomputation and online querying c o m p u t a t i o n together, this thesis gives a relative complete solution for the special problem d o m a i n : iceberg query c o m p u t a t i o n . T h e outline of the thesis is as follows. C h a p t e r 2 reviews key concepts and the main sequential algorithms for iceberg-cube c o m p u t a t i o n . C h a p t e r 3 introduces the various parallel algorithms we developed. C h a p t e r 4 presents a comprehensive  5  experimental evaluation of these algorithms, and concludes with a recipe for picking the best algorithms under various circumstances. processing. F i n a l l y , a conclusion is given in C h a p t e r 6.  6  C h a p t e r 5 discusses online  Chapter 2  Review The  background material necessary for understanding the parallel algorithms to  be introduced in C h a p t e r 3 is presented in this chapter.  W e first discuss iceberg  query, then the C U B E operator. A special C U B E operator, iceberg-cube, is introduced seperately. T h e last part of this chapter, Section 2.4 presents some sequential algorithms for C U B E and iceberg-cube c o m p u t a t i o n .  2.1  Iceberg Query  A n iceberg query is much like a regular aggregate query, except that it eliminates aggregate values that fall below some specified threshold after it performs an aggregate function over an attribute or a set of attributes. T h e p r o t o t y p i c a l iceberg query considered in this thesis is as follows for a relation R(targetl, targetk,  rest, aggregateField)  and a threshold T.  SELECT  targetl,  target2,...,  FROM  R  GROUP BY  targetl,  HAVING  count (rest) > T  targetk,  target2,  ...,  SVM(aggregateField)  targets,targetk  If the above iceberg query is applied to the relation R in Table 2.1, with T  7  targetl  target2  rest  aggregate Field  Item  Location  Customer  Sales  Sony 25" T V J V C 21" T V Sony 25" T V J V C 21" T V Sony 25" T V Panasonic H i - F i V C R  Seattle Vancouver Seattle LA Seattle Vancouver  Joe Fred sally sally bob torn  700 400 TOO 400 700 250  Table '2.1: E x a m p l e relation R  0  \P  Over huge data set Cut off output by setting threshold  The outpjt is just the small tip of the Icebeig  Iceberg Query 'SELECT FROM GROUP BY ^HAVING  A, B, C, COUNTT^ R A, B, C, COUNT!*) >= 2 j  F i g u r e 2.1: Iceberg Query  = 2 and k = 2, the result would be the tuple <Sony 25" T V , Seattle, 2100>. We notice that relation R and the number of unique target values are typically huge (the iceberg), and the answer, that is, the number of frequently o c c u r r i n g targets, is very small (the tip of the iceberg). T h i s situation is pictured in F i g u r e 2.1. A n iceberg query becomes especially i m p o r t a n t when the amount of input d a t a is tremendous, since d a t a analysts or managers can not possibly go through all the detailed information w i t h i n a huge data set. Usually, they only note frequently occurring behaviors, which are typically more important than unusual occurrences.  8  In realistic d a t a analysis, d a t a analysts often execute multiple iceberg queries, which G R O U P B Y on different number of dimensions. 
In realistic data analysis, analysts often execute multiple iceberg queries which GROUP BY on different numbers of dimensions. For example, they may want to know more detailed information if the previous query returned too few results; they might then "drill down" by adding GROUP BY attributes. On the other hand, if the previous query gave back too much, too-detailed information, they may like to "roll up" by giving fewer GROUP BY attributes in the next query. A generated report containing results from all of those queries can be formulated in standard SQL, but its representation is inconvenient. As well as drill-down and roll-up, some other frequently used queries, including histograms and cross-tabs, are also difficult to represent in standard SQL [19].

2.2 CUBE Operator

To overcome this limitation of standard SQL, the CUBE operator was introduced in [8] by J. Gray et al. It generalizes the standard GROUP BY operator to compute aggregates for every combination of the GROUP BY attributes. For instance, consider the relation SALES(Model, Year, Color, Sales) shown in the left-hand table of Figure 2.2. When CUBE is applied to SALES with GROUP BY attributes Model, Year and Color, and the aggregate is on attribute Sales (SUM in this case), the result contains the sum of Sales for the entire relation (i.e., no GROUP BY), for each single attribute: (Model), (Year), (Color), for each pair: (Model, Year), (Model, Color) and (Year, Color), and finally for each (Model, Year, Color). The result is shown in the right-hand table of Figure 2.2. Figure 2.3 shows the CUBE in a multi-dimensional array format.

In OLAP terminology, the GROUP BY attributes are called "dimensions", the attributes that are aggregated are called "measures", and one particular GROUP BY (e.g., (Model, Year)) in a CUBE computation is called a "cuboid" or simply a "group-by".

Three types of aggregate functions are identified in [8].
Figure 2.2: CUBE Operation on Relation SALES [8]. The left-hand table lists the 18 tuples of relation SALES(Model, Year, Color, Sales), covering Models Chevy and Ford, Years 1990-1992 and Colors red, white and blue; the right-hand table shows the result of SELECT Model, Year, Color, SUM(Sales) FROM SALES CUBE BY Model, Year, Color, where ALL marks an aggregated-out dimension.

Figure 2.3: Cube in Multi-dimensional Array Format [8]. The data cube and its sub-space aggregates: sums by Make, by Year, by Color, the pairwise cross-tabs, and the grand total.

Consider aggregating a set of tuples T. Let {S_i | i = 1, ..., n} be any complete set of disjoint subsets of T, such that the union of all S_i is T and the intersection of any two S_i is empty.

• An aggregate function F is distributive if there is a function G such that F(T) = G({F(S_i) | i = 1, ..., n}). SUM, MIN and MAX are distributive with G = F; COUNT is distributive with G = SUM.

• An aggregate function F is algebraic if there is an M-tuple-valued function G and a function H such that F(T) = H({G(S_i) | i = 1, ..., n}), where M is constant regardless of |T| and n. All distributive functions are algebraic, as are Average, standard deviation, MaxN and MinN. For Average, G produces the sum and the count, and H divides the former by the latter.

• An aggregate function F is holistic if it is not algebraic. For example, Median and Rank are holistic.
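This classification is what makes partitioned and parallel cube computation possible: distributive and algebraic aggregates can be computed per partition and then merged. The following is a minimal sketch of mine (not from the thesis) of the G and H functions for Average, with the hypothetical names partial, merge and finish.

    #include <iostream>
    #include <vector>

    // G: the M-tuple (here M = 2) carried for Average over one subset S_i.
    struct Partial { long sum = 0; long count = 0; };

    // Compute G(S_i) for one subset of the measure values.
    Partial partial(const std::vector<long>& s) {
        Partial p;
        for (long v : s) { p.sum += v; p.count += 1; }
        return p;
    }

    // Merge two partials; both components are themselves distributive (SUMs).
    Partial merge(Partial a, Partial b) { return {a.sum + b.sum, a.count + b.count}; }

    // H: finish the algebraic aggregate from the merged M-tuple.
    double finish(Partial p) { return static_cast<double>(p.sum) / p.count; }

    int main() {
        // Two disjoint subsets S_1 and S_2 of T, e.g. tuples held by two processors.
        std::vector<long> s1 = {700, 400, 700}, s2 = {400, 700, 250};
        Partial p = merge(partial(s1), partial(s2));
        std::cout << "AVG(T) = " << finish(p) << "\n";  // 3150 / 6 = 525
        return 0;
    }

A holistic function such as Median admits no such constant-size G, which is why the algorithms below restrict attention to distributive and algebraic aggregates.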
A l g o r i t h m s which follow paths from the raw d a t a towards the  t o t a l aggregate value are called "top-down" approaches. A l g o r i t h m s which compute paths in the reverse direction are called "bottom-up" approaches.  F o r the exam-  ple shown in F i g u r e 2.4(a), a top-down approach computes from A B C D , to A B C , to A B and eventually to A ; a b o t t o m - u p approach goes in the opposite direction. F i g u r e 2.4(b) gives a sample processing tree of top-down a l g o r i t h m . T h e processing tree of bottom-up algorithm is illustrated in F i g u r e 2.4(c). In the following, we will discuss some significant sequential C U B E algorithms proposed. C U B E algorithms can be viewed as having two stages: the planning stage and the execution stage. In the planning stage, the algorithms decide how to convert the lattice into a processing tree; in the execution stage, the algorithm computes cuboids.  2.4.1  Top-down C U B E algorithms  C U B E algorithms always t r y to discover and take advantage of c o m m o n a l i t y between a node and its parent in the lattice view.  F o r many top-down algorithms, they  recognize that group-bys w i t h c o m m o n attributes can share, sorts, or partial sorts, and utilize those sharings. T a k i n g the processing tree shown in F i g u r e 2.4(b) as an example, A D represents the cuboid G R O U P B Y on A and D . If the d a t a set has been sorted with respect to A and D in order to compute A D , then for c o m p u t i n g cuboid A , the d a t a set does not have to be re-sorted.  We can simply accumulate  the sums for each of the values in A . A p p a r e n t l y , cuboid A and A D share sort on  13  ABCD  AB  AC  AD  BC  BD  CD  (a) 4-Dimension Lattice ABCD  (b) Sample Processing Tree of Top-Down Algorithms 5ABCD  (c) Processing Tree of Bottom-Up Algorithms  F i g u r e 2.4: L a t t i c e and Processing Trees for C U B E C o m p u t a t i o n [4] 14  attribute A . Besides sort sharing, there are some other commonalities which were exploited by top-down algorithms. Some of these, specified as o p t i m i z a t i o n techniques, are listed by Sarawagi [2]: •  Smallest-parent:  T h i s aims at c o m p u t i n g a group-by from the smallest previ-  ously computed group-by. F o r example, we can compute group-by A B from group-by A B C and A B D . However, among the two potential parents, only the one w i t h smallest size will be selected, because c o m p u t i n g from the small parent will lead to lower cost. •  Cache-results:  T h i s technique tries to compute a group-by when its parent is  still in memory, hence,.reducing disk I / O . •  Amortize  scans: T h i s technique also aims at reducing disk I / O by a m o r t i z i n g  disk reads by c o m p u t i n g as many group-bys as possible together in memory. For instance, d u r i n g scanning group-by A B C D , we can compute group-bys A B C , A C D , A B D and B C D at the same time. •  Share-sorts:  Sort-based algorithms use this technique to share sorting cost  among multiple group-bys. •  Share-partitions:  T h i s is specific to the hash-based a l g o r i t h m . W h e n a hash  table can not fit in the memory, d a t a will be partitioned into chunks which do fit in memory. Once a chunk is read i n , multiple group-bys will be computed in order to share the p a r t i t i o n i n g costs. In the following, we will discuss several sequential top-down algorithms.  
PipeSort, PipeHash and Overlap P i p e S o r t and P i p e H a s h algorithms are among the first algorithms for efficient C U B E c o m p u t a t i o n . T h e y were proposed by Sarawagi et al. in [2]. B o t h assume the cost  15  BC 5 15  AB 5 15  ABC 10 30  AC 4 14  BD 5 15  ABD 15 40  ACD 5 20  AD 5 15  CD 1020  BCD 45 130  ABCD 50 160  F i g u r e 2.5: A n E x a m p l e of 4-DimensionaI L a t t i c e for A l g o r i t h m P i p e S o r t [2]  of each node in a lattice p r o p o r t i o n a l to the product of the cardinalities of G R O U P B Y attributes and t r y to compute each cuboid from a parent having the smallest cost. However, T h e d a t a structures of the two algorithms are different: P i p e S o r t uses array and sorting is done prior to aggregation; P i p e H a s h uses hash tables. Furthermore, P i p e S o r t considers share-sorts o p t i m i z a t i o n , t r y i n g to minimize the number of sorts, whereas P i p e H a s h focuses on share-partitions o p t i m i z a t i o n . P i p e S o r t distinguishes between two different costs attached to each node X in the lattice view: cost A ( X ) and cost S ( X ) . A ( X ) is induced when one child of X is created through aggregating without any sort on X . A c t u a l l y only one child of X can be computed w i t h cost A ( X ) . F o r instance, for cuboid A B C D , only its child, cuboid A B C , can be computed without any sort on A B C D . F o r other children, if they are computed from A B C D , cost S ( A B C D ) is induced becasue resort on A B C D is necessary. In this way, the sorting cost is counted by P i p e S o r t . Assuredly, cost S ( X ) is always greater than or equal to A ( X ) . In the planning stage, a processing tree with a m i n i m u m total cost, t a k i n g both A ( X ) and S ( X ) into account, is computed in a level-by-level manner, where level N contains all A^-dimensinal cuboids. W h e n c o m p u t i n g on a level, the algor i t h m determines what edges between the nodes in this level and the next level in  16  all  t A 4 16  B  1  D  f  T  AC14 4  DB15 5  1  \  BA  +  L-- "1".'  BAD 15 40  ACD 5 20  1  _-  -' ~~  DBC 45 130  CBAD 50 160 i  raw data  (a) M i n i m u m Cost Sort Plan  (b) Pipelines Executed  F i g u r e 2.6: A n E x a m p l e of P l a n and Pipelines for A l g o r i t h m P i p e S o r t [2]  the lattice should be left in the final m i n i m u m cost tree. Since each edge has a cost attached, either A ( X ) or S ( X ) , the problem is converted into finding the m i n i m u m cost m a t c h i n g in a bipartite graph. G i v e n the lattice shown in F i g u r e 2.5, the  final  m i n i m u m cost plan becomes that shown in F i g u r e 2.6(a). T h e pair of numbers underneath each group-by in the figure denote the A ( X ) and S ( X ) costs. T h e detailed plan c o m p u t a t i o n is elaborated in [2]. After a plan is created, in the execution stage each path is computed in a pipeline manner. F i g u r e 2.6(b) shows the pipelines' execution for the generated plan in Figure 2.6(a). T h e head of each pipeline implies a re-sort, from its parent in the processing tree. L i k e P i p e S o r t , P i p e H a s h aims at c o m p u t i n g cuboids from their smallest parents. Since P i p e H a s h takes hash tables as its d a t a structure, no sorting is required. Therefore, each node in the lattice has only one cost, which is similar to A ( X ) in PipeSort.  
In the planning stage of P i p e H a s h a m i n i m u m spanning t r e e ( M S T ) is  computed based on the singular cost of each node. F i g u r e 2.7(a) gives an example  17  of an M S T . Besides smallest-parent optimization, PipeHash also explores share-partitions and amortized-scans optimizations. It computes as many cuboids as possible if their parents are in memory. If the main memory is big enough to hold hash tables for all cuboids, PipeHash can finish the cube computation in one data scan without any sorting. If no enough memory is available, PipeHash partitions data on some selected attribute, then processes each partition independently. Although data partitioning solve the memory problem, the partitioning attribute limits computations to only include group-bys with that particular attribute. For example, from the M S T in Figure 2.7(a), we compute a C U B E on dimensions A , B , C and D . If we partition on A , then the partitions are only used to produce cuboids containing dimension A , including A B C D , A B C , A B D , A B , A C , A D and A . Other cuboids will be computed afterward from cuboids with attribute A . Ideally, they can fit in memory and no further partitioning is necessary. This makes M S T divide into subtrees, as shown in figure 2.7(b) and (c). By processing as large a subtree(or a set of subtrees) of the M S T as possible in memory, computing all nodes in it (or them) simultaneously, PipeHash favors optimizations cache-results and amortize-scans. Sarawagi compared PipeSort and PipeHash in  [2]. PipeHash suffers two  apparent problems, requiring re-hash for every group-by and requiring a significant amount of memory. This makes it can only outperform PipeSort as the data is dense. However, in this thesis, the problem domain is iceberg-cube computation in which data is supposed to be highly sparse; therefore, hash-based algorithms are not our major concern. However, we did implement some hash-based algorithms and they will be discussed in Chapter 3 . Overlap, proposed in [21], as well as PipeSort, considers sorting cost, but it deals with it in a different way. It tries to overlap as much sorting as possible by computing group-by from a parent with the maximum sort-order overlap. The  18  A  A  2  B  C  / 8  A  AB  10 A  AC If  BC  \i  D  4^  5  AD  V  CD  BD  2 0 ' ^ 20 y 20 • /d 12  20 ABC^  30  JC ^  ABD ^90  ACD BC *7 5 0 - ^  AB^D  ACD  40 ABCD  ABCD A IOO  i Raw Data  i  Raw Data  n  (b) Subtree: tioned on A  (a) M i n i m u m Spanning Tree  Ltl *  Parti-  D  >V 5  AB \ B C  A ABC  BCD  Y  L\BCD  (c) Remaining Subtrees  F i g u r e 2.7: P i p e H a s h on a Four A t t r i b u t e G r o u p - b y [2]  19  algorithm recognizes that if a group-by shares a prefix of G R O U P B Y attributes w i t h its parent, then the parent consists of a number of partitions, one for each value of the prefix. For example, since cuboid A B C and cuboid A C share a G R O U P B Y prefix A , the A B C group-by has | A | partitions t h a t can be sorted independently on C to produce the A C sort order, where \A\ is the number of values for attribute A . Overlap always selects a parent for a cuboid which shares the longest G R O U P B Y prefix w i t h that c u b o i d .  T h e n the size of p a r t i t i o n is m i n i m i z e d .  If several  potential parents of a group-by share the same length of prefix w i t h it, and then the smallest one will be picked as the final parent. 
Overlap chooses a sort order for the root of the processing tree, then all subsequent sorts are some suffix of this order. The planning stage will build a tree like t h a t shown in F i g u r e 2.4(b). Once this processing tree is formed, O v e r l a p tries to fit as many partitions in memory as possible. If a partition of a group-by can fit in the main memory, then a subtree of the processing tree rooted by t h a t group-by will be computed in a pipeline manner when the partition is scanned i n . T h i s is expected to save much I / O costs for writing intermediate results. The experiments show that Overlap performs consistently better than P i p e S o r t and P i p e H a s h . However, [14] argues that Overlap on sparse C U B E S still produces a large amount of I / O by sorting intermediate results.  PartitionedCube and MemoryCube W h e n the above C U B E algorithms are applied to sparse d a t a sets, their performance becomes poor. G r o u p - b y s for sparse d a t a sets are more likely to be large; buffering intermediate group-bys in memory requires too much memory. If the main memory is limited, then intermediate group-bys will be written out and read into memory multiple times, which increases I / O dramatically. Moreover, predictation of the size of group-bys becomes very difficult, because the real size of a group-by may not be proportional to the product of cardinalities of the G R O U P B Y attributes. T h i s  20  makes the cost of c o m p u t a t i o n in P i p e S o r t and P i p e H a s h no longer feasible. M o r e recently, Ross and Srivastava proposed an efficient top-down algorithm designed for large, high-dimensional and sparse C U B E s [14]. T h e i r algorithm consists of two parts: P a r t i t i o n e d C u b e and M e m o r y C u b e . P a r t i t i o n e d C u b e partitions the d a t a on some a t t r i b u t e into memory-sized units and M e m o r y C u b e computes the C U B E on each in-memory p a r t i t i o n . P a r t i t i o n i n g in P a r t i t i o n e d C u b e is very similar to P i p e H a s h . T h e algorithm chooses an a t t r i b u t e to p a r t i t i o n input into fragments. T h e n all cuboids containing that attribute will be computed on each fragment separately. F o r example, if C U B E is to be computed on attributes { A , B , C , D } , we might p a r t i t i o n the input relation on a t t r i b u t e A , and get three partitions. T h e n , we compute cuboids A B C D , A B C , A B D , A C D , A B , A C , A D and A for each p a r t i t i o n . B y t a k i n g the union of corresponding partial cuboids computed from each p a r t i t i o n , we get finally the complete cuboids. T h e n cuboid A B C D can be taken as the input to compute another c u b o i d . P a r t i t i o n e d C u b e is called recursively if the fragments or cuboid A B C D is still too big to fit in the memory; in that case, the d a t a will be further partitioned on other attributes. F i g u r e 2.8(a) gives an illustration of this example. Once the input of P a r t i t i o n e d C u b e fits in the memory, then M e m o r y C u b e can be applied. M e m o r y C u b e is a sort-based a l g o r i t h m , which is its main difference from P i p e H a s h . L i k e P i p e S o r t , however, M e m o r y C u b e algorithm takes advantage of the P i p e l i n i n g technique. It tries to minimize the number of pipelines and hence, the number of sorts.  
Its P a t h s algorithm (not to be discussed in detailed  here),  guarantees that the number of pipelines(paths) it generates will be the m i n i m u m number of paths to cover all the nodes in the lattice. F i g u r e 2.8(b) shows the paths for 4-dimension C U B E c o m p u t a t i o n . There are six pipelines in total built from the input data. T h e cuboids in boxes are the heads of the pipelines. S o r t i n g is required to create the head of the pipeline, which is shown as dash lines in F i g u r e 2.8(b), however, no sorting is needed in the pipelines.  21  BD i  ' AB  V  CD | / T  \  ABC \  \ ^B' B  AD^B  \  I  /  \  - i i  Cuboid(BCD)(In memroy, B projected out;  !•/  '  ABCD  Cube(ABCD) (Partitioned by B, A projected out)  R (Partitioned by A)  (a) Partitioning  all B  AB  BC  D  A  BD  CD  DA  7T~ BCD  CAD  DAB  _ — KBCD  in-memory partition (b) Paths Found by MemoryCube  F i g u r e 2.8: E x a m p l e s for P a r t i t i o n e d C u b e and M e m o r y C u b e A l g o r i t h m s [14]  22  Since P a r t i t i o n e d C u b e only considers pipelines in the M e m o r y C u b e , this algorithm tries to reduce the amount of I / O for intermediate results, and thus enhance the performance for sparse C U B E c o m p u t a t i o n .  Array-Based Algorithms W h e n using the array-based algorithms, as one proposed in [13], d a t a sets are stored in a multi-dimension array, where each coordinate matches a C U B E a t t r i b u t e .  A  tuple's location in the array is determined by its value in each dimension. T h e algor i t h m requires no tuple comparison, only array indexing. Unfortunately, if the d a t a is sparse, the algorithms become infeasible, as the array becomes huge. Therefore, we find array-based algorithms are too limited to warrant further discussion here.  2.4.2 Our in  Bottom-Up C U B E Algorithm background search revealed only one b o t t o m - u p a l g o r i t h m . It was introduced  [4] by K . Beyer and R . R a m a k r i s h n a n , and called the B o t t o m U p C u b e , B U C for  short. It especially targets iceberg-cube c o m p u t a t i o n . Setting thresholds in iceberg queries always cuts off a lot of cells in general cuboids. F o r the d a t a set used in [4], and which was also used in our experiments, as many as 20% of the group-bys consisted entirely of cells w i t h support as one. For the iceberg queries with minsup higher than 1, those group-bys do not need to be computed at a l l . T h i s makes sense to consider a way to prune as early as possible in C U B E c o m p u t a t i o n . Unfortunately, when we traverse a lattice in a top-down fashion, we can not prune cells which have insufficient support in any c u b o i d , until the last step.  F o r example, suppose the threshold is set by specifying  HAVING  C o u n t ( * ) > = 2 in iceberg-cube (the minsup is 2). Before we compute cuboid A B C from cuboid A B C D , we can not prune the cells with support as in 1, for example, a l b l c l d l ( m i n s u p : l ) and a l b l c l d 2 ( m i n s u p : l ) . T h i s is because they contribute to the cells in A B C , whose supports are bigger than 1, for example, a l b l c l ( m i n s u p  23  is 2). However if we compute from c u b o i d A B C to cuboid A B C D in a bottom-up fashion, pruning is possible. A l t h o u g h cuboid A B C D can not be directly computed from c u b o i d A B C , we can make sure t h a t tuples which do not contribute to cells in cuboid A B C will not contribute to cells in A B C D . 
W e could therefore prune out those tuples in the raw d a t a earlier, before the c o m p u t a t i o n for A B C D proceeds. T h u s , in B U C , a bottom-up approach is adopted. T h e idea is to combine the I / O efficiency of the P a r t i t i o n e d C u b e a l g o r i t h m , with m i n i m u m support pruning. T h e processing tree of B U C is illustrated in F i g u r e 2.4(c). T h e numbers in F i g u r e 2.9 indicate the order in which B U C visits the group-bys. A skeleton of B U C is shown in F i g u r e 2.9; we use the notation TA to de{  note the set of all nodes in the subtree rooted at A . F o r the example given in t  F i g u r e 2.4(c), T  = {B, BC, BD, BCD}.  B  Prefix in line 9 in F i g u r e 2.9 indicates the  current processed cuboid's G R O U P B Y dimensions. Take the B U C processing tree in F i g u r e 2.9 as an example: B U C starts with cuboid all, and then cuboid with G R O U P B Y attribute A . F o r each value Vj in A , the d a t a set is partitioned. T h e n for those partitions w i t h higher support than minsup, B U C is called recursively in a depth-first manner to process other dimensional groupbys (in lines 14-16). For example, for p a r t i t i o n A u i , in the first further recursion, B U C proceeds partitioning on attribute B , producing finer partitions AuiBui to partitions Av\Bv . m  A f t e r w a r d , B U C is recursively called on those finer partitions  to produce some cells in cuboids A B C , A B C D and A B D . W h e n all recursions for partition Avi  return, B U C proceeds in the same way on other partitions for AVJ.  W h e n all partitions based on A finish, B U C continues on attributes B , C and D in the same way. F i g u r e 2.10 shows how B U C p a r t i t i o n i n g proceeds.  T h e arrows shows the  partitioning order. T h e gray area depicts those partitions pruned out based on the constraints(minsup in this case). A l t h o u g h B U C can exploit pruning, it can not optimize by share-sort or  24  1. 2. 3. 4. 5.  Algorithm B U C - M a i n I N P U T : Dataset R w i t h dimensions {A , A ,.. .A }, the m i n i m u m support Spt. O U T P U T : Qualified cells in the 2 cuboids of the cube. PLAN: Starting from the b o t t o m , output the aggregate on " a l l " , and then a depth first traversal of the lattice, induced by {Ay, A , • • -A }. f o r e a c h dimension A (?' from 1 to m) d o B\JC(R,r Spt,{}) CUBE COMPUTATION: p r o c e d u r e B\JC(R,TA,, Spt, prefix) . prefix = prefix U{.4;} f o r e a c h combination of values Vj of the attributes in prefix d o partition R to obtain Rj if (the number of tuples in Rj is > Spt) aggregate Rj, and write out the aggregation to cuboid with cube dimensions indicated by prefix f o r e a c h dimension Ak, k from + 1 to m d o t  m  m  2  6.  2  m  2  An  7.  8. 9. 10. 11. 12. 13.  14. 15. 16. 17.  i  call B\JC(Rj,T , AK  e n d e n d  Spt,  prefix)  f o r  f o r  F i g u r e 2.9: A Skeleton of B U C  25  Partition on A  Partition on A B C  Partition on AB  b2 b3 b4 b5 F i g u r e 2.10: B U C P a r t i t i o n i n g  26  Partition on ABCD  smallest parent techniques. P a p e r [4] compares B U C with P a r t i t i o n e d C u b e . It claimes t h a t B U C performs better than P a r t i t i o n e d C u b e . T h e pruning significantly reduces running time when the m i n i m u m support is above 1. E v e n w i t h minsup as 1, that is, full C U B E is computed, B U C still outperforms it.  
Chapter 3

Parallel Iceberg-cube Algorithms

The key to success for an online system is the ability to respond to queries in a timely fashion. The compute- and data-intensive nature of iceberg-cube queries necessitates a high-performance machine. In the past, this required expensive platforms, such as symmetric multiprocessor machines. In recent years, however, a very economical alternative has emerged: a cluster of low-cost commodity processors and disks.

PC clusters provide several advantages over expensive multiprocessor machines. First, in terms of raw performance, processor speeds are similar to, and often exceed, those of multiprocessor architectures. Recent advancements in system area networks, such as Myrinet, standards like VIA, and 100Mbit or Gigabit Ethernet, have significantly improved communication bandwidth and latency. Second, although I/O and the use of commodity disks are weaknesses of these systems, parallelism can easily be exploited, as we show. Third, the affordability of PC clusters makes them attractive for small to medium sized companies, and they have been the dominant parallel platform for many applications [5], including association rule mining [18].

In the remainder of this thesis, we discuss the various novel algorithms we have developed for parallelizing iceberg-cube computation. Our focus is on practical techniques that can be readily implemented on low-cost PC clusters using open source software: Linux and public domain versions of the MPI message passing standard. Since our results apply to low-cost clusters, the question arises of how far they generalize to higher-cost systems. In Chapter 4, we examine how the various algorithms speed up in the presence of more nodes/processors in the cluster. Thus, if the key difference between a low-cost and a high-cost cluster is only the number of nodes, then our results are applicable; if the key difference is in the underlying communication layer, they may not be.

All of the algorithms to be presented use the basic framework of having a planning stage and an execution stage. In the case of parallel algorithms, the planning stage involves (i) breaking the entire computation down into smaller units called tasks, and (ii) assigning tasks to processors, where the objective is now to minimize the running time of the processor that takes the longest to complete its tasks. To simplify the presentation, we do not include the aggregation for the node "all" as one of the tasks; this special node can be easily handled. Furthermore, it is assumed that the initial dataset is either replicated at each of the processors or partitioned among them. The output, that is, the cells of the cuboids, remains distributed: each processor writes to its local disk.

In this chapter, we introduce the algorithms. As shown in Table 1.1, the algorithmic space that we explore involves the following issues:

• The first issue is how to write out the cuboids. Because BUC is bottom-up, the writing of cuboids is done in a depth-first fashion. As will be shown later, this is not optimal from the point of view of writing performance. This leads us to develop an alternative breadth-first writing strategy.

• The second issue is the classical issue of load balancing. This issue is intimately tied to the definition of what a task is; different algorithms essentially work with different notions of a task. In general, when the tasks are too coarse-grained, load balancing is not satisfactory; however, if the tasks are too fine-grained, a lot of overhead is incurred.
Different algorithms essentially work with different notions of a task. In general, when the tasks are too coarsegrained, load balancing is not satisfactory. However, if the tasks are too fine-  29  grained, a lot of overhead is incurred; • when it comes to iceberg-cube c o m p u t a t i o n , an i m p o r t a n t issue is the strategy for traversing the cube lattice. A s discussed earlier, top-down traversal may exploit share-sort, whereas b o t t o m - u p traversal exploits p r u n i n g based on the constraints.  O u r algorithms consider these possibilities; in fact, one of the  algorithms combines the two strategies in an interesting way; • as usual, for parallel c o m p u t a t i o n , we explore whether d a t a p a r t i t i o n i n g is effective.  3.1  Algorithm R P  Recall from F i g u r e 2.4(c) that the processing tree of B U C consists of independent subtrees rooted at each of the dimensions. T h u s , in the algorithm called Replicated Parallel B U C , R P for short, each of these subtrees becomes a task. In other words, for a cube query involving m attributes, there are m tasks. Processor assignment is simply done in a round-robin fashion. W i t h this simple assignment strategy, if there are more processors than tasks, some processors will be idle. T h e d a t a set is replicated on all machines in a cluster. E a c h processor reads from its own copy of the dataset, and outputs the cuboids to its local disk. T h e skeleton of R P is showed in F i g u r e 3.1. F i g u r e 3.2 gives an example of c o m p u t i n g a 4-dimensional C U B E on a cluster of 4 P C s . In t o t a l , 4 tasks are created: subtrees rooted by A , B , C and D respectively. Each machine compute one task.  3.2  Algorithm B P P  W h i l e R P is easy to implement, it appears to be vulnerable in at least two of its aspects. F i r s t , the definition of a task may be too simplistic in R P . T h e division of  30  1.  Algorithm R P  2. 3. 4. •5. 6.  I N P U T : Dataset R w i t h dimensions { A i , A2, • • • A } and m i n i m u m support Spt; O U T P U T : T h e 2 cuboids of the d a t a cube. PLAN: Task definition: identical to B U C , i.e., subtrees rooted at A , Processor assignment: assign a processor, in round robin fashion, to each subtree rooted at dimension A ; (i from 1 to ra) C U B E C O M P U T A T I O N (for a processor): p a r a l l e l do For each subtree rooted at dimension A ; assigned to the processor call B\JC(R, TAI , Spt, {}) (with o u t p u t written on local disks)  7. 8. 9. 10.  m  m  end do F i g u r e 3.1: A Skeleton of the Replicated P a r a l l e l B U C A l g o r i t h m  Raw Data Replicated  Figure 3.2: Task Assignment in A l g o r i t h m R P  31  the cube lattice into subtrees is coarse-grained. One consequence is that some tasks are much bigger than others.  F o r example, the subtree rooted at A , TA, is much  larger than that rooted at C , Tc-  T h u s , load balancing is poor. Second, B U C is  not optimized in w r i t i n g performance. T o address these problems, we developed the a l g o r i t h m called Breadth-first w r i t i n g P a r t i t i o n e d Parallel B U C , or B P P for short.  3.2.1  Task Definition and Processor Assignment  T o achieve better load balancing, B P P tries to get finer-grained tasks by range p a r t i t i o n i n g on each attribute.  T h i s is motivated by Ross and Srivastava's design  of the P a r t i t i o n e d - C u b e , which attemps to p a r t i t i o n d a t a into chunks which fit in memory [14]. 
B P P partitions d a t a in the following way: • for a given attribute A,-, the dataset R is range-partitioned into n chunks (i.e., • •., Ri( ))i n  where n is the number of processors. Processor Pj keeps its  copy R{(j) on its local disk; • note that because there are m attributes, the above range p a r t i t i o n i n g is done for each attribute. R\(j),  T h u s , processor Pj keeps m chunks on its local disk (  • • •, i ? ( j j ) . A n y of these chunks may have some tuples in c o m m o n ; m  • range p a r t i t i o n i n g itself for the m attributes can be conducted in parallel, w i t h processor assignment done in a round-robin fashion. F o r instance, processor i may partition attribute Ai, then A ;  + n  , and so on. N o t i c e that as far as B P P  execution is concerned, range partitioning is basically a pre-processing step. If there are m cube attributes, then there will be a t o t a l mxn chunk corresponds to one task. responsible for processing it.  chunks. E a c h  T h e processor who has the chunk in the local is  If processor Pj process chunk R^j),  where R{(j) is  produced by range p a r t i t i o n i n g on attribute i , Pj computes the (partial) cuboids in the subtree rooted at A ; . These cuboids are partial because Pj only deals w i t h  32  PO range-partition R on A  -BML-,  P1 range-partition R on B  RbO  P2 range-partition R on C  Egg  , Ral_  , Ra2_  Rbl  Rb2  )  P3 range-partition R on D  I  _Ra3  J L  EB t  I  I  -J<1D  Ethernet  P O  PI  P2  P 3  F i g u r e 3.3: Task Assignment in B P P  the part of the d a t a it controls, in this case, R${j\- T h e cuboids are completed by merging the output of all n processors. Figure 3.3 illustrates task allocation and process in B P P . Each of the 4 processors in the cluster takes on the responsibility of range partitioning the raw d a t a set R on one dimension and d i s t r i b u t i n g the resulting partitions across the processors.  Since there are 4 cube dimensions in t o t a l , after d a t a partitioning each  processor gets 4 chunks. D a t a chunks in the same color on the same row are partitioned on the same a t t r i b u t e and have no overlap. However, d a t a chunks located in the same processor are partitioned on different attributes and may have overlap. A processor takes chunk R^j) to compute subtree %, for example, P I would use Rsi) to compute subtree 7cB y p a r t i t i o n i n g d a t a across processors, B P P achieves better load balancing than R P . If d a t a can be evenly distributed among processors, then the load may be very well balanced in a homogeneous environment.  33  Cuboid A  Cuboid A B C  Cuboid A B albl(2)<3>  al -  (Dfl>  /'""V  alblcl  (3)<7> -,.  >alblc2  (4)<8> <f  Salblc3  (5)<9>  ^  alb2cl  (7)<10>  alb2c2  (8)<11>^  \lb2c3  (9)<12> {  ;  (10)<2^  F i g u r e 3.4: Depth-first W r i t i n g vs Breadth-first W r i t i n g  3.2.2  Breadth-first W r i t i n g  B U C computes in a bottom-up manner, and the cells of the cuboids are written out in a depth-first fashion.  In the situation shown in F i g u r e 3.4, there are three  attributes A , B and C , where the values of A are a\, a , and so on, values of B 2  are b\ and 62 1 values of C are c i , C2 and C3. A s shown in F i g u r e 2.9, the tuples of a i are aggregated in line 14 (assuming that the support threshold is met), and the result is o u t p u t . T h e recursive call in line 15 then leads the processing to the cell a\b\,  then to the cell a\b\C\,  then to  a\b\C2,  and so o n . 
3.2.2 Breadth-first Writing

BUC computes in a bottom-up manner, and the cells of the cuboids are written out in a depth-first fashion. In the situation shown in Figure 3.4, there are three attributes A, B and C, where the values of A are a1, a2, and so on, the values of B are b1 and b2, and the values of C are c1, c2 and c3. As shown in Figure 2.9, the tuples of a1 are aggregated in line 14 (assuming that the support threshold is met), and the result is output. The recursive call in line 15 then leads the processing to the cell a1b1, then to the cell a1b1c1, then to a1b1c2, and so on. In Figure 3.4, the number in round brackets beside each node denotes the order in which the cell is processed and output under depth-first writing, and the solid lines denote the writing sequence. Note that these cells belong to different cuboids. For example, the cell a1 belongs to cuboid A, the cell a1b1 to cuboid AB, and the cells a1b1c1 and a1b1c2 belong to cuboid ABC.

[Figure 3.4, omitted: the cells of cuboids A, AB and ABC, annotated with their processing order under depth-first writing (round brackets, solid lines) and under breadth-first writing (angled brackets, dashed lines).]

Figure 3.4: Depth-first Writing vs Breadth-first Writing

Clearly, under depth-first writing, the writing to the cuboids is scattered. This incurs a high I/O overhead. We could possibly use buffering to alleviate the scattered writing to disk. However, this requires a large amount of buffer space, thereby reducing the amount of memory available for the actual computation. Furthermore, many cuboids may need to be maintained in the buffers at the same time, causing extra management overhead.

In BPP, this problem is solved by breadth-first writing. Returning to the example in Figure 3.4, BPP completes the writing of a cuboid before moving on to the next one. For example, the cells a1 and a2, which make up cuboid A, are first computed and written out. Then all the cells in cuboid AB are output, and so on. In Figure 3.4, the number in angled brackets beside each node denotes the order in which the cell is processed under breadth-first writing, while the dashed lines depict its writing sequence.

1.  Algorithm BPP
2.  INPUT: Dataset R with dimensions {A1, A2, ..., Am} and minimum support Spt
3.  OUTPUT: The 2^m cuboids of the data cube
4.  PLAN:
5.  Task definition: (partial) cuboids of subtrees rooted at Ai
6.  Processor assignment: as described in Section 3.2.1
7.  CUBE COMPUTATION (for the processor Pj):
8.  parallel do
9.    for each Ai (i from 1 to m) do
10.     call BPP-BUC(Ri(j), T_Ai, Spt, {}) (with output written on local disks)
11.   end for
12. end do

13. Subroutine BPP-BUC(R, T_Ai, Spt, prefix)
14. prefix = prefix ∪ {Ai}
15. sort R according to the attributes ordered in prefix
16. R' = R
17. for each combination of values of the attributes in prefix do
18.   if (the number of tuples for that combination >= Spt)
19.     aggregate on those tuples, and write out the aggregation
20.   else remove all those tuples from R'
21. end for
22. for each dimension Ak, k from i+1 to m do
23.   call BPP-BUC(R', T_Ak, Spt, prefix)
24. end for

Figure 3.5: A Skeleton of the BPP Algorithm

Figure 3.5 gives a skeleton of BPP. As described in Section 3.2.1, the pre-processing step of range partitioning the dataset assigns each processor Pj its tasks. In the main subroutine BPP-BUC, breadth-first writing is implemented by first sorting the input dataset on the "prefix" attributes (line 15 in the skeleton). In our example, if the prefix is A, meaning that the dataset has already been sorted on A, then line 15 sorts the dataset further on the next attribute, B. The loop starting from line 17 then completes breadth-first writing by computing and outputting the aggregation of all the cells in cuboid AB.
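The run detection behind that loop can be sketched as follows, under simplifying assumptions: tuples are rows of integers, the only aggregate is COUNT, and cells are printed rather than written to a cuboid file. The point of the sketch is that once the data is sorted on the group-by attributes, every cell of that cuboid occupies one contiguous run, so the whole cuboid is emitted before any other cuboid is touched.

#include <cstdio>
#include <vector>

using Tuple = std::vector<int>;

// A sketch of breadth-first writing for one cuboid: scan the sorted data,
// detect runs of equal values on the group-by attributes `dims`, and emit
// each qualifying cell contiguously. Names are illustrative assumptions.
void writeCuboid(const std::vector<Tuple>& sorted,
                 const std::vector<int>& dims, int minsup) {
    std::size_t runStart = 0;
    for (std::size_t i = 1; i <= sorted.size(); ++i) {
        bool newCell = (i == sorted.size());
        if (!newCell)
            for (int d : dims)
                if (sorted[i][d] != sorted[runStart][d]) { newCell = true; break; }
        if (newCell) {
            std::size_t count = i - runStart;   // COUNT aggregate for the cell
            if (count >= static_cast<std::size_t>(minsup)) {
                for (int d : dims) std::printf("%d ", sorted[runStart][d]);
                std::printf("-> %zu\n", count);
            }
            runStart = i;
        }
    }
}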
Because some cells may not meet the support threshold, there is an extra complication in BPP-BUC: pruning should begin as early as possible. This is the purpose of lines 16 and 20. Note that, as opposed to what is presented in line 16 for simplicity, our implementation does not actually create a separate copy of the data. Instead, an index records the starting and ending positions in the sorted dataset, indicating that those tuples should be skipped in subsequent calls to BPP-BUC.

Breadth-first I/O is a significant improvement over the scattered I/O used in BUC. For the baseline configuration described in Chapter 4, the total I/O time RP took to write the cuboids was more than 5 times the total I/O time for BPP. Figure 3.6 gives the I/O comparison between RP (depth-first writing) and BPP (breadth-first writing).

[Figure 3.6, omitted: I/O cost of RP versus BPP as the number of processors varies.]

Figure 3.6: I/O comparison between BPP (breadth-first writing) and RP (depth-first writing) on 9 dimensions of a dataset with 176,631 tuples; the input size is 10 Mbytes and the output size is 86 Mbytes

3.3 Algorithm ASL

Although BPP offers a solution for load balancing, that solution is still unsatisfactory under some circumstances. The potential downfall of BPP comes from the fact that the amount of work each processor does depends on the initial partitioning of the data. The size of a task depends on the degree of skewness in the data set and on the order in which the dimensions are sorted and partitioned. If the skewness is significant, the tasks may vary greatly in size, thereby degrading load balancing. For example, an attribute named Gender can take only two values, Female and Male. Range partitioning can then produce only 2 chunks. Even if we have more than 2 processors, only two of them will be assigned chunks; the others will be relatively lightly loaded.

This motivates the development of another algorithm, called Affinity Skip List, or ASL for short. ASL creates tasks that are as small as the cube lattice allows: each node in the lattice makes a task. This allows efficient use of the processors, quite independent of the skewness and dimensionality of the data set. In the following, two key features of ASL are presented: the data structure used, and the processor assignment.

3.3.1 Using Skip Lists

A skip list is a data structure proposed by W. Pugh [22]. It is much like a linked list with additional pointers. Figure 3.7 is an example of a skip list. The lowest level of nodes makes up a linked list; the higher levels of nodes are used for efficient search and insert operations. As shown in Figure 3.7, searching or insertion always starts from the highest level of the head node. If the next link emitted from that level points to a node containing an element bigger than the element being inserted or searched for, we drop down one level; otherwise, we follow the link to the next node and continue from there. Figure 3.7 shows how an element with key 17 is added to a skip list. The number of levels a newly inserted element should have is determined randomly, but is not allowed to exceed a threshold set by the user.

[Figure 3.7, omitted: the search path followed through the original list before inserting key 17, and the list after the insertion.]

Figure 3.7: Pictorial Description of Steps Involved in Performing an Insertion [22]
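A minimal version of the structure, in the spirit of Pugh's description, is sketched below. The cap of 16 levels matches the limit ASL uses; reducing a cell's key to a single integer and keeping only a support count are simplifications for brevity.

#include <cstdlib>
#include <vector>

// A sketch of a skip list holding cuboid cells: a sorted linked list in
// which each node carries a random number of higher-level forward links
// that let searches skip ahead. Levels are capped at 16, as in ASL.
struct Node {
    int key;
    int count;                    // support count kept in the cell
    std::vector<Node*> next;
    Node(int k, int levels) : key(k), count(0), next(levels, nullptr) {}
};

struct SkipList {
    static const int kMaxLevel = 16;
    Node head{0, kMaxLevel};
    int level = 1;

    static int randomLevel() {                // coin flips, capped at 16
        int l = 1;
        while (l < kMaxLevel && std::rand() % 2 == 0) ++l;
        return l;
    }
    // Find the cell for `key`, creating it if absent, and bump its count.
    void insertOrUpdate(int key) {
        std::vector<Node*> update(kMaxLevel, &head);
        Node* x = &head;
        for (int i = level - 1; i >= 0; --i) {       // drop a level when the
            while (x->next[i] && x->next[i]->key < key)  // next key is too big
                x = x->next[i];
            update[i] = x;
        }
        x = x->next[0];
        if (x && x->key == key) { ++x->count; return; }
        int l = randomLevel();
        if (l > level) level = l;
        Node* n = new Node(key, l);
        n->count = 1;
        for (int i = 0; i < l; ++i) {          // splice into each level
            n->next[i] = update[i]->next[i];
            update[i]->next[i] = n;
        }
    }
};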
The benefits of using a skip list are threefold. First, a skip list exhibits good average-case behavior for insertion and searching, quite similar to that of a balanced tree, yet the implementation details are much simpler. Second, each node in the structure requires very little storage overhead. Third, a skip list grows incrementally as more elements are added, and the sort order of the list is always maintained. This is very important, because the data set need not be loaded entirely before sorting.

ASL uses skip lists to maintain the cells of cuboids. As it scans the raw data set, ASL builds the skip lists incrementally. If there is a node in the skip list representing the newly read tuple, then the aggregates and support counts of that node are updated; otherwise, a new node is inserted into the skip list. In theory, if there are k cuboids and enough memory, ASL can maintain all k skip lists simultaneously in one scan of the data set. For the data sets used in our experiments, however, this optimization brings minimal gain, so we did not explore it here.

3.3.2 Affinity Assignment

Now let us consider the task assignment policy of ASL. In order to achieve good load balancing, ASL defines tasks with very fine granularity: it takes each cuboid as a task and assigns it to processors dynamically. During task scheduling, ASL adopts a top-down approach to traversing the cube lattice. It always tries to assign incomplete high-dimensional cuboids to processors, while taking affinity into account. Once a processor finishes one task, it is assigned a new one that has some kind of affinity with the previous one. For example, if a processor has just created the skip list for the task ABCD, then it makes sense for the processor to be assigned the task of computing the cuboid ABC: the previous skip list for ABCD can simply be reused to produce the results for ABC. In the following, we refer to this situation as "prefix affinity".

In another situation, if a processor has just created the skip list for ABCD, this skip list is still useful if the processor is next assigned the task of computing the cuboid BCD, because we need only take the counts of each cell in ABCD and add them to the counts of the appropriate cells in the skip list for BCD. The groupings done for the skip list for ABCD are then not wasted. For example, suppose a cell in ABCD corresponds to the grouping a1b1c1d1. For the w tuples in the original dataset that belong to this cell, the current aggregate and support counts can readily be used to update the corresponding counts of the cell b1c1d1 in BCD; there is no need to re-read the w tuples and aggregate again. In the following, we refer to this situation as "subset affinity".
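The effect of subset affinity can be sketched as follows. A std::map stands in for the skip list (both keep cells in sorted order), projectCuboid is a hypothetical name, and the count-only aggregate is a simplification.

#include <map>
#include <vector>

using Key = std::vector<int>;          // one value per group-by attribute
using Cuboid = std::map<Key, long>;    // cell -> support count

// A sketch of subset affinity: the cuboid for, say, BCD is produced from
// the already-built ABCD cuboid by projecting each cell onto the kept
// attribute positions and summing counts, so the raw data set is never
// re-scanned.
Cuboid projectCuboid(const Cuboid& prev, const std::vector<int>& keep) {
    Cuboid result;
    for (const auto& [key, count] : prev) {
        Key projected;
        for (int pos : keep) projected.push_back(key[pos]);
        result[projected] += count;    // fold the w source tuples at once
    }
    return result;
}
// e.g. projectCuboid(abcd, {1, 2, 3}) yields BCD; with a sorted container,
// keep = {0, 1, 2} is the prefix-affinity case and needs no re-sorting.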
Figure 3.8 shows a skeleton of Algorithm ASL. To implement prefix or subset affinity, one processor is designated the "manager", responsible for dynamically assigning the next task to a "worker" processor. Specifically, the manager:

• first tries to exploit prefix affinity, because when that is possible, the worker processor has no need to create a new skip list for the current task/cuboid. The previous skip list can be aggregated in a simple way to produce the result for the current task. This is executed by the subroutine prefix-reuse in Figure 3.8;

• then tries to exploit subset affinity, if prefix affinity is not applicable. Instead of scanning the dataset, the worker processor can use the previous skip list to create the skip list for the current task. This is executed by the subroutine subset-create in Figure 3.8;

• assigns to the worker a remaining cuboid with the largest number of dimensions, if neither prefix nor subset affinity can be arranged. In this case, a new skip list is created from scratch.

Clearly, the last situation ought to be avoided as much as possible. In our implementation of ASL, each worker processor keeps the first skip list it created. Because ASL is top-down, the first skip list corresponds to a cuboid with a large number of dimensions. This maximizes the chance of prefix and subset affinity.

1.  Algorithm ASL
2.  INPUT: Dataset R, cube dimensions {A1, ..., Am}, minimum support Spt
3.  OUTPUT: The 2^m cuboids of the data cube
4.  PLAN:
5.  Task definition: a cuboid in the cube lattice
6.  Processor assignment: a processor is assigned the next task based on prefix or subset affinity, as described in Section 3.3.2
7.  CUBE COMPUTATION (for a processor):
8.  parallel do
9.    let the task be with dimensions Ai, ..., Aj
10.   if Ai, ..., Aj is a prefix of the previous task or of the first task
11.     let C denote the skip list from that task
12.     call prefix-reuse(C, Spt, Ai, ..., Aj)
13.   else if {Ai, ..., Aj} is a subset of the set of dimensions of the previous task, or of the set of dimensions of the first task
14.     let C denote the skip list from that task
15.     call subset-create(C, Spt, Ai, ..., Aj)
16.   else call subset-create(R, Spt, Ai, ..., Aj)
    end do

17. Subroutine prefix-reuse(C, Spt, Ai, ..., Aj)
18.   aggregate C based on Ai, ..., Aj
19.   write out the cells if the support threshold is met

20. Subroutine subset-create(C, Spt, Ai, ..., Aj)
21.   initialize skip list L
22.   for each cell (tuple) in C do
23.     find the right cell in L (created if necessary)
24.     update the aggregate and the support counts accordingly
25.   end for
26.   traverse L, and write out the cells if the support threshold is met

Figure 3.8: A Skeleton of ASL

Affinity scheduling is very helpful for sort sharing, especially when the number of available processors is small. As more processors become available, however, the affinity relationship between tasks assigned to the same processor tends to weaken. For example, with 2 processors we may very likely assign both ABCD and ABC to one machine; with 16 machines this becomes unlikely, since we do not want machines to lie idle just to maintain strong affinity. In that case, one processor may compute ABCD and another ABC, and both would then need to sort on ABC: duplicated sorting occurs.

Since ASL's task scheduling is dynamic, depending on how soon each processor finishes its tasks, the lattice traversal sequence cannot be determined in advance; different runs will very likely result in different traversal sequences. This makes ASL quite different from other top-down algorithms, such as PipeSort or PipeHash.

3.4 Algorithm PT

By design, ASL does a very good job of load balancing. However, it may be vulnerable in two areas. First, the granularity of the tasks may be too fine, to the extent that too much overhead is incurred.
This is particularly true when prefix or subset affinity cannot be well exploited, so that little sort sharing is applicable. Second, ASL's top-down lattice traversal cannot prune cells lacking minimum support from the skip lists. As ASL executes, whether a cell has minimum support cannot be determined until the data set has been scanned entirely. Furthermore, at the end of the scan, even a cell below the minimum support cannot be pruned, because its support may contribute to the support of another cell in subsequent cuboid processing.

In an effort to combine the advantages of pruning in a bottom-up algorithm with the load balancing and sort sharing of top-down lattice traversal, we developed the algorithm called Partitioned Tree (PT). Recall that in RP and BPP, tasks are at the granularity of subtrees rooted at a certain dimension, for example T_A; in ASL, tasks are single nodes in the cube lattice. To strike a balance between the two definitions of tasks, PT works with tasks created by recursive binary division of a tree into two subtrees, each having an equal number of nodes. Binary division is achieved by simply cutting the leftmost edge emitted from the root of the BUC processing tree, or of a subtree in the recursive calls. For instance, the BUC processing tree shown in Figure 2.4(c) can be divided into two parts: T_A and T_all - T_A. A further binary division of T_A creates the two subtrees T_AB and T_A - T_AB. Similarly, a further division of T_all - T_A creates the two subtrees T_B and T_all - T_A - T_B. Figure 3.9 shows the four subtrees; each of them makes a task.

[Figure 3.9, omitted: the processing tree from ABCD down, divided into Task 1 (T_AB), Task 2 (T_A - T_AB), Task 3 (T_B) and Task 4 (T_all - T_A - T_B), with bottom-up cuboid computation inside tasks and top-down affinity scheduling across them.]

Figure 3.9: Binary Division of the Processing Tree into Four Tasks

Obviously, each time binary division is applied, two subtrees of equal size are produced. Through binary division, we finally achieve tasks of the same size and of appropriate granularity. Combining dynamic scheduling with binary division nicely solves the load balancing problem in PT.

Like ASL, PT also exploits affinity scheduling. During processor assignment, the manager tries to assign a task to a worker processor that can take advantage of prefix affinity based on the root of the subtree. Note that in this case subset affinity is not applicable. From this standpoint, PT is top-down. But interestingly, because each task is a subtree, the nodes within the subtree can be traversed and computed in a bottom-up fashion.

1.  Algorithm PT
2.  INPUT: Dataset R, cube dimensions {A1, ..., Am}, minimum support Spt
3.  OUTPUT: The 2^m cuboids of the data cube
4.  PLAN:
5.  Task definition: a subtree created by repeated binary partitioning
6.  Processor assignment: a processor is assigned the next task based on prefix affinity on the root of the subtree
7.  CUBE COMPUTATION (for a processor):
8.  parallel do
9.    let the task be a subtree T
10.   sort R on the root of T (exploiting prefix affinity if possible)
11.   call BPP-BUC(R, T, Spt, {})
12. end do

Figure 3.10: A Skeleton of PT
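The division itself can be sketched in a few lines. Representing a subtree by its root prefix plus the index of its first remaining child dimension is our own simplification, and divide is a hypothetical name; cutting the leftmost edge splits a subtree of 2^k nodes into two halves of 2^(k-1) nodes each.

#include <cstdio>
#include <deque>
#include <string>
#include <vector>

// A sketch of PT's recursive binary division, reproducing Figure 3.9's
// four tasks for a 4-dimensional cube. Splitting proceeds FIFO until
// there are ratio * n tasks, assuming ratio * n does not exceed 2^m.
struct Task { std::string prefix; int firstChild; };

std::vector<Task> divide(int m, int n, int ratio) {
    std::deque<Task> work{{"", 0}};          // the whole processing tree
    while (work.size() < static_cast<std::size_t>(ratio * n)) {
        Task t = work.front();
        if (t.firstChild >= m) break;        // front task is a single node
        work.pop_front();
        char dim = static_cast<char>('A' + t.firstChild);
        work.push_back({t.prefix + dim, t.firstChild + 1});  // full subtree
        work.push_back({t.prefix, t.firstChild + 1});        // chopped rest
    }
    return {work.begin(), work.end()};
}

int main() {
    for (const Task& t : divide(4, 1, 4))  // yields T_AB, T_A-T_AB, T_B, rest
        std::printf("root prefix '%s', children from A%d on\n",
                    t.prefix.c_str(), t.firstChild + 1);
}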
In fact, P T calls B P P - B U C , which offers breadth-first w r i t i n g , to complete the processing. In F i g u r e 3.9, the roots of each subtree, enclosed in boxes, actually make up a small tree. T h e scheduling just happens on this small tree, similar to A S L . Once a processor gets a task, that is, a subtree, it computes it in a b o t t o m - u p manner, much like c o m p u t i n g an R P ' s task. In this way, we seamlessly combine top-down and bottom-up methods, getting the benefits of both p r u n i n g and sort-sharing. F i g u r e 3.10 shows a skeleton of P T . T h e step t h a t requires elaboration is line 9, namely the exact definition of T • In general, as shown in F i g u r e 3.9, there are two types of subtrees handled by P T . T h e first type is a "full" subtree, which means that all the branches of the subtree are included. F o r example, TAB is  a  full  subtree. T h e second type is a "chopped" subtree, which means that some branches are not included. T h e subtrees TA — TAB and T„.u — TA — TB are examples. In line 11, depending on which type of subtree is passed on to B P P - B U C , B P P - B U C may execute in a slightly different way. Specifically, for the loop shown on line 22 in F i g u r e 3.5, if a full subtree is given, no change is needed. Otherwise, the loop needs  44  to skip over the chopped branches. Since P T treats each final subtree resulting from binary division as a task, in an extreme case binary division will eventually create a task as each node in the cube lattice, as in A S L . Since task granularity in A S L might be too fine, in P T a parameter is used to determine when binary division stops, thus defining how fine tasks can be. T h e parameter is set as the ratio of the number of tasks to the number of available processors. T h e higher the ratio, the better the load balancing but the less pruning can be explored in each. task.  D e t e r m i n i n g the parameter enables a  tradeoff between load balancing and pruning. In F i g u r e 3.9, the dotted line between " B o t t o m - U p C u b o i d C o m p u t a t i o n " and " T o p - D o w n Affinity Scheduling" depicts this tradeoff.  M o v i n g up the line means letter load balancing; moving down the  line means more p r u n i n g . P T wisely leaves this decision up to applications. F o r the experimental results presented later, we used the parameter "32n" to stop the division, once there are so many tasks (where n is the number of processors).  3.5  Hash-based Algorithms  We also implemented two hash-based C U B E algorithms. In the following, we will • briefly discuss t h e m .  3.5.1  Hash Tree Based A l g o r i t h m  T h i s algorithm was developed after B P P proved to ha,ve poor load balancing. Since B P P ' s performance is greatly affected by d a t a skewness, which we could not change, it appears there was no way to improve it. However, considering most Association Rules M i n i n g ( A R M ) algorithms proceed in a b o t t o m - u p fashion, also t a k i n g advantage of pruning, we then thought about a p p l y i n g the techniques of parallel A R M to C U B E c o m p u t a t i o n . The  p r o t o t y p i c a l application of A R M is a "market-basket analysis", where  the items represent products, and the records in a database represent point-of-sales  45  Items  Database Transactioi  Frequent itemsets (min_sup =  Jane Austen  A  Agatha Christie  C  1  Items  Sir Arthur Conan Doyle  D  2  CDW  Mark Twain  T  3  ACTW  P.G. 
Each record contains several items. The objective of ARM is to generate all rules with specified confidence and support. An example rule might be: "90% of customers buying products {A,B,C} also buy products {D,E}", where the confidence of the rule is 90%. In ARM terminology, A, B, C, D and E in this rule are called "items"; {A,B,C} and {D,E} are called "itemsets". Later, we use "k-itemset" to denote an itemset containing k items. Like CUBE, ARM also uses "support" to indicate how frequently an itemset occurs as a subset of the transactions. Users can specify a "minimum support" (minsup) and a "minimum confidence" (minconf) in their queries. Most ARM algorithms involve the following steps:

1. Find all frequent itemsets satisfying the specified minimum support.
2. Generate strong rules having minimum confidence from the frequent itemsets.

The first step of ARM is much like an iceberg-cube problem if we imagine that items are attributes with only one value. Then, generating all frequent itemsets means generating the group-bys of all dimensionalities above a specified threshold (minsup).

Consider the example bookstore-sales database shown in Figure 3.11. There are five different items (names of authors the bookstore carries), I = {A, C, D, T, W}. The database comprises six customers who bought books by these authors. Figure 3.11 shows all the frequent itemsets contained in at least three customer transactions, that is, with minsup = 50%. The figure also shows the set of all association rules with minconf = 100%.

[Figure 3.11, omitted: the items (Jane Austen = A, Agatha Christie = C, Sir Arthur Conan Doyle = D, Mark Twain = T, P.G. Wodehouse = W), the six transactions, the frequent itemsets at minsup = 50%, and the association rules at minconf = 100%.]

Figure 3.11: Frequent Itemsets and Strong Rules for a Bookstore Database [20]

The Apriori algorithm by Rakesh Agrawal and colleagues [20] has emerged as one of the best ARM algorithms, and it also serves as the base algorithm for most parallel algorithms. Apriori uses a complete, bottom-up search, iteratively enumerating from frequent 2-itemsets to higher-dimensional frequent itemsets. The algorithm has three main steps:

1. Generate candidates of length k from the frequent (k-1)-itemsets, by a self-join on F(k-1). For example, from F2 = {AC, AT, AW, CD, CT, CW, DW, TW} we get C3 = {ACT, ACW, ATW, CDT, CDW, CTW}.
2. Prune any candidate that has at least one infrequent subset. For example, CDT will be pruned because DT is not frequent.
3. Scan all transactions to obtain the candidate supports.

These three steps are repeated from k = 2 until no new frequent itemsets can be generated.
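The self-join and pruning steps can be sketched as follows for the bookstore example, with itemsets encoded as sorted strings. The join and prune are combined into one function here (the text above presents them as separate steps), genCandidates is a hypothetical name, and the input is assumed to be given in lexicographic order.

#include <set>
#include <string>
#include <vector>

using Itemset = std::string;   // items as sorted characters, e.g. "ACT"

// A sketch of Apriori candidate generation: join two frequent
// (k-1)-itemsets sharing their first k-2 items, then prune any
// k-candidate that has an infrequent (k-1)-subset.
std::vector<Itemset> genCandidates(const std::vector<Itemset>& freq) {
    std::set<Itemset> prev(freq.begin(), freq.end());
    std::vector<Itemset> out;
    std::size_t k1 = freq.empty() ? 0 : freq[0].size();   // this is k-1
    for (std::size_t i = 0; i < freq.size(); ++i)
        for (std::size_t j = i + 1; j < freq.size(); ++j) {
            if (freq[i].compare(0, k1 - 1, freq[j], 0, k1 - 1) != 0)
                continue;                                  // self-join step
            Itemset cand = freq[i] + freq[j].back();
            bool ok = true;                                // prune step
            for (std::size_t d = 0; d < cand.size() && ok; ++d) {
                Itemset sub = cand;
                sub.erase(d, 1);
                ok = prev.count(sub) > 0;
            }
            if (ok) out.push_back(cand);
        }
    return out;
}
// genCandidates({"AC","AT","AW","CD","CT","CW","DW","TW"}) yields
// {"ACT","ACW","ATW","CDW","CTW"}: "CDT" is joined but immediately
// pruned, since its subset "DT" is not frequent.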
Apriori stores the candidates in a hash tree for fast support counting. In a hash tree, itemsets are stored in the leaves; internal nodes contain hash tables (hashed by items) which direct the search for a candidate. The hash tree structure of Apriori is very efficient for candidate searching and insertion: once a transaction is read in, all of its subsets can be quickly computed and inserted into the hash tree if they are not there already. Figure 3.12 gives an example of the subset operation on the root of a candidate hash tree.

[Figure 3.12, omitted: a candidate hash tree whose root hashes the items A, B and C toward the itemsets BCDE, ABCEF, CEF and EF.]

Figure 3.12: Subset Operation on the Root of a Candidate Hash Tree [23]

Obviously, the bottom-up idea behind both Apriori and BUC is the same, except that BUC searches the tree in depth-first order while Apriori searches in breadth-first order. From this observation, we developed a CUBE algorithm with a hash tree structure similar to Apriori's, exploiting breadth-first searching in CUBE computation exactly as in Apriori. We kept the major structure of the Apriori algorithm and made only small modifications to accommodate CUBE computation. For example, since CUBE does not assume a single value per attribute (item in ARM), we built a global index table which treats all values of all attributes as items.

For a small data set, this algorithm is feasible. However, its performance proved unsatisfactory. Breadth-first searching creates too many candidates to be maintained in the hash tree. This is mainly because the global index table contains too many items (exactly the sum of the cardinalities of all CUBE attributes), which generates a huge number of candidates. If the CUBE is sparse, the situation is even worse. Although pruning eliminates many candidates, the hash tree is still a huge burden before pruning, and it quickly consumes all available memory. Unfortunately, we had to admit that this attempt failed. Since the performance of this hash tree based algorithm lags far behind the other algorithms, we omit it from the following discussion.

3.5.2 Hash Table Based Algorithm

After we finished the implementation of ASL, we tried using a hash table as an alternative data structure, to see whether better performance could be achieved. The result is the Affinity Hash Table based algorithm, or AHT for short.

As with PipeHash, AHT uses hash tables to maintain the cells of the nodes in the lattice, i.e., the group-bys. However, AHT avoids creating a hash table for each cuboid. Once subset affinity becomes applicable, it reuses the hash table created for the previous task. Specifically, AHT builds an index which makes it possible to collapse the previous hash table whenever subset affinity is found.

For this purpose, each CUBE attribute is assigned several bits which, when combined, form the complete index of the buckets in a hash table. For example, for a 3-dimensional CUBE with attributes A, B and C, we give A three bits, B two bits and C one bit. The hash table's index then has 6 bits (in binary), and the size of the hash table will be 2^6. Whenever a tuple (cell) is read in, its location in the hash table is determined by its values for the CUBE attributes: the first three bits of its index are decided by the value for A, the next two bits by the value for B, and the last bit by the value for C.

The number of bits assigned to each attribute depends both on the cardinality of that attribute and on how many tuples are in the raw data set. Originally, the number of bits assigned to an attribute X is log2(card(X)), where card(X) is the cardinality of X.
This implies that the length of the hash table would be the product of the cardinalities of all attributes. However, if the data set is sparse, this product is much larger than the size of the data set. In that case, the bits assigned to each attribute are shrunk appropriately in order to define a smaller index. A smaller index, however, may introduce collisions: here we simply trade memory occupation for run time. This tradeoff introduces severe bucket collisions when many cells must be maintained in the hash table, and it degrades AHT's performance badly, especially when the problem size increases or a high-dimensional CUBE must be computed. We discuss this further in Chapter 4.

Like ASL, AHT takes each group-by as a task. AHT's task scheduling is almost the same as ASL's, except that AHT does not treat prefix affinity differently from general subset affinity. If a new task's GROUP BY attributes form a subset of those of the previous task, then the hash table already built contains all the cells needed for the new task. So, instead of creating a new hash table, we shrink the existing one by collapsing some buckets. Continuing the example above, suppose we have built the hash table for cuboid ABC and now get a new task for cuboid AC. The buckets xxx00x, xxx01x, xxx10x and xxx11x are collapsed into xxx00x, with the aggregate and the support updated at xxx00x. The attributes missing from the new task (but found in the previous one) determine how many and which buckets are collapsed. In this example, B is the missing attribute; since two bits (the fourth and the fifth of the index) are assigned to B, four buckets are collapsed into one.

Since the hash table does not maintain cells in any particular sort order, no sorting is needed in AHT. If a sorted cuboid is required, the sorting is done online when the user issues the query; we call this post-sorting.

The skeleton of AHT is shown in Figure 3.13.

1.  Algorithm AHT
2.  INPUT: Dataset R, cube dimensions {A1, ..., Am}, minimum support Spt
3.  OUTPUT: The 2^m cuboids of the data cube
4.  PLAN:
5.  Task definition: a cuboid in the cube lattice
6.  Processor assignment: a processor is assigned the next task based on subset affinity
7.  CUBE COMPUTATION (for a processor):
8.  parallel do
9.    let the task be with dimensions Ai, ..., Aj
10.   if {Ai, ..., Aj} is a subset of the set of dimensions of the previous task, or of the set of dimensions of the first task
11.     let C denote the hash table from that task
12.     call subset-collapse(C, Spt, Ai, ..., Aj)
13.   else call subset-newHashTable(R, Spt, Ai, ..., Aj)
    end do

14. Subroutine subset-collapse(C, Spt, Ai, ..., Aj)
15.   collapse C based on Ai, ..., Aj
16.   write out the cells if the support threshold is met

17. Subroutine subset-newHashTable(C, Spt, Ai, ..., Aj)
18.   initialize a hash table H
19.   for each tuple in C do
20.     find the right cell in H (created if necessary)
21.     update the aggregate and the support counts accordingly
22.   end for
23.   traverse H, and write out the cells if the support threshold is met

Figure 3.13: A Skeleton of AHT
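The bit-packed index and the collapse operation can be sketched as follows, using the 3-bit/2-bit/1-bit layout of the example above. The fixed-width fields and the count-only payload are simplifications of our own; collisions caused by values wider than their fields are simply ignored here.

#include <cstddef>
#include <vector>

// A sketch of AHT's bucket index: A occupies the high 3 bits, B the next
// 2, C the last 1, giving a table of 2^6 buckets. Collapsing ABC to AC
// merges the 2^2 buckets that differ only in B's bits.
struct HashCube {
    std::vector<int> bits{3, 2, 1};    // bits per attribute A, B, C
    std::vector<long> bucket = std::vector<long>(1 << 6, 0);

    std::size_t index(const std::vector<int>& tuple) const {
        std::size_t idx = 0;
        for (std::size_t d = 0; d < bits.size(); ++d)
            idx = (idx << bits[d]) | (tuple[d] & ((1 << bits[d]) - 1));
        return idx;
    }
    void add(const std::vector<int>& tuple) { ++bucket[index(tuple)]; }

    // Collapse away attribute `drop`, reusing the table for the sub-cuboid.
    void collapse(std::size_t drop) {
        int low = 0;                   // bits to the right of `drop`
        for (std::size_t d = drop + 1; d < bits.size(); ++d) low += bits[d];
        std::size_t mask = ((1u << bits[drop]) - 1) << low;
        for (std::size_t i = 0; i < bucket.size(); ++i)
            if (i & mask) { bucket[i & ~mask] += bucket[i]; bucket[i] = 0; }
    }
};
// collapse(1) drops B: buckets xxx00x, xxx01x, xxx10x and xxx11x are
// merged into xxx00x, matching the ABC-to-AC example in the text.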
Chapter 4

Experimental Evaluation

In this chapter, we give a performance comparison of the five algorithms RP, BPP, ASL, PT and AHT. The hash tree based algorithm is included neither in this testing nor in the following discussion, because its performance lags far behind the others. In order to give a fair evaluation, we investigate the algorithms' memory occupation before explaining the testing environment, and then give our test results.

4.1 Memory Occupation

In four of the algorithms, RP, ASL, PT and AHT, the raw data set is replicated among processors. Conversely, BPP partitions the raw data set and distributes the partitions among processors.

Let us first discuss the data replication based algorithms. In the simplest algorithm, RP, each processor loads the whole replicated data set R into its main memory as a large array for later computation, according to the task assigned to it. RP therefore needs only space the size of R in the main memory of each processor. The other array based data replication algorithm, PT, is similar: its memory footprint is not much larger than R per processor.

AHT uses hash tables to maintain only the cuboids. Since the cells in a cuboid can be fewer than the tuples in the data set, a hash table may well be much smaller than the data array of an array based algorithm. Besides the cells, AHT must also keep the index table of the hash table in memory. The index table is of fixed size in AHT; in other words, the number of buckets in the hash table is fixed, and this number greatly affects AHT's performance. In order to make the experimental evaluation reliable, we set the number of buckets to the number of tuples in R. Therefore, AHT's memory footprint is not much more than R: in the extreme case where a cuboid contains all tuples of the raw data set, each AHT processor needs main-memory space for |R| cells plus the |R| indices of a hash table.

The memory footprint of ASL is the biggest of all the algorithms. ASL takes skip lists as its data structure, and the memory overhead of each skip list node is mainly decided by the maximum number of forward links a node is allowed to have. In our algorithm, we allow no more than 16 forward links per node. Therefore, a node's memory footprint is no more than twice the size of an array element in the array based algorithms such as RP. Like AHT, ASL does not load the entire data set into memory, but only maintains cuboids as skip lists; thus a skip list may be smaller than a data array. Even in the extreme case where a cuboid contains the whole data set, its skip list would be no more than twice the size of R. In addition to the current working skip list, each processor maintains a "root" skip list in its main memory, to maximize sort sharing among local tasks. In the extreme case, then, ASL's memory footprint is no more than four times R, for the two skip lists in each processor's memory.

The data partitioning based algorithm, BPP, is the most memory-efficient of all. Since each processor works only on local chunks, its memory footprint is the maximum size of its local chunks. Even in the extreme case where range partitioning on an attribute produces only one chunk, the memory footprint is no more than R.
4.2 Experimental Environment

The experiments were carried out on a heterogeneous PC cluster, consisting of eight 500MHz PIII processors with 256M of main memory and eight 266MHz PII processors with 128M of main memory. Each machine has a 30Gbyte hard disk attached and is connected to a 100Mbit/sec Ethernet network.

The CUBE computations were performed on a weather data set containing weather conditions reported by various weather stations on land. The data set is the same as that used by Ross and Srivastava [14], and by Beyer and Ramakrishnan [4]. It has 20 dimensions, and it is very skewed on some of them. For example, partitioning the data on the 11th dimension produces one partition which is 40 times larger than the smallest one.

In order to compare the effect of varying the different parameters of the problem, we used a fixed setting and then varied each of the parameters individually. The fixed setting, or baseline configuration, for testing the algorithms was the following:

• the eight 500MHz processors;
• 176,631 tuples (all from real data);
• 9 dimensions chosen arbitrarily (but with the product of the cardinalities roughly equal to 10^13);
• minimum support set at two.

For the dynamic scheduling algorithms ASL, PT and AHT, we overlapped the manager and one worker on one processor. This maximized the usage of the processor on which the manager resided, leading to a reasonable performance evaluation.

In the experiments, we investigated how the algorithms perform under different circumstances. We are concerned with the following issues in CUBE computation:

• load balancing, tested by comparing the loads on the individual processors;
• scalability with processors, tested by varying the number of processors;
• scalability with problem size, tested by varying the problem size;
• scalability with dimensions, tested by varying the number of dimensions;
• pruning effects, tested by varying the minimum support;
• accommodation of sparse CUBE computation, tested by varying the sparseness of the data set.

In the following figures, "wall clock" time means the maximum time taken by any processor. It includes both CPU and I/O cost.

4.3 Load Distribution

Figure 4.1 shows the load distribution among processors for the baseline configuration. ASL, AHT and PT have quite even load distributions, while the loads distributed to the processors by RP and BPP vary greatly. For RP, the uneven load distribution is due to its static task assignment: although the number of tasks per processor is approximately equal, the amount of computation and I/O for the tasks differs significantly. For BPP, the dataset is partitioned statically across all nodes; because the data is very skewed on some of the dimensions, the computation is not well balanced. ASL, AHT and PT decrease the granularity of the tasks, to a single cuboid in ASL and AHT and to a small subtree in PT. The finer granularity leads to better load balancing, and the use of demand scheduling makes it easier to stay balanced even when the data set is very skewed.

[Figure 4.1, omitted: wall clock time on each of the 8 computing nodes for the five algorithms.]

Figure 4.1: Load Balancing on 8 Processors

4.4 Varying the Number of Processors

Figure 4.2 shows the performance of the algorithms when running on different numbers of processors.

[Figure 4.2, omitted: speedup comparison as the number of processors grows from 2 to 16.]

Figure 4.2: Scalability

The performances are largely determined by each algorithm's load balancing ability; generally, the better the load balance, the better the performance. RP's performance is the worst, no matter how many processors are used. Besides poor load balancing, RP's depth-first writing strategy exacerbates its poor performance.

BPP does well when running on only 2 processors, where the data partitioning is quite even. However, as more processors are added to the computing environment, the data partitioning becomes uneven. Uneven tasks with coarse granularity quickly upset load balancing, and BPP is overtaken by ASL once four processors are available.

The performance of ASL is poor when run on only two processors. This is largely due to the overhead of creating and maintaining skip lists. As the number of processors increases, ASL gains from good load balancing and scales very well. AHT's performance is similar to ASL's, because their task definition and scheduling are almost the same. PT shows the best performance overall, due to both good load balancing and pruning.

ASL, AHT and PT use affinity scheduling to take advantage of shared sorts to reduce computation. As mentioned in Section 3.3.2, the affinity relationship among local tasks on one processor tends to weaken as the number of processors increases. It is interesting to note that the speedup from eight processors to sixteen processors is relatively negligible.

4.5 Varying the Problem Size

Figure 4.3 shows that with increasing problem size, PT and ASL do significantly better than the other algorithms. Both PT and ASL appear to grow sublinearly as the number of tuples increases. This is due to two factors. First, there is an overhead for creating the 2^9 cuboids which is independent of the amount of data. Second, doubling the number of tuples does not change the cardinality of the dimensions (except for the date field) and does not imply twice the amount of I/O, since more aggregation may take place.

It is possible to use more processors to solve a fixed problem faster, or to solve a larger problem in the same amount of time. The results in Figure 4.3 show that PT and ASL scale well with problem size, and they indicate that these algorithms could be used, given sufficient memory and disk space, to solve larger problems on larger cluster machines.
T h i s damages A H T ' s performance severely.  O n the other hand, the d a t a d i s t r i b u t i o n in the raw d a t a set  d r a m a t i c a l l y affects how many collisions may occur. T h i s leads to inconsistent scalability in A H T .  4.6  V a r y i n g the N u m b e r of D i m e n s i o n s  F i g u r e 4.4 shows the effect of increasing the number of dimensions on each algorithm.  T h e wall clock time increases d r a m a t i c a l l y as the number of dimensions'  increases, because the number of cuboids grows exponentially w i t h dimension size.  60  For example, the 13-dimensional C U B E has 8,192 cuboids. The  scalability of A H T w i t h C U B E dimensions is the worst of all the al-  gorithms. In fact, in our testing, when the number of C U B E dimensions is set as 13, the hash table size was fixed as ten times the size of the i n p u t d a t a set, t h a t is ten times larger than t h a t in the baseline configuration. E v e n then, A H T ' s performance is very poor. T h e r e are two m a i n reasons c o n t r i b u t i n g to this effect. F i r s t , as high dimensional C U B E needs to be computed, a large number of cells need to be maintained in the hash table. T h i s introduces a great amount of collisions w i t h i n in buckets d u r i n g insertion and searching operations. Second, since the size of the hash table is fixed, the index bits assigned to each C U B E attribute are far from adequate to appropriately collapse the hash table when subset affinity is applicable. If the d a t a set is skewed on some C U B E attributes, the hash function behaves even poorer. The  relative performance for the other four algorithms remains the same  except for A S L , where for thirteen dimensions it stops being better than B P P . A S L is affected more than other algorithms because of its comparison operation.  The  comparison operation used to search and insert cells into the skip list becomes more costly as the length of the key increases. T h e length of the key grows linearly with the number of dimensions. T h i s is a significant source of overhead for A S L . F i g u r e 4.4 also shows that when the number of dimensions is small, R P , A S L , A H T and P T all give similar performances. Because the size of the output is small for a small number of dimensions, the simple R P a l g o r i t h m can keep up to the others.  4.7  V a r y i n g the M i n i m u m S u p p o r t  F i g u r e 4.5 shows the effect of increasing the m i n i m u m support.  A s the m i n i m u m  support increases, there is more pruning, and as a result, less I / O . T h e t o t a l output size for the algorithms given in F i g u r e 4.5 starts at 4 6 9 M b y t e for a support of  61  Figure 4.5: Results for varying the minimal support  62  one, 8 6 M b y t e for a support of two, 2 7 M b y t e s for a support of four, and U M b y t e s for a support of eight.  After eight, very little additional pruning occurs.  Except  between one and two, the output size does not appear to have much affect on overall performance.  T h i s is surprising since we expected P T to do better as support  increased, because more pruning should have led to less c o m p u t a t i o n . T h e relative flatness of the curve for P T is largely due to the order of the dimensions choosen. For the baseline configuration, the pruning occurs more towards the leaves, where it does not save as much in c o m p u t a t i o n time. 
Notice A S L and A H T can not prune d u r i n g c o m p u t a t i o n ; their better performance w i t h higher m i n i m u m support is due only to less I / O cost but not to pruning.  4.8  Varying the Sparseness of the Dataset  F i g u r e 4.6 shows the effect of sparseness of the d a t a set on the four algorithms. We consider a d a t a set to be sparse when the number of tuples is small relative to the product of the number of distinct attribute values for each dimension in the C U B E . Since the number of tuples in the baseline configuration is fixed, we can vary the sparseness of the d a t a set by choosing smaller dimensions over larger cardinality dimensions. T h e three d a t a sets chosen for F i g u r e 4.6 consisted of the nine dimensions w i t h the smallest cardinalities, the nine dimensions w i t h the largest cardinalities, and one in between. Note t h a t even for the smallest of the three, there are still about 10' possible total cells in the cube. A s shown in F i g u r e 4.6, A H T is apparently more affected by sparseness than the other algorithms.  T h e more C U B E dimensions, the more collisions happen,  which badly hamper A H T ' s performance. If few collisions occurs, as when dimensionality is low, A H T outperforms all others. A H T and A S L perform well on dense datasets and are more adversely affected by spareness than others. A S L performs well for dense datasets because each c u b o i d  63  Varying the Exponent of Cardinality Product of Cube Dimensions 10001  6  1  8  1  1  1  1  1  10 12 14 16 18 Cardinality Product of Cube Dimensions (Exponent of 10)  20  F i g u r e 4.6: Results for v a r y i n g the sparseness of the dataset  64  22  Situations  PT  dense cubes small dimensionality (< 5) high dimensionality less memory occupation otherwise  v v  online support  7  ASL  RP  V V  V  BPP  AHT  V V  7  vV  V v  7  F i g u r e 4.7: Recipe for selecting the best a l g o r i t h m  contains relatively few cells, which makes searching or inserting into a skip list relatively fast. T h e B U C - b a s e d algorithms have little o p p o r t u n i t y to take advantage of density. In fact, the denser the dataset, the less pruning can be done. A s a result, while traversing the lattice, the B U C - b a s e d algorithms need to sort almost the entire dataset for many of the cuboids. B P P does particularly poorly for cube dimensions w i t h s m a l l cardinalities because B P P cannot partition the d a t a very evenly, which leads to serious load imbalance. A S L does worse than the B U C - b a s e d algorithms when the product of the cardinalities is high, partly because of the amount of pruning that occurs for the B U C - b a s e d algorithms, and partly because A S L has to maintain larger skip lists.  4.9  Summary  4.9.1  Recipe R e c o m m e n d e d  T h e experimental results shown thus far explores the different parameters affecting overall performance. After careful e x a m i n a t i o n , we recommend the "recipe" shown in F i g u r e 4.7 for selecting the best a l g o r i t h m in various situations. It is clear that A H T and A S L dominate all other algorithms when the cube is dense, or when the total number of cells in the d a t a cube is not too high (e.g., < 1 0 ) . However, A H T is more adversely affected by sparseness and dimensionality. 8  For d a t a cubes with a small number of dimensions (e.g., < 5), almost all algorithms  65  behave similarly. 
In this case, RP may have a slight edge, in that it is the simplest algorithm to implement. For all other situations, except when the data cube has a large number of dimensions, PT, AHT and ASL are relatively close in performance, with PT typically a constant factor faster than AHT and ASL. For cubes of high dimensionality, there is a significant difference among the three, and PT should be used.

The last entry in Figure 4.7 concerns online support. This is the topic of the next chapter.

4.9.2 Further Improvement

There is still room for improvement in some of the algorithms. With affinity scheduling, the current prefix and subset affinity could be extended to cooperate with the sort-overlapping idea behind the Overlap algorithm mentioned in Chapter 2. Then, even if we cannot assign a processor a task whose CUBE dimensions are perfectly prefixed by the previous task, we can try to assign a task sharing the longest possible prefix with the previous task. This may improve the performance of ASL. For AHT, we could attempt a more sophisticated hash function instead of the naive MOD hash function currently used. A better hash function may relieve AHT's struggling performance on sparse and high-dimensional CUBE computations.

Chapter 5

Online Aggregation

Recall that the CUBE computation is just a precomputation designed to respond instantly to online iceberg queries. Sometimes, however, a user's query cannot be answered from the precomputed CUBE: when the minimum support of the online query is lower than that of the precomputation, it is no longer possible to compute the query, essentially a cuboid, from a precomputed cuboid. This problem can be addressed in two ways. First, we can choose a small minimum support for the precomputation, so that most queries can be answered by aggregating from a precomputed cuboid. Second, we can simply aggregate from the raw data set to answer an unpredictable query online. In the following sections, we discuss these two methods in turn.

5.1 Selective Materialization

A CUBE with low constraints usually produces a large body of results; the computation may take a long time, and the results may not even fit on disk in their entirety. To solve this problem, it is natural to consider materializing only a subset of the cuboids instead of all of them. Although our experiments show that in many cases our parallel algorithms can compute an entire iceberg-cube query from scratch quickly (e.g., in under 100 seconds), selective materialization can still help significantly for truly online processing.

As an exercise, we compared two different plans for answering online queries using ASL. The first plan simply re-computes the query with the specified minimum support. If the minimum support were two, as in Figure 4.5, ASL would take approximately sixty seconds to complete the entire CUBE. The second plan consists of a precomputation stage and an online stage. In the precomputation stage, ASL computes only the leaves of the traversal tree, using the smallest minimum support (i.e., 1). In the online stage, ASL uses top-down aggregation and returns the cells satisfying the newly specified support. In this second stage, ASL can return results almost immediately; interestingly, even the precomputation took only fifty seconds for the same example. (The value of fifty seconds was obtained from an additional experiment, not directly from Figure 4.5; the values in Figure 4.5 include the total time for all nodes in the tree, not just the leaves.) This suggests that even simple selective materialization can help. It is a topic of future work to develop more intelligent materialization strategies.
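The online stage of the second plan can be sketched as follows: the materialized leaf cuboid is rolled up to the queried dimensions, and the user-specified support is applied at query time. The map-based representation and the function name are illustrative assumptions, not the thesis code.

#include <map>
#include <vector>

using Key = std::vector<int>;
using Cuboid = std::map<Key, long>;

// A sketch of the two-stage plan: the leaf cuboid (all m dimensions,
// minimum support 1) is materialized once; a query for any subset of
// dimensions `dims` and any support is then answered by aggregating the
// leaf cells rather than re-scanning the raw data.
Cuboid answerOnline(const Cuboid& leaf, const std::vector<int>& dims,
                    long minsup) {
    Cuboid grouped, answer;
    for (const auto& [cell, count] : leaf) {
        Key k;
        for (int d : dims) k.push_back(cell[d]);
        grouped[k] += count;                 // roll the leaf cells up
    }
    for (const auto& [cell, count] : grouped)
        if (count >= minsup) answer[cell] = count;   // apply the new support
    return answer;
}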
5.2 Online Aggregate from a Raw Data Set

Besides selective materialization, in this thesis we also consider computing online aggregates from the raw data set, thereby providing a comprehensive solution to the iceberg query problem. Hellerstein, Haas and Wang proposed an online aggregation framework [11] in which sampling is applied to give an instant response and further progressive refinement. We adopt this framework in our online aggregation algorithm, allowing a user to observe the progress of a query and dynamically direct or redirect the computation. In the case of an iceberg query, the user sees a rough initial cuboid which becomes more accurate as more tuples are processed.

As in ASL, we take the skip list as the fundamental data structure, making it possible to construct a cuboid by incrementally inserting tuples into the skip list; each tuple can therefore be handled independently. For incrementally building a cuboid, the hash tables used in AHT provide a possible alternative. However, since their performance is too sensitive to dimensionality and data sparseness, as seen in Chapter 4, a hash table does not make a good data structure for the online aggregation algorithm. The array based algorithms, RP, BPP and PT, are also difficult to extend to online processing, mainly because an array does not efficiently support incremental insertion: query results computed from new data have to be merged with the results from the old data, and the merging operations introduce additional overhead and do not parallelize well. In fact, the online advantages of ASL over the other algorithms were one of the main motivations for its development. In the following sections, we present our Parallel OnLine aggregation algorithm (POL).

5.3 Parallel Online Aggregation

5.3.1 Data Partitioning and Skip List Partitioning

Online aggregation implies that only one group-by need be computed. Usually, computing one group-by is not time consuming; the computation is much smaller than computing the full CUBE. To necessitate parallel computation, we assume the raw data set is huge, in two respects. First, the raw data set is range-partitioned across processors without any sorting: with n processors in the cluster, n partitions R1 to Rn are produced, and processor Pj gets Rj. Second, no processor can load its local data partition entirely into main memory. A processor has to proceed with the computation step by step: at each step, one block of data from its local partition is loaded and computed. The data block is in fact a sample taken from the unprocessed part of the processor's local data partition.

In order to utilize all available machines in the cluster, POL range-partitions the skip list into n partitions as well, where n is the number of processors; each processor maintains one skip list partition. POL determines the boundaries of the skip list partitions assigned to the processors at the beginning of the computation, through sampling. Afterward, a processor is responsible only for searching and inserting cells within the boundaries of its own skip list partition. As a processor scans its local data partition, which is unsorted, it encounters tuples that belong to skip list partitions maintained by other processors; it then passes those tuples to the appropriate processors. With n processors in the cluster, a processor might pass (n-1)/n of its local data to other processors, so data communication overhead is introduced.
Like some of the CUBE algorithms, POL has a manager, responsible for task scheduling, and many workers, responsible for computing (in this case, computing aggregations). The manager initially assigns a number of tasks to each processor; once a processor finishes its assigned tasks, it can then help other processors finish theirs. Initially, processor Pi is assigned task(Chunk_ij) (j from 1 to n). For example, with four processors, P2 is required to finish task(Chunk_21), task(Chunk_22), task(Chunk_23) and task(Chunk_24). For some of these tasks, P2 has to fetch the appropriate chunks located on other processors: P2 needs Chunk_21 on P1 to finish task(Chunk_21), Chunk_23 on P3 to finish task(Chunk_23), Chunk_24 on P4 to finish task(Chunk_24), and the local Chunk_22 to finish task(Chunk_22). The sequence in which processor Pi computes its assigned tasks is this: from task(Chunk_ii) up to task(Chunk_in), then wrapping around from task(Chunk_i1) to task(Chunk_i(i-1)). This sequence maximizes the chance that, at any one time, the processors are working on data located on different processors, and so reduces the chance of a burst of data requests hitting one particular processor. Figure 5.1 illustrates this initial task assignment in POL for a computing environment of 4 processors.

To balance the load, a processor is allowed to take over waiting tasks from busy processors once it has finished its own. The manager tries to assign it those untouched tasks whose input data chunks are already local to that processor. The processor then computes a new skip list for such a task. Once it finishes the task, or receives a data request from the processor responsible for the task, it passes on the skip list it has built so far; that processor then merges the skip list into its local skip list partition. This method of task scheduling introduces no additional data communication overhead.

To provide constant updates of the query results, a timer periodically triggers a response to the user: whenever the timer expires, the manager collects results from all workers, displays them on screen, and resets the timer. If the user wishes to discontinue the computation, he or she can interrupt it at any point.
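The wrap-around ordering described above amounts to a simple rotation of the chunk indices. A minimal sketch, with 0-based indices and a function name of our choosing:

#include <vector>

// Order in which worker i visits its n assigned tasks task(Chunk_i,j):
// start with the locally held chunk (j = i), proceed up to j = n-1, then
// wrap around through j = 0 .. i-1.  Because every worker starts at a
// different j, simultaneous requests for remote chunks are spread across
// the cluster rather than converging on one processor.
std::vector<int> task_order(int i, int n) {
    std::vector<int> order;
    order.reserve(n);
    for (int k = 0; k < n; ++k)
        order.push_back((i + k) % n);
    return order;
}

For four workers this yields the visiting orders 0-1-2-3, 1-2-3-0, 2-3-0-1 and 3-0-1-2; at every position in the sequence, the four workers ask four different processors for their chunks.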
A skeleton of algorithm POL is shown in Figure 5.2.

Algorithm POL
INPUT:  Range-partitioned data sets, each located on one processor
        (processor Pi has partition Ri);
        GROUP BY dimensions {A1, A2, ..., Am} and minimum support Spt
OUTPUT: The iceberg query results

ONLINE AGGREGATE:
  the manager takes a sample and determines the boundaries of the
      skip list partitions assigned to each processor
  parallel do
    while (not all data has been processed)
      if (worker processor Pi)
        loads in one block of the samples from its local partition
            which have not been processed
        groups the samples into n chunks, Chunk_1i to Chunk_ni
        calls online-slave(Chunk_1i, ..., Chunk_ni, Spt, A1, ..., Am)
      if (the manager)
        defines n x n tasks, one for each chunk on the workers
        schedules the tasks, as described in Section 5.3.2
      synchronize
    end while
  end do

Subroutine online-slave(Chunk_1i, ..., Chunk_ni, Spt, A1, ..., Am)
  gets a task from the manager; if no uncompleted task remains at the
      manager, return
  if the task is task(Chunk_ij)
    asks processor Pj for the chunk Chunk_ij
    updates the local skip list based on Chunk_ij
  if the task is task(Chunk_ji)
    computes a new skip list from the chunk Chunk_ji
    sends the skip list to Pj, which merges it into its local skip
        list partition
  during processing, if a request arrives from another processor asking
      for a data chunk, sends the chunk to that processor
  during processing, if a request arrives from the manager for the
      current result, estimates the current minimum support, collects
      the result and sends it to the manager
  during processing, if a request arrives from the manager to stop the
      computation, return

Figure 5.2: A Skeleton of the POL Algorithm

5.4 Experimental Evaluation

The testing environment for POL is similar to that for the CUBE algorithms, except that we based our experiments on a larger weather data set, containing 1,000,000 tuples. Although this data set is larger, it has the same number of dimensions as the smaller one used for testing the CUBE algorithms. We focused on the following issues during the experiments:

• scalability with the number of processors;

• scalability with the buffer size on each processor.

5.4.1 Varying the Number of Processors

Figure 5.3 shows the performance of POL with different numbers of processors. In this test, a 12-dimensional iceberg query was answered online. The minimum support was set to 2, and the buffer on each processor was set to hold 8000 tuples at each step. The computation created a huge skip list with 924,585 nodes. The performance of POL was tested on three clusters of machines:

• Cluster1 consists of eight 500MHz PIII processors with 256M of main memory connected by an Ethernet network;

• Cluster2 consists of eight 266MHz PII processors with 128M of main memory connected by an Ethernet network;

• Cluster3 consists of eight 266MHz PII processors with 128M of main memory connected by a higher speed network, Myrinet, which is approximately three times faster than the Ethernet used in the first two clusters.

Figure 5.3: POL's Scalability with the Number of Processors

Data communication among the worker processors is the main factor affecting POL's performance. If the data distribution is uniform, nearly (n-1)/n of the data each processor needs is located on other processors, where n is the number of processors; clearly, the higher n is, the more data must be transferred over the network. On the other hand, adding more machines decreases the computation carried out at each processor, because the work load is shared. Whether better overall performance can be achieved with more processors therefore remains uncertain: it depends largely on how much the computation decreases versus how much the communication increases on each processor, and on which is the dominating factor. Generally, the more time spent on computation relative to communication, the better the achievable performance. Computing high dimensional queries always implies more computation, because a large skip list must be maintained. We can therefore conclude that POL is especially feasible for computing high dimensional queries.
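This tradeoff can be made concrete with a back-of-the-envelope model; the model and the per-tuple costs below are our own illustrative assumptions, not measurements from the thesis. With D tuples in total, the aggregation work is split n ways, while the network must carry the roughly D(n-1)/n tuples that belong to remote skip list partitions.

#include <cstdio>

// Toy model of POL's total run time: per-worker aggregation shrinks as 1/n,
// while network traffic grows toward D * t_net as n increases.  t_proc and
// t_net are hypothetical per-tuple costs, not measured values.
double total_time(double D, int n, double t_proc, double t_net) {
    double compute  = D * t_proc / n;              // aggregation per worker
    double transfer = D * ((n - 1.0) / n) * t_net; // tuples shipped remotely
    return compute + transfer;
}

int main() {
    // With a high compute cost per tuple (slow CPUs, or a high dimensional
    // query maintaining a large skip list), the compute term dominates and
    // extra processors keep paying off.
    for (int n = 1; n <= 8; ++n)
        std::printf("n=%d  estimated time = %5.1f s\n",
                    n, total_time(1e6, n, 1e-4, 2e-5));
    return 0;
}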
Figure 5.3 shows that the speedup achieved on Cluster2 and Cluster3 is better than that on Cluster1, mainly because on the clusters of slower machines the computation takes up a larger share of the total run time than on the cluster of fast machines.

Concerning load balancing, the dynamic offloading of tasks from busy processors can balance uneven load resulting from unevenly distributed data. However, if both the skip list partitioning and the data distribution are uneven, the load may be poorly balanced. Fortunately, this adverse situation did not arise in our tests.

5.4.2 Varying the Buffer Size

The buffer size limits the amount of data processed at each step. The larger the buffer, the fewer steps POL needs, and thus the fewer synchronizations and sampling operations happen between steps; synchronization and sampling generally introduce overhead. Therefore, as shown in Figure 5.4, performance improves as the buffer size increases.

Figure 5.4: Scalability with Buffer Size

Chapter 6

Conclusion

In this thesis we discussed a collection of novel parallel algorithms, developed for both online and offline creation of the CUBE to support iceberg queries. We evaluated the CUBE algorithms RP, BPP, PT, ASL and AHT across a variety of parameters to determine the situations in which each works best. RP has the advantage of being simple to implement; however, except for cubes of low dimensionality, RP is outperformed by the other algorithms. BPP is also outperformed, but it reveals that breadth-first writing is a useful optimization. As an extension of BPP, PT is the algorithm of choice in most situations. There are, however, two exceptional situations in which ASL and AHT are recommended: both are more efficient for dense cubes, and ASL in addition supports sampling and progressive refinement.

For online aggregation, we tested our algorithm, POL, on aggregating online over a large data set. Experiments revealed that POL behaves well on a cluster of machines connected by high speed networks, and is valuable in answering high dimensional online queries, which require more time to complete.

In future work, we will investigate how the lessons we have learned regarding parallel iceberg query computation can be applied to other tasks in OLAP computation and data mining, including (constrained) frequent set queries [24] and OLAP computation that takes into account correlations between attributes.

Bibliography

[1] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, New York, 1997.

[2] R. Agrawal, S. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 VLDB, pp. 506-521.

[3] E. Baralis, S. Paraboschi and E. Teniente. Materialized view selection in a multidimensional database. In Proc. 1997 VLDB, pp. 98-112.

[4] K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In Proc. 1999 ACM SIGMOD, pp. 359-370.
[5] M. Eberl, W. Karl, C. Trinitis and A. Blaszczyk. Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications. In Proc. 6th European Parallel Virtual Machine/Message Passing Interface Conference, LNCS vol. 1697, pp. 493-498, 1999.

[6] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani and J. Ullman. Computing iceberg queries efficiently. In Proc. 1998 VLDB, pp. 299-310.

[7] S. Goil and A. Choudhary. High Performance OLAP and Data Mining on Parallel Computers. In The Journal of Data Mining and Knowledge Discovery, 1, 4, pp. 391-418, 1997.

[8] J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. In Proc. 1996 ICDE, pp. 152-159.

[9] H. Gupta, V. Harinarayan, A. Rajaraman and J. Ullman. Index selection for OLAP. In Proc. 1997 ICDE, pp. 208-219.

[10] V. Harinarayan, A. Rajaraman and J. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM SIGMOD, pp. 205-216.

[11] J. Hellerstein, P. Haas and H. Wang. Online Aggregation. In Proc. 1997 ACM SIGMOD, pp. 171-182.

[12] M. Kamber, J. Han and J. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 1997 KDD, pp. 207-210.

[13] Y. Zhao, P. Deshpande and J. F. Naughton. An Array-based Algorithm for Simultaneous Multidimensional Aggregates. In Proc. 1997 ACM SIGMOD, pp. 159-170.

[14] K. Ross and D. Srivastava. Fast Computation of Sparse Datacubes. In Proc. 1997 VLDB, pp. 116-125.

[15] S. Sarawagi. Explaining differences in multidimensional aggregates. In Proc. 1999 VLDB, pp. 42-53.

[16] A. Shukla, P. Deshpande and J. Naughton. Materialized view selection for multidimensional datasets. In Proc. 1998 VLDB, pp. 488-499.

[17] A. Srivastava, E. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithm. In The Journal of Data Mining and Knowledge Discovery, 3, 3, pp. 237-262, 1999.

[18] M. Tamura and M. Kitsuregawa. Dynamic Load Balancing for Parallel Association Rule Mining on Heterogeneous PC Cluster Systems. In Proc. 1999 VLDB, pp. 162-173.

[19] S. Momen-Pour. Parallel Computation of Data Cubes. M.Sc. Thesis, University of British Columbia, Computer Science Dept., 1999.

[20] M. Zaki. Parallel and distributed association mining: a survey. In IEEE Concurrency, 7, 4, pp. 14-25, 1999.

[21] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan and S. Sarawagi. On the Computation of Multidimensional Aggregates. In Proc. 1996 VLDB, pp. 506-521.

[22] W. Pugh. Skip Lists: A Probabilistic Alternative to Balanced Trees. In Communications of the ACM, 1990.
[23] E.-H. (Sam) Han, G. Karypis and V. Kumar. Scalable Parallel Data Mining for Association Rules. In Proc. 1997 ACM SIGMOD.

[24] R. T. Ng, L. V. S. Lakshmanan, J. Han and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. 1998 ACM SIGMOD, pp. 13-24.
