Parallel Computation of Data Cubes by Soroush Momen-Pour M.Sc. Sharif University of Technology, 1991, B.Sc. Tehran University, 1989 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF T H E REQUIREMENTS FOR T H E D E G R E E OF Master of Science in T H E FACULTY OF GRADUATE STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The University of British Columbia September 1999 © Soroush Momen-Pour, 1999 ln presenting degree freely at this the thesis in partial University of British available for copying of department publication this or of reference thesis by this for his thesis fulfilment scholarly her for I further purposes of representatives. financial dorter The University of British Columbia Vancouver, Canada Date DE-6 (2/88) ^ 3 ° > WW requirements gain It shall not & fence- that the agree may permission. Department the Columbia, I agree and study. or of be is that Library an granted by allowed advanced shall make permission for understood be for the that without it extensive head of my copying or my written Abstract Since its proposal, data cube has attracted a great deal of attention in both academic and industry research communities. Many research papers have been published about different issues related to data cubes and many commercial OLAP (On-Line Analytical Processing) systems have been released to market with data cube operations as their core functions. Several algorithms have been proposed to compute data cubes more efficiently. PIPESORT and PIPEHASH algorithms proposed by Agrawal et. al., OVERLAP algorithm proposed by Naughton et. al., Partitioned-Cube algorithm proposed by Ross et. al. and the Multi-way Array algorithm proposed by Naughton et. al. are the most significant ones. All of these algorithms are designed for implementation on sequential machines, however computing a data cube can be an expensive task. For some organizations it may take a very powerful computer working around the clock for a week to compute all the data cubes they may want to use. Application of parallel processing can speed up this process. Despite the popularity and importance of data cubes, very little research has been carried out on the parallel computation of them. The only parallel algorithm for computation of data cubes, which I am aware of, is the algorithm proposed by Goil et. al.. Their algorithm works for cases where the data ii set fits in main memory, however, real world data sets rarely fit in main memory. The wide spread availability of inexpensive cluster machines makes it possible to use parallel processing for computation of data cubes, even in small size firms and as a result there could be a real demand for efficient parallel data cube construction algorithms. I have designed and implemented two parallel data cube computation algorithms (Parallel Partitioned-Cube algorithm and Parallel Single-pass Multi-way Array algorithm) based on sequential algorithms proposed in literature. The former algorithm is classified as a ROLAP (Relational OLAP) algorithm and the second one is considered as a MOLAP (Multi-dimensional OLAP) algorithm. iii Contents Abstract ii Contents iv List of Tables vi List of F i g u r e s viii Acknowledgements x Dedication xi 1 Introduction 1 2 Background 5 3 2.1 Problems with GROUP BY operator 2.2 The Data Cube Operator 11 2.3 Different type of aggregate functions 13 Sequential A l g o r i t h m s . 6 15 3.1 Hierarchies 17 3.2 Selecting queries to precompute 18 iv 3.3 Computing data cubes 22 3.4 PipeSort Algorithm 25 3.4.1 3.5 3.6 3.7 Algorithm : 27 Partitioned-Cube Algorithm 31 3.5.1 34 Algorithm Memory-Cube . . . Array-Based Algorithm 38 3.6.1 41 Single-pass Multi-way Array Algorithm Performance comparison 44 4 Parallel Algorithms 49 4.1 Parallel Partitioned-Cube Algorithm 49 4.2 Parallel Single-pass Multi-way Array Algorithm 56 5 Conclusion 64 Bibliography 66 v List of Tables 2.1 Sales of a department store 6 2.2 Sales of a department store grouped by Item and Date 7 2.3 Sales of a department store at different granularities 7 2.4 Sales of a department store at different granularities in a relational format 2.5 9 Sales of a department store at a granularity level not included in Table 2.4 10 2.6 Sales of a department store in cross-tabulation format 11 2.7 Sales of a department store in data cube format 12 3.1 Benefits of materializing different views at each round 23 3.2 Computing a 4-dimensional (30*24*2*158) Cube 45 3.3 Computing a 5-dimensional (30*24*2*158*360) Cube 46 4.1 Computing a 4-dimensional (30*24*2*158) Cube on 2 nodes using method 3 4.2 55 Computing a 5-dimensional (30*24*2*158*360) Cube on 2 nodes using method 3 55 vi 4.3 Computing a 4-dimensional (30*24*2*158) Cube on 2 nodes using method 4 4.4 62 Computing a 5-dimensional (30*24*2*158*360) Cube on 2 nodes using method 4 62 vii List of Figures 2.1 The syntax proposed for data cube operator 11 2.2 The data cube expression for the data cube shown in Table 2.7 . . . 12 3.1 Data Cube Lattice Structure 17 3.2 Hierarchical GROUP BY attributes 19 3.3 Hierarchical lattice structure 20 3.4 The greedy algorithm for selecting a set of views to materialize . . . 21 3.5 A data cube lattice structure with associated costs . . 22 3.6 Pipesort Algorithm 28 3.7 A layer of the search lattice after transformation 3.8 The layer of the search lattice after running weighted bipartite match- 3.9 . . 29 ing algorithm 29 Search lattice with associated costs 30 3.10 The minimum cost sort plan 30 3.11 The pipelines that are executed 30 3.12 Algorithm Partitioned-Cube 32 3.13 An example of running Partitioned-Cube Algorithm 33 3.14 Algorithm Memory-Cube 35 viii 3.15 Algorithm Paths 36 3.16 A chunked array 40 3.17 Minimum Memory Spanning Tree 43 3.18 Computing a 4-dimensional (30*24*2*158) Cube 46 3.19 Computing a 4-dimensional (30*24*2*158*360) Cube 46 4.1 Computing a 5-dimensional (30*24*2*158*360) Cube on two nodes using three different methods 4.2 53 Computing a 5-dimensional (30*24*2*158*360) Cube on four nodes using three different methods 4.3 53 Computing a 4-dimensional (30*24*2*158) Cube on different number of nodes using method 3 4.4 54 Computing a 5-dimensional (30*24*2*158*360) Cube on different number of nodes using method 3 54 4.5 Hypercube structure 58 4.6 Computing a 5-dimensional (30*24*2*158*360) Cube on two nodes using four different methods 4.7 60 Computing a 5-dimensional (30*24*2*158*360) Cube on four nodes using four different methods 4.8 60 Computing a 4-dimensional (30*24*2*158) Cube on different number of nodes using method 4 4.9 61 Computing a 5-dimensional (30*24*2*158*360) Cube on different number of nodes using method 4 61 ix Acknowledgements I would like to express my gratitude to my supervisor, Dr. Alan Wagner, for helping me in carrying out this research project and for reading the manuscript of my thesis and offering his valuable comments. I also would like to thank Dr. George Tsiknis for reading the manuscript of my thesis and providing me with his helpful comments. SOROUSH M O M E N - P O U R The University of British Columbia September 1999 x To my dear wife Nooshin who always supported me. xi Chapter 1 Introduction Unlike an operational database that maintains the daily used information about activities of an enterprise a data warehouse contains historical data about its past. Data warehouses are normally huge because they accumulate information over a long period of time. On-Line Analytical Processing, OLAP, is used to extract useful summary information from data warehouses. This summary information can be used in decision support systems to assist managers of an enterprise in business related decision making process. OLAP [1] enables data analysts to ask numerous speculative what if and why questions within the context of some specific historical perspective. One operation that has been recently proposed is data cube operator [2] which computes A^-dimensional aggregates that can be used in OLAP applications. Since its proposal, data cube has attracted a great deal of attention in both academic and industry research communities. Many research papers have been published about different issues related to data cubes and many commercial OLAP (On-Line Analytical Processing) systems have been released to market with data cube operations as their core functions. 1 Several algorithms have been proposed to compute data cubes more efficiently. P I P E S O R T and P I P E H A S H algorithms proposed by Agrawal et. al. [3], O V E R L A P algorithm proposed by Naughton et. al. [4], Partitioned-Cube algorithm proposed by Ross et. al. [5] and the Multi-way Array algorithm proposed by Naughton et. al. [6] are the most significant ones. A l l of these algorithms proposed for computing data cubes are designed for implementation on sequential machines, however, computing a data cube can be an expensive task. For some organizations it may take a very powerful computer working around the clock for a week to compute all the data cubes they may want to use. Application of parallel processing can speed up this process and makes it possible to precompute more data cubes more often. Despite the popularity and importance of data cubes, very little research has been carried out on the parallel computation of them. The only parallel algorithm for computation of data cubes, which I am aware of, is the algorithm proposed by Goil et. al. [7]. Their algorithm works for cases where the data set fits in main memory, however, real world data sets rarely fit in main memory. The goals of this research are to design and implement efficient parallel algorithms for computation of data cubes when the input data does not fit in main memory and experimentally evaluate the use of cluster machines for computing data cubes. Cluster machines are workstations connected via high speed interconnection networks such as Myrinet and are widely used to implement parallel algorithms. These machines can reach the performance of massive parallel machines with a much lower cost. The wide spread availability of these inexpensive machines makes it possible to use parallel processing for computation of data cubes, even in small size firms and as a result there could be a real demand for efficient parallel data cube 2 construction algorithms. I have designed and implemented two parallel data cube computation algorithms (Parallel Partitioned-Cube algorithm and Parallel Singlepass Multi-way Array algorithm) based on sequential algorithms proposed in literature. The former algorithm is classified as a ROLAP (Relational OLAP) algorithm and the second one is considered as a MOLAP (Multi-dimensional OLAP) algorithm. Neither the Parallel Partitioned-Cube algorithm nor the Parallel Single-pass Multi-way Array algorithm requires storing of the entire input data in main memory. Other than the problem of computing data cubes there are many other interesting data cube related issues that have appeared in the literature. Ho et. al. , studied range queries in OLAP data cubes [8] and partial-sum queries in OLAP data cubes [9]. A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. A partial-sum query obtains the summation over a set of specified cells of a data cube. Ross et. al. studied complex aggregation at multiple granularities [10]. By complex aggregation they refer to queries involving multiple dependent aggregates at multiple granularities. Sarawagi et. al. studied discoverydriven exploration of OLAP data cubes [11]. They proposed a new discovery-driven exploration paradigm that could mine the data for exceptions. Han studied the integration of data mining and OLAP technologies [12]. Roussopoulos et. al. proposed cubetree as a storage abstraction of the data cube [13]. Structure of this thesis is as follows. Chapter 2 presents some background material that is useful to understand other chapters. Chapter 3 describes in detail 3 different sequential data cube computation algorithms and presents the results of some experiments performed with theses algorithms. Chapter 4 explains two parallel data cube computation algorithms that I have designed and implemented. 3 Chapter 5 is the conclusion. Chapter 2 Background In business and scientific applications, it is common to summarize data values using different aggregation functions. For example, consider the sales database of a department store. This store has different departments such as electronics, home appliances and others where each department carries different items and the sale transactions are all stored in a relational database. Table 2.1 shows partial sales of the department store on Boxing week. We can extract .different summaries from this table. One possible summary is the sale of each item on each day. This summary can be generated using the following SQL query: SELECT i t e m , date , Sum(sale) FROM s a l e s GROUP BY item , date; The GROUP B Y operator divides tuples into different groups based on their values of GROUP B Y attributes. Tuples with the same value on all GROUP B Y attributes fall into the same group. Table 2.2 shows the result of the above query. 5 Table 2.1: Sales of a department store Item Customer Date Sony 25" T V Sony 25" T V Sony 25" T V Sony 25" T V JVC 21" T V JVC 21" T V JVC 21" T V Panasonic Hi-Fi VCR Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator 98/12/26 98/12/26 98/12/27 98/12/27 98/12/26 98/12/26 98/12/27 98/12/26 98/12/27 98/12/26 98/12/27 98/12/26 98/12/26 98/12/27 Smith Robinson Peterson Reynolds Boyd Miller Hansen Lewis Anderson Patterson Kumar Shaw Watson Kennedy Sale 700 700 700 700 400 400 400 250 250 1400 1400 600 600 600 Other than Sum there are three other aggregate functions in standard SQL: Avg, Max and Min. Some database management systems support other domain specific functions, such as statistical functions (Median, Standard Deviation, physical functions (Center of Mass, Angular Momentum, •'functions(Volatility, Alpha, 2.1 Variance, etc.), etc.) andfinancialanalysis Beta, etc.) [2]. Problems with G R O U P B Y operator There are some aggregate queries that are difficult to represent in standard SQL: histograms, roll-up totals and sub-totals for drill-downs and cross-tabulations [2]. Histograms are generated by aggregating over computed categories. For example if we want to extract the sales summary from the department store database based on the monthly sales of the store we need to categorize dates into months and group tuples based on this computed attribute and compute the aggregate function 6 Tab e 2.2: Sales of a department store grouped by Item and Date Item Date Sale Sony 25" T V Sony 25" T V JVC 21" T V JVC 21" T V Panasonic Hi-Fi V C R Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator Date 98/12/26 98/12/27 98/12/26 98/12/27 98/12/26 98/12/27 98/12/26 98/12/27 98/12/26 98/12/27 1400 1400 800 400 250 250 1400 1400 1200 600 Table 2.3: Sales of a department store at different granularities Item Total Sale Sale by Date by Date Sale and Item 98/12/26 Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator 1400 800 250 1400 1200 5050 98/12/27 Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator 1400 400 250 1400 600 4050 9100 7 for each group. If function values were allowed in the G R O U P B Y list the above query could be written as follows: SELECT item , Month(date) , Sum(sale) FROM s a l e s GROUP BY item ', Month ( d a t e ) ; The above S E L E C T query is not allowed in standard S Q L but it can be translated into another standard S Q L query: SELECT item , month, Sum(sale) FROM (SELECT item , Month(date) AS month , s a l e FROM s a l e s ) GROUP BY item , month; Sometimes we want to generate a report containing total and sub-totals at different granularities. For example, we may want to know the sales by date and item, sales by date, and total sale of the store. Table 2.3 shows the required information. The sale figures in 3rd column represent total sales at finest granularity, the 4th column contains totals at next coarser granularity while the 5th column includes total sales at the coarsest granularity. Moving from a granularity level to a finer granularity level is called drilling-down and moving from one granularity level to a coarser granularity level is called rolling-up. Table 2.3 is not relational, because it contains null values in its primary key attributes. Moreover, the number of columns is large since it is the power set of the aggregation attributes. This table can be represented in a relational format where the number of columns grows linearly, rather 8 Table 2.4: Sal es of a department store at different granularities in a relational format Item Date Sale Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator ALL Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator ALL ALL 98/12/26 98/12/26 98/12/26 98/12/26 98/12/26 98/12/26 98/12/27 98/12/27 98/12/27 98/12/27 98/12/27 98/12/27 ALL 1400 800 250 1400 1200 5050 1400 400 250 1400 600 4050 9100 than exponentially, with the number of aggregation attributes. Table 2.4 shows the same data in a relational format. This table can be generated by the following SQL statement: SELECT item , date , Sum(sale) FROM s a l e s GROUP BY item , date UNION SELECT ALL , date , Sum(sale) FROM s a l e s GROUP BY date UNION SELECT ALL , ALL , Sum(sale) FROM s a l e s ; 9 Table 2.5: Sales of a department store at a granularity level not included in Table 2.4 Item Date Sale Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator ALL ALL ALL ALL ALL 2800 1200 500 2800 1800 Table 2.4 is not symmetric. It does not contain the rows shown in Table 2.5. Theses rows can be added by appending the following clause to the above SQL statement: UNION SELECT item , ALL , Sum(sale) FROM s a l e s GROUP BY item; The symmetric aggregation table is called a cross-tabulation and can be represented as Table 2.6. Histogram, roll-up, drill-down and cross-tab queries all can be formulated in standard SQL but their representation is inconvenient. Writing a 5 dimensional cross-tab query in standard SQL requires us to union 32 different GROUP B Y queries where each GROUP BY query scans the table once and performs either a sort or a hash operation on it. Therefore a multi-dimensional cross-tab query is computationally intractable for large number of dimensions. 10 Table 2.6: Sales of a department store in cross-tabulation format Department Store Sales 98/12/26 98/12/27 Total Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator total(ALL) 1400 800 250 1400 1200 5050 1400 400 250 1400 600 4050 2800 1200 500 2800 1800 9100 GROUP BY ( { ( <column name> | <expression>) [AS C o r r e l a t i o n name>] [<collate clause>] ,...} [WITH (CUBE | R0LLUP)] ) Figure 2.1: The syntax proposed for data cube operator 2.2 The Data Cube Operator Gray et. al. [2] have proposed particular solutions for GROUP BY operator problems explained in section 2.1. They introduced Data Cube operator for generating crosstabulations . This operator is a generalization of GROUP B Y operator. An Ndimensional data cube consists of 2^ cuboids where each cuboid is the result of one of the 2 N possible GROUP BY queries that can be written with N attributes. Table 2.7 shows the result of Data Cube operator when it is applied to the department store database. The syntax proposed for data cube operator in [2] is shown in Figure 2.1. Figure 2.2 shows the data cube expression that can be written for Table 2.7 using this syntax. 11 Table 2.7: Sales of a department store in data cube format Sale Item Date Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator ALL Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator ALL Sony 25" T V JVC 21" T V Panasonic Hi-Fi VCR MayTag 29 cubic foot refrigerator Whirlpool 22 cubic foot refrigerator ALL 98/12/26 98/12/26 98/12/26 98/12/26 98/12/26 98/12/26 98/12/27 98/12/27 98/12/27 98/12/27 98/12/27 98/12/27 ALL ALL ALL ALL ALL ALL 1400 800 250 1400 1200 5050 1400 400 250 1400 600 4050 2800 1200 500 2800 1800 9100 GROUP BY Item , Date WITH CUBE Figure 2.2: The data cube expression for the data cube shown in Table 2.7 12 If the WITH clause is used with CUBE then all the cuboids of the data cube will be computed but if it is used with ROLLUP then only the following cuboids are generated: (flj2,-Jn-l,ALL), (fi,f ,-Jn-2,ALL,ALL), 2 /2, ...J -3,ALL, ALL, n ( / i , ALL,..., ALL), ALL), (ALL, ALL,..., ALL). The AS clause is used to support histograms. 2.3 Different type of aggregate functions Gray et. al. classified aggregate functions into three categories: Distributive, Algebraic and Holistic [2]. In order to define each category consider aggregating a two dimensional set of values {Xij\i = = 1,...,J}. Distributive: Aggregate function F() is distributive if there is a function G() such that F({X }) itj = G({F{{Xij\i = = 1,J}). Algebraic: Aggregate function F() is algebraic if there is an M-tuple valued function G() and a function H () such that F({X }) id = H({G({Xij\i = = 1,-,</})• Holistic: Aggregate function F() is holistic if there is no constant M, such that an M-tuple characterizes the computation of F({Xij\i COUNT{), MIN{), MAXQ and SUM() all of these functions except the COUNTQ, tion, G equals SUM(). = 1,...,/}). are all distributive functions. For F equals G. For the COUNTQ func- Average and Standard Deviation are examples of algebraic 13 functions, where for Average, the function C7() returns the s u m a n d count of the subset a n d H() adds the two quantities returned by G a n d then divides to compute the global average. Median() and ModeQ are two examples of holistic functions where it is not possible to compute t h e m without scanning the entire data. T h e r e are m a n y algorithms for efficient c o m p u t a t i o n of a d a t a cube w i t h a distributive or algebraic function as aggregating function. rithms are described i n chapter 3. Some of these algo- I a m not aware of any a l g o r i t h m for efficient c o m p u t a t i o n of d a t a cubes w i t h holistic aggregating functions, although we can always compute these data cubes by c o m p u t i n g each of their u n d e r l y i n g G R O U P B Y s independently. 14 Chapter 3 Sequential Algorithms OLAP queries are used to extract useful summary information from data warehouses. These queries usually include aggregation over several attributes of a table and are more complicated than OLTP (On-Line Transaction Processing) queries. Since data warehouses are larger than operational databases and OLAP queries are more complex than OLTP queries, it is important to use efficient algorithms to answer to theses queries. One way to achieve satisfactory performance is to precompute common queries. Common queries can be precomputed when the system is not being used and updated on a regular basis. Due to disk storage limitations it may not be possible to precompute all of the common queries. In this case we can precompute only a subset of them and compute the remaining queries upon request. Computing these queries can be speeded up by taking advantage of precomputed queries. For example, consider the sales database given in the previous chapter. In order to precompute all the GROUP BY queries resulted from a three dimensional cube operation it is necessary to compute the following GROUP BY queries: 15 1. GROUP BY Store, Rem, Date 2. GROUP BY Store, Item 3. GROUP BY Store, Date 4. GROUP BY Store 5. GROUP BY Item, Date 6. GROUP BY Item 7. GROUP BY Date .8. None There is a lattice structure described first in [14] among these queries. Figure 3.1 shows this structure where each node represents a GROUP BY query or in other words a view of the data cube. An edge from a node to another node with higher number of attributes implies the result of the former GROUP computed from the latter one. For example, GROUP BY query can be BY Item can be computed from both GROUP BY Store, Item and GROUP BY Item, Date. We may decide to materialize only a subset of all the views. For example, we may decide to precompute only GROUP BY Store, Item, Date (the core GROUP BY or the top view) and compute all the other GROUP BY queries by using the precomputed one. In this case, if the user asks for GROUP BY Store, Item, it is necessary to aggregate all the rows of the GROUP BY Store, Item, Date with the same value for Store and item attributes over Date attribute. 16 SID SI SD F i g u r e 3.1: 3.1 ID D a t a C u b e Lattice Structure Hierarchies I n m a n y cases a GROUP BY a t t r i b u t e has some h i e r a r c h i c a l structure. F o r e x a m p l e , the date a t t r i b u t e has the hierarchy: Day > Month > Year W e c a n also consider another h i e r a r c h y for date: Day > Week E x i s t e n c e o f s u c h h i e r a r c h i e s is a n i m p o r t a n t i s s u e t h a t s h o u l d b e t a k e n i n t o account w h e n c o m p u t i n g the d a t a cube. A s previously mentioned, two c o m m o n ope r a t i o n s i n O L A P are d r i l l - d o w n a n d r o l l - u p . W e have a l r e a d y seen a n a p p l i c a t i o n o f t h e s e o p e r a t i o n s w h e n we c o m p u t e a g g r e g a t e d d a t a a t d i f f e r e n t g r a n u l a r i t i e s b y choosing a different n u m b e r of GROUP BY 17 a t t r i b u t e s . I n t h i s case, a n e x a m p l e o f a drill-down operation is to start from GROUP BY store and move to a finer granularity level, e.g. GROUP BY store, date, and finally move to the finest granularity level, i.e. GROUP BY store, date, item: GROUP BY store > GROUP BY s t o r e , date > GROUP BY s t o r e , date, i t Another application of these operations is conceivable when we compute aggregated data at different granularities by choosing a different level of hierarchies for each of the GROUP BY attributes. In this case, an example of a drill-down operation is to start from GROUP BY date at the yearly level and move to a finer granularity level, e.g. GROUP BY date at the monthly level and finally moving to the finest granularity level, i.e. GROUP BY date at the daily level: GROUP BY year —-> GROUP BY month —-> GROUP BY day The lattice structure is more complex in the presence of hierarchies. For example, consider attribute hierarchies shown in Figure 3.2. Figure 3.3 shows the data cube lattice structure in the presence of these hierarchies. 3.2 Selecting queries to precompute As previously mentioned, one way to achieve satisfactory performance is to precompute common queries or in other words to materialize the views of the data cube associated with them, but, due to disk storage limitations, it may not be possible to precompute all the common queries. In this case, we should decide which queries to precompute or, in other words, which views to materialize. If we do not want to refer to the raw data we always have to materialize the top view because it can not be constructed by using the other materialized views. 18 item month week category \ none none Figure 3.2: Hierarchical GROUP BY attributes Harinarayan et. al. proposed a greedy algorithm in [14] for this problem. They assumed that 1. We wish to minimize the average time to materialize a view. 2. We can only materialize a fixed number of views. They claimed that this problem was NP-complete and therefore suggested a heuristic solution for it. The input to their algorithm is a data cube lattice with space costs associated with each view. Furthermore they assume that the cost in terms of space is the number of the rows in the view. Let C(v) denote the cost of view v, and k represent the number of views in addition to the top view that we want to materialize. The benefit of view v relative to some set S of views that has already been selected, is denoted by B(v, S) and is defined as follows: 19 F i g u r e 3.3: Hierarchical lattice structure 20 S = {top for i view}; = 1 t o k do b e g i n select a view v not S = S union i n S such t h a t B(v , S) i s maximized; {v}; end; r e t u r n S; F i g u r e 3.4: T h e greedy algorithm for selecting a set of views to materialize 1. For each view w which can be constructed from materialized view v(w < v) , define B w as: (a) L e t u denote the view w i t h least cost i n S such that w < u. Since the top view is i n S, there must be at least one such view i n S. (b) If C(v) 2. Define < C(u), B ( v , S ) = then B J2 <v w B w w = C{u) - C(v), otherwise B w = 0. . In other words the benefit of selecting a view v, is equal to the s u m of all the benefits that selecting v makes for each of the views w that can be constructed from v. T h i s i n d i v i d u a l benefit, i n t u r n , is measured by comparing the cost of v w i t h the cost of the cheapest view u, i n the set of all so far selected views S that w can be constructed from. If cost of v is less t h a n cost of u then selecting v makes some benefit w i t h respect to w, otherwise it makes no benefit. T h e greedy a l g o r i t h m proposed i n [14] is shown i n F i g u r e 3.4. T h e a l g o r i t h m starts w i t h the top view of the lattice a n d then it iterates a loop k times (k is number of the views, other t h a n the top view, to materialize). In each iteration of the loop the a l g o r i t h m selects a view v which has not been added to S before and materializing it would cause the m a x i m u m benefit to the views that can be 21 SID <5K SI2M S 100 VZ55 I0.05M Figure 3.5: A data cube lattice structure with associated costs constructed from v, including v itself. Table 3.1 shows a trace of running this algorithm on the lattice of Figure 3.5 with k = 2. In the first round SD is selected to be materialized, because the benefit of materializing it is the highest among all the candidates. In the second and third rounds, SI and / are selected respectively, so among all the views we only materialize SID (the top view), SD, SI and I. Harinarayan et. al. showed in [14] that the benefit of the greedy algorithm was at least 63% of the benefit of the optimal algorithm where the benefit of each algorithm was defined as the total benefit of selecting the views that were selected by the algorithm. The precise fraction is (e — l)/e where e is the base of the natural logarithms. 3.3 Computing data cubes There are many algorithms proposed for computing the data cubes. These algorithms fall into two broad categories: algorithms suitable for ROLAP (Relational On-Line Analytical Processing) systems and algorithms suitable for MOLAP (Mul- 22 Table 3.1: Benefits of materializing different views at each round First Round Second Round Third Round SI SD ID S D I None 4 x (6M - 2M) = 16.00M 4 x (6M - 36500) « 23.85M 4 x (6M - 5M) = 4.00M 2 x (6M - 100) « 11.99M 2 x (6M - 365) « 11.99M 2 x (6M - 0.05M) = 11.90M lx(6M-l) « 5.99M 2 x (6M - 2M) = 8.00M 2 x (6M - 5M) = 2.00M 2 x (36500 - 100) ss 0.07M 2 x (36500 - 365) « 0.07M 1 x (6M - 0.05M) = 5.95M 1 x (36500 - 1) « 0.03M 1 x (6M - 5M) = 1.00M 2 x (36500 - 100) « 0.07M 2 x (36500 - 365) « 0.07M 1 x (2M - 0.05M) = 1.95M 1 x (36500 - 1) ^0.03 tidimensional On-Line Analytical Processing) systems. ROLAP systems use relational tables as their data structure. In these systems each view of the data cube is represented by a table and each cell is stored as a.tuple. Some of the attributes of these tuples identify its position in the cube and the others contain data values associated with it. MOLAP systems store data in sparse arrays. Each cell of the cube is represented by an element of the array. Unlike ROLAP systems only the data values are stored. The position of a cell determines the values of its dimensional attributes. In the remaining sections of this chapter two ROLAP algorithms and one MOLAP algorithm will be studied. Sarawagi et. al. summarized in [3] some of the optimization techniques that ROLAP algorithms try to take advantage of: 1. Smallest-parent: This optimization technique was first proposed in [2]. According to this technique we should compute a GROUP BY from its smallest 23 parent. F o r example GROUP lowing GROUP BY Store c a n be computed from any the fol- BY queries: (a) GROUP BY Store, Item, Date (b) GROUP BY Store, Item (c) GROUP BY Store, Date O b v i o u s l y the first GROUP BY is larger t h a n the other two, therefore accord- ing to smallest-parent rule it should not be used . T h e size of the second a n d the t h i r d GROUP BY queries might be significantly different so the smaller of t h e m should be used to compute GROUP BY Store query. 2. Cache-results: T h i s o p t i m i z a t i o n technique requires us to compute a BY query while its parent is still i n memory. GROUP For example, i f GROUP Store, Item has been computed we should compute GROUP BY BY Store from it while it is still i n memory. 3. Amortize scans: T h i s technique tries to amortize disk scans by c o m p u t i n g as m a n y GROUP GROUP BY queries as possible, together i n memory. F o r example, if BY Store, Item has been computed a n d stored o n disk, we should compute GROUP GROUP BY Store a n d GROUP BY Item together i n one scan of the BY Store, Item. 4. Share-sorts: T h i s technique is used i n sort-based algorithms a n d is about sharing sorting costs across multiple GROUP BY queries. 5. Share-partitions: T h i s technique is used i n hash-based algorithms. W h e n the hash table does not fit i n m e m o r y d a t a is partitioned into partitions that fit 24 in memory and aggregation is performed for each partition separately. This technique attempts to share the partitioning cost across multiple GROUP BY queries. It is not always possible to use all of these optimization techniques simultaneously because they might be contradictory. For example, if GROUP BY Store, Date is smaller than GROUP BY Store, Rem, according to Smallest-parent rule we should compute GROUP BY Store from the first GROUP BY, but according to Cache-results rule we should compute it from the second GROUP BY if the second GROUP BY is still in memory. 3.4 PipeSort Algorithm PipeSort algorithm was proposed by Sarawagi et. al. in [3]. This algorithm uses share-sorts technique by using data sorted in a particular order to compute all the GROUP BY queries that are prefixes of that order. For example if the raw data is sorted in attribute order Store, Item, Date then all the following GROUP BY queries will be computed from it without an additional sort: 1. GROUP BY Store, Item, Date 2. GROUP BY Store, Item 3. GROUP BY Store This optimization can conflict with smallest-parent technique. For example, if the smallest parent of GROUP BY Store is GROUP BY Store, Date, according to smallest-parent technique we should compute it from this GROUP BY but according to share-sorts technique we should compute it from GROUP BY Store, Item if the 25 raw data is already sorted in Store, Item, Date attribute order. PipeSort uses a combination of share-sorts and smallest-parent optimizations. PipeSort also incorporates cache-results and amortize-scan optimizations by computing different GROUP BY queries in a pipelined fashion. For example, if the raw data is already sorted in attribute order Store, Item, Date then all the GROUP B Y queries that are prefixes of this order are computed in a pipeline: Raw Data —> GROUP BY S t o r e , Item, Date —> GROUP BY S t o r e , Item —> GROUP BY Store By scanning the raw data and aggregating rows with the same value for all the dimension attributes, rows of GROUP BY Store, Item, Date are computed. When a row of GROUP BY Store, Item, Date is ready then it is propagated up to compute GROUP B Y Store, Item and when a row of this GROUP BY is ready it is propagated up to compute GROUP BY Store. PipeSort algorithm needs an estimate of the number of unique values of each attribute. The input to the PipeSort is the search lattice structure proposed in [14]. Each node of the lattice represents a GROUP BY query. A directed edge connects node i to node j if the GROUP BY j can be computed from GROUP BY i and GROUP BY j has exactly one attribute less than GROUP BY i. GROUP BY i is called the parent of GROUP BY j. Each node may have several parents. The node at level 0 (the topmost level) of the lattice corresponds to the GROUP BY with no attribute and each node at level k represents a GROUP BY with k attributes. There are two costs associated with each edge e^-. The first cost S(eij) is the cost of computing GROUP BY j from GROUP BY i when i is not already sorted and the second cost A(eij) is the cost of computing GROUP BY j from GROUP BY i when i is already sorted. 26 The output of PipeSort algorithm is a subgraph of the search lattice where each node other than the root node has exactly one parent. The parent of each GROUP BY determines the GROUP BY that it is computed from. The root GROUP BY {GROUP BY All Attributes) is computed from the raw data. There is an attribute order associated with each node of the output subgraph. This attribute order determines the sort order in which the GROUP BY result will be sorted. If the attribute order of GROUP BY j is a prefix of the attribute order of its parent GROUP BY i then GROUP BY j can be computed without having to sort GROUP BY i, but if it is not a prefix of the attribute order of GROUP BY i then in order to compute GROUP BY GROUP BY i has to be sorted first. The edge that connects GROUP BY i to GROUP BY j where the sort order of j is a prefix of the sort order of i, is labeled as A and incurs cost A(eij). The edge that connects GROUP BYi to GROUP BY j where the sort order of j is not a prefix of the sort order of i, is labeled S and incurs cost S(eij). Obviously there is at most one edge labeled A originating from node i, because only one of the children of node i can be associated with a prefix order of node i, but there might be several edges labeled as S originating from node i. The goal of the algorithm is to find a subgraph of the input lattice structure that has the minimum total cost of edges. 3.4.1 Algorithm PipeSort algorithm is shown in Figure 3.6. The algorithm prunes the search lattice level by level starting from level k = 0 to level k = N — 1, Where N is total number of attributes. At each level k, it finds the optimum way of computing nodes at level k from nodes at level k + l. In order to do that PipeSort reduces the problem to a weighted bipartite matching problem. In order to solve the weighted bipartite 27 P i p e S o r t : (Input search l a t t i c e w i t h A and S edges c o s t s ) For l e v e l k = 0 t o N - 1 / * N i s the t o t a l number of a t t r i b u t e s * / Generate-Plan(k+1 —> k ) ; For each group-by g i n l e v e l k+l F i x the s o r t order of g as the order of the l e v e l k group-by t h a t i s connected t o g by an A edge; Generate-Plan(k+1—>k) Create k a d d i t i o n a l copies of each l e v e l k+l node; Connect each copy v e r t e x t o the same set of l e v e l k v e r t i c e s as the o r i g i n a l v e r t e x ; A s s i g n c o s t s A() t o edges from the o r i g i n a l v e r t e x and c o s t s S O t o edges from the copy v e r t e x ; F i n d the minimum cost matching on the transformed l e v e l k+l with l e v e l k; F i g u r e 3.6: Pipesort A l g o r i t h m matching problem, first, k additional copies of each node at level k + l are created, therefore each GROUP BY at level k + l which has k + l attributes will have the same number of nodes. F r o m each replicated node the same set of edges originate as from the original node. T h e edge emanating from the original node i at level k + l to node j at level k is labeled A a n d incurs cost A ( e ^ ) a n d the edge originating from each replicated node of i at level incurs cost 5(e^). + 1 to node j at level k is labeled S a n d W e then apply weighted-bipartite matching algorithm to the transformed subgraph. After p r u n i n g some of the edges that connect nodes at level k + 1 to nodes at level k, each node at level k is only connected to one node at level k + l t h r o u g h either an A edge or an S edge. In the first case (an A edge), the attribute order of the node at level k determines the order of the attributes of the node at level k + l a n d no sort is necessary. In the second case (an S edge), node at level k + 1 is resorted i n order to compute the node at level k. 28 the Figure 3.7: A layer of the search lattice after transformation S I SI 2 SI 7 ^ D •T SD 4 SD 13 ID 25 ID 34 Figure 3.8: The layer of the search lattice after running weighted bipartite matching algorithm Figure 3.7 shows a layer of the search lattice after transformation and Figure 3.8 shows the same layer of the search lattice after execution of the weighted bipartite matching algorithm. Figures 3.9, 3.10 and 3.11 respectively show the original search lattice, the minimum cost sort plan produced by PipeSort algorithm and the final set of pipes used to compute the data cube. 29 Alll S 2,2 SD 48,268 D 24,110 130,147 ID 720,6834 SI60354 SID 1440,15108 figure 3.9: Search lattice w i t h associated costs SD 31 SDI 4 F i g u r e 3.10: T h e m i n i m u m cost sort p l a n All ± 1 A I I ! SI ID I t ID SDI SDI SDI S SD 4i D A 1 i Raw Data Figure 3.11: T h e pipelines that are executed 30 3.5 Partitioned-Cube Algorithm Partitioned-Cube algorithm was proposed by Ross et. al. in [10]. This algorithm is based on two fundamental ideas that have already been used for performing operations such as sorting and joining very large tables: 1. Partition the large table into fragments that fit in memory. 2. Perform the operation over each fragment independently. Algorithm Partitioned-Cube is described in Figure 3.12. This algorithm depends on another algorithm called Memory-Cube to compute the data cube when the input relation fits in memory. Memory-Cube algorithm will be explained in section 3.5.1. The input of Partitioned-Cube algorithm is a set of tuples R, which might be stored in horizontal fragments, the dimension attributes {B\, ...,B }, m ag- gregation attribute A and the aggregate function G(.). The output of the algorithm is the data cube result for R over {B\, ...,B }. m The output is returned in two frag- ments F and D. Fragment F contains the GROUP BY tuples at finest granularity level and D contains the remaining tuples. Partitioned-Cube algorithm chooses an attribute Bj among dimension attributes {Bi,...,B } m of the data cube. This attribute is used to partition input relation R into sets of tuples {Ri,R }. The number of these fragments, n, is n bounded by both number of available buffers in memory and the domain cardinality of the attribute Bj. After the data is partitioned the algorithm Partitioned-Cube is applied on each fragment separately. The union of F^s returned from applying algorithm on each individual fragment results in F, the finest granularity GROUP BY tuples of relation R. Algorithm Partitioned-Cube is then applied on F using all the dimension attributes except Bj, the attribute that was used to partition relation 31 A l g o r i t h m P a r t i t i o n e d - C u b e ( R , { B l , . . . , Bm} , A , G) INPUTS: A set of t u p l e s R, p o s s i b l y s t o r e d i n h o r i z o n t a l fragments; CUBE BY a t t r i b u t e s { B l . , . . . , Bm}; a t t r i b u t e A t o be aggregated; aggregate f u n c t i o n G ( . ) . OUTPUTS: The data cube r e s u l t f o r R over { B l , . . . , Bm} i n two h o r i z o n t a l fragments F and D on d i s k . F c o n t a i n s the f i n e s t g r a n u l a r i t y data cube t u p l e s ( i . e . , grouping by a l l of { B l , . . . , Bm}), and D c o n t a i n s the remaining t u p l e s . (F and D may themselves be further h o r i z o n t a l l y partitioned.) METHOD: 1) i f (R f i t s i n memory) 2) then r e t u r n Memory-Cube(R , { B l , . . . , Bm} , A , G ) ; 3) e l s e {choose an a t t r i b u t e Bj among { B l , . . . , Bm}; 4) scan R and p a r t i t i o n on Bj i n t o s e t s of t u p l e s R l , . . . , Rn; 5) / * n <= c a r d ( B j ) and n <= number of b u f f e r s i n memory * / 6) for i = 1 . . . n 7) l e t ( F i , D i ) = P a r t i t i o n e d - C u b e ( R i , { B l , . . . , Bm} , A ; G) 8) l e t F = the union of the F i ' s ; 9) l e t ( F ' , D ' ) = P a r t i t i o n e d - C u b e ( F , { B l , . . . , B j - 1 , Bj+1 , 10) . . . , Bm} , A , G ) ; 11) l e t D = the u n i o n of F ' , D' and the D i ' s ; 12) r e t u r n (F , D ) ; } Figure 3.12: Algorithm Partitioned-Cube 32 \ SI \ I t ID D SD SID *• * 4 S ID (Partitioned by I, S projected oat) J R Partitioned by S) ID (In memory, I projected out) Figure 3.13: An example of running Partitioned-Cube Algorithm i?, where the result will be stored in F' and D'. The union of F', D' and ZVs results in D, the remaining tuples of the final output. There is a problem with line 11 of the description of the algorithm in [10] which is shown in Figure 3.12. We should exclude part of the £ V s from the union of F', D' and D^s before we assign it to D. This line should be: l e t D = the union of F ' , D' and ( ( u n i o n of the D i ' s ) minus t u p l e s t h a t do not i n c l u d e a t t r i b u t e B j ) ; The reason for this modification is that those tuples of ZVs that do not include Bj are being incorporated in the final result through F' and D'. We should not incorporate them into the final result twice. Figure 3.13 shows an example of running Partitioned-Cube algorithm. First the input relation, which does not fit in memory, is partitioned based on attribute S, and all the GROUP BYs that include attribute S are computed. Then the result of GROUP BY S , I, D is projected on attributes / and D and then it is partitioned based on attribute I and all the GROUP BYs that contain I but do not contain S are computed. Finally the result of GROUP 33 BY I, D, which fits in memory, is projected on attribute D and all the remaining GROUP BYs are computed. Dashed lines indicate that sorting is required before computing the GROUP BY and solid lines indicate that sorting is not required because the target GROUP BY is a prefix of the source GROUP 3.5.1 BY. Algorithm Memory-Cube The Memory-Cube algorithm was proposed by Ross et. al. in [10]. The only requirement for applicability of this algorithm is that the input relation fits in memory. This algorithm takes advantage of the Pipelining technique that is used in PipeSort algorithm. Unlike the PipeSort which tries to minimize the total cost of computing all the GROUP BYs, Memory-Cube tries to minimize the number of pipelines and hence the number of the sort operations that need to be performed in order to compute all the GROUP BYs. Algorithm Memory-Cube is described in Figure 3.14. The sort orders mentioned in algorithm Memory-Cube are sort orders of the pipeline heads that need to be executed. These pipelines are generated by another algorithm called Paths shown in Figure 3.15. The number of paths for k attributes that Paths algorithm generates is and this is the minimum number of paths that cover all the nodes in the search lattice. The following is the result of running algorithm Paths on attributes {A,B,C,D}: G ( l ) = D —->• NULL G(2) = C D > C—-> NULL D 34 A l g o r i t h m Memory-Cube(R , { B l , . . . , Bm} , A , G) INPUTS: A set of t u p l e s R, t h a t f i t s i n memory; CUBE BY a t t r i b u t e s { B l , . . . , Bm}; a t t r i b u t e A t o be aggregated; aggregate f u n c t i o n G ( . ) . OUTPUTS: The data cube r e s u l t f o r R over { B l , . . . , Bm} i n two h o r i z o n t a l fragments F and D on d i s k . F c o n t a i n s the f i n e s t g r a n u l a r i t y data cube t u p l e s ( i . e . , grouping by a l l of { B l , . . . , Bm}), and D c o n t a i n s the remaining t u p l e s . METHOD: S o r t R and combine a l l t u p l e s t h a t share a l l values of { B l , . . . , Bm}; / * Assume t h a t t u p l e s are s o r t e d a c c o r d i n g t o f i r s t s o r t order * / f o r each s o r t order { i n i t i a l i z e accumulators f o r computing aggregates at each granularity; combine f i r s t t u p l e i n t o f i n e s t g r a n u l a r i t y accumulator; f o r each subsequent t u p l e t { compare t w i t h p r e v i o u s t u p l e , t o f i n d the p o s i t i o n j of the f i r s t s o r t order a t t r i b u t e at which they d i f f e r ; i f (j i s g r e a t e r than the number of common a t t r i b u t e s between t h i s s o r t order and the next) then { r e s o r t the segment from p r e v i o u s t u p l e t ' at which t h i s c o n d i t i o n was s a t i s f i e d up t o the t u p l e p r i o r t o t a c c o r d i n g t o the next s o r t o r d e r ; } i f (grouping a t t r i b u t e s of t d i f f e r from those i n f i n e s t g r a n u l a r i t y accumulator) then { output and then combine each accumulator i n t o coarser g r a n u l a r i t y accumulator, u n t i l the grouping a t t r i b u t e s of accumulator match w i t h those of t ; / * the number of combinings depends on the s o r t order l e n g t h and on j */} combine current t u p l e w i t h the f i n e s t g r a n u l a r i t y accumulator;}} Figure 3.14: Algorithm Memory-Cube 35 Algorithm INPUTS: Paths({BI Bj}) CUBE BY a t t r i b u t e s {BI OUTPUTS: A minimal set that METHOD: if (j covers a l l the = 0) attribute else { let , ... G(j) of paths , Bj}; i n the search lattice nodes. then return a single node w i t h a n empty list; G(j let of 1) = Paths({Bl Gl(j G(j - the Gl(j 1) for in , ... and G r ( j - 1) Bj-1}); two r e p l i c a s attribute l i s t of e a c h node o f with B j ; e a c h p a t h N l —> Gr(j - 1) ... —> Np { remove node Np a n d t h e (if , denote 1); prefix - 1) any) f r o m G r ( j - add node Np t o Gl(j - edge i n t o Np 1); 1); add an edge f r o m node B j . N p t o node Np in Gl(j - 1);} r e t u r n the u n i o n of the r e s u l t i n g G l ( j and G r ( j - 1);} Figure 3.15: Algorithm Paths 36 1) G(3) = B . C . D — > B , C - - - > B — > NULL B.D — > D CD — > C G(4) = A . B . C . D — > A . B . C — > A.B — > A —>NULL A.B.D > A.D > D A. C D — > A.C — > C B. C.D — > B . C — > B B.D CD The attributes of the paths i n G(4) will be reordered to follow the prefix property: G(4) = A . B . C . D — > A . B . C — > A.B — > A — > NULL D.A.B — > D.A — > D CA.D — > C A —> C B.C.D — > B.C — > B B.D CD After reordering the paths, they finally will be sorted to be able to take advantage of c o m m o n sort orders: G(4) = A . B . C . D — > A . B . C — > A.B — > A — >NULL B.C.D — > B . C — > B B.D CA.D > CA > C 37 CD D . A . B — > D.A — > D It is now possible to use partial sorts that already exists. For example B.C.D and B.D have a prefix in common, i.e. B. When tuples that are already sorted based on B.C.D, are being scanned, each time a tuple with a new B value is encountered, all the tuples with the previous value of B will be sorted based on D attribute. In this way rather than re-sorting all the tuples based on B and D, groups of tuples that share the same value of B will be sorted based on D. 3.6 Array-Based Algorithm Unlike ROLAP systems MOLAP systems store data in multidimensional arrays. Each cell of the cube is represented by an element of the array. Unlike ROLAP systems where both the measure attributes and the dimensional attributes are stored, in MOLAP systems only the measure attributes are stored and the position of the cell determines the values of the dimensional attributes. There are not as many algorithms for computing data cube in MOLAP systems. One of them is the array-based algorithm proposed by Naughton et. al. in [6]. This algorithm uses the chunked array data structure to store the data cube. An n-dimensional chunked array is an array divided into small pieces called chunks. Each chunk is an n-dimensional array by itself and is stored on a single block of the disk. The array-based algorithm compresses each chunk if less than 40% of its cells are valid. The compression method that the array-based algorithm uses is called chunk-offset compression. In this compression method each valid element of a chunk is stored as a pair, (offset in chunk , data). Offset in chunk is an integer representing the offset of the element from the 38 beginning of the chunk in some standard order, such as row major order. To see how chunking can affect the way we compute GROUP BYs, lets consider a three dimensional array with dimensions A, B and C. In order to compute GROUP BY A , B we project all the cells of the array onto the AB plane. In the absence of chunks, if the array is stored in A, B, C order on disk, we have to keep the entire AB plane in memory, because we traverse the array in the same order and therefore the value of each AB element is determined only when wefinishtraversing the array. On the other hand, with chunks we only need to keep a sub-plane of AB corresponding to a single chunk, because assuming that we have direct access to each chunk we can traverse the array in such a way that we aggregate all the ABC chunks that project on the same AB chunk together. If we want to compute all the GROUP BYs rather than only one, obviously we can compute all of them separately from GROUP BY A , B , C, but this is not necessary. Having computed GROUP BY A , B, we can compute GROUP BY A or GROUP BY B far more efficiently from GROUP BY A , B than from GROUP BY A , B , C. We can use the same lattice structure proposed by [14] and embed a tree in this lattice and compute each GROUP BY from its parent in the tree. Now the question is which tree we should choose. In order to minimize memory requirement of computing the data cube, a spanning tree of the lattice structure can be used that is called minimum size spanning tree. In this tree the parent of each node n is the minimum size node n' from which n can be computed. We can define minimum size spanning tree because unlike in ROLAP systems where the size of each node of the lattice structure is not known in advance, in MOLAP systems the dimension sizes of the array and its chunk sizes are known. Therefore the exact size of each lattice node can be calculated. 39 c2 25 18 cO b2 bl 7 8 9 4 5 6 1 2 3 r 15 12 i WKm bo aD al a2 Figure 3.16: A chunked array Before explaining the array-based algorithm, lets review a basic version of it first. In the beginning we construct the minimum GROUP BYs of the cube. Each GROUP parent GROUP BYDi Di ...Di x BY Di Di ...Di x 2 k+l of Di Di ...Di . 1 2 k 2 k+l size spanning tree for the BY D D ...D h i2 is computed from its ik which has the minimum size. Chunks of GROUP are read along the dimension D{ k+1 When each chunk of D{ Di ...Di 1 2 k and aggregated to a chunk is complete, it is written on disk and its memory is used for the next chunk of Di Di ...Di . l Di^Di ...D{ 2 k 2 k Only one chunk of is kept in memory at any time. For example consider the array structure of Figure 3.16. This array is a 9 x 9 x 9 array with 3 x 3 x 3 array chunks stored in the dimension order ABC. Chunk numbers shown in Figure 3.16 indicate the order of the layout of array chunks on disk. In order to compute GROUP BY B , C the chunks are read in order from 1 to 27, and each 3 of them are aggregated to a BC chunk. The memory that is being used to compute a BC chunk will be released and reused after this chunk is 40 written on disk. The basic algorithm computes each GROUP BY independently. For example it scans ABC three times to compute AB, AC, and BC. Then it scans the smaller of AB and AC to compute A and the smaller of AB and BC to compute B, etc.. The algorithm is intelligent in choosing the smallest parent to compute each GROUP BY, and reusing the memory, but it is naive in the sense that it computes GROUP BYs independently. The actual array-based algorithm computes all the children of a parent in a single pass of it. 3.6.1 Single-pass Multi-way Array Algorithm When we compute multiple GROUP BYs at the same time, we need to allocate some memory to each of them. The amount of memory that is required, depends on the order in which the input array is scanned. The array-based algorithm uses a special logical order called dimension order to minimize this total amount of memory. To see how the order in which the input array is scanned determines the total memory required, consider the array structure of Figure 3.16. Assume we scan the array in ABC row major order. First we read chunk aObOcO of ABC. This chunk is aggregated along dimension A to chunk 60c0 of BC, along dimension B to chunk aOcO of AC and along dimension C to chunk aObO of AB. Then chunk albOcO of ABC is read and it is aggregated to chunks 60c0, alcO and albO. The third chunk of ABC, chunk a260c0, is read in and aggregated to chunks 60c0, a2c0 and a2b0. Now chunk 60c0 is complete and it is written on disk and its memory is reused for chunk 61c0. Chunk 61c0 is written on disk when chunks 4, 5 and 6 are aggregated and then its memory is reused for chunk 62c0, which is written on disk after chunks 7, 8 and 9 are aggregated. Chunk aOcO is complete when chunks 1, 4 and 7 are 41 aggregated along dimension B. Chunk aid) is written after chunks 2, 5 and 8 are aggregated and chunk a2cf) is ready when chunks 3, 6 and 9 are all aggregated along dimension B. Chunks of AB are complete only when all the chunks from 1 to 27 are aggregated along dimension C. In this example, to compute BC we need to keep 1 chunk of BC in memory, to compute AC we need to keep 3 chunks of AC in memory and for AB we need memory for 3x3 = 9 chunks of AB. In general we need |Z? ||C |u memory for BC, c c \Ad\\C \u memory for AC and |Ad||-Bd|u memory for AB, where \Xd\ indicates the c size of dimension X, \X \ stands for the chunk size of dimension X and u is the size C of each chunk element. The following rule states generalizes the idea: Rule 1: For a GROUP BY {D D ...D _ ) jl h jn the dimension order O = (DiD2-..D ), n 1 if (Dj Dj ...Dj _ ) n (DiD2-.-D ) of the array (D D ...D - ) l 1 2 n 1 2 n contains a prefix of with length p, 0 < p < n - 1, we allocate FJLi IAI x units of array element to GROUP BY (Dj Dj ...Dj _ ), 1 2 n 1 read in 1 n"=p+i where |Dj| is the size of dimension i and \d\ is the chunk size of dimension i. Based on the above rule, for many of the GROUP BYs the allocated memory is less than the actual size, since |Cj| is much smaller than \Di\ in most of the dimensions. The memory that is being saved in this way can be used to overlap computation of more GROUP BYs. Some kind of structure is needed to coordinate the overlapped computation. A spanning tree of the GROUP BYs' lattice can serve this purpose. For a given dimension order, different spanning trees require different amounts of memory. A minimum memory spanning tree will be defined next. A Minimum Memory Spanning Tree(MMST) for a cube (D\D2.-D ) n respect to a dimension order O = (DiD2-.D ) n 42 with has n + 1 levels with the root \ ABC 3*3*3 AB 9*9 A 9 AC 9*3 BC 3*3 B 3 C 3 ALL Figure 3.17: Minimum Memory Spanning Tree (Dj Dj ...Dj ) 1 2 n at level n. Any node N at level i, 0 < i < n, can be computed from all the nodes at level i + 1 that contain its dimensions. The parent of node N is the node at level i that minimizes the required memory to compute N according to rule 1. In other words, the parent of node N has the minimum common prefix with N among all the nodes at level i + 1 that contain node AT's dimensions. If there are more than one node that minimize the required memory, the one with minimum size is selected. Figure 3.17 shows the MMST for the cube ABC in the dimension order (A,B,C). The total memory required by all the nodes at each level of the MMST can be calculated using the following rule. It is assumed that the chunk size is the same for all dimensions, i.e., for all i, \d\ = c. Rule 2: The total memory requirement for level n—j of the MMST for a dimension order O = (Di,D2,--.,D ), is given by: n n?=7 I A I + c u , i n n s ' - 1 1 A D C + c u 43 +1,2)(n?=T 1 A 2 D C 2 +... + c(n - Different dimension orders of an array (DiD2-..D ) n Each of these MMSTs may have different MMSTs. may have different memory requirements. The optimal di- mension order is the dimension order whose MMST requires the minimum amount of memory. It is proved by Zhao et. al. in [6] that the optimal dimension order O is (D\D2-.-D ), n where \D\\ < |Z>21 < ••• < \D \. \Di\ denotes size of the dimension n Di. If the available memory is less than the memory required for the MMST for the optimal dimension ordering O, we can not allocate sufficient memory for some of the subtrees of the MMST. These subtrees are called incomplete subtrees. Naughton et. al. proposed a heuristic algorithm called Multi-pass, Multi-way, Array Algorithm in [6] for computing the cube in this situation. They used the heuristic of allocating memory to subtrees of the root from the right to the left order. 3.7 Performance comparison I have implemented PipeSort, Partitioned-Cube and Single-pass Multi-way Array algorithms using C++ programming language in a UNIX environment. Each of these algorithms was run on a PC with 266 MH Pentium processor with 128 MB of RAM. The time was measured from the point the input relation was read to the point the data cube was written on disk. In order to compare performance of these three algorithms I used real-world data on the cloud coverage of the globe presented in [15]. The data used corresponds to the measurement of the cloud coverage of the globe over a one month period, September 1985. The reason that this month was chosen is that the same data was used in [5], and I wanted to compare my experiment results with those in that paper. For each month there are two data sets available: one contains measurements made over the ocean, and another contains measurements 44 Table 3.2: Computing a 4-dimensional (30*24*2*158) Cube Number Pipe Sort Partitioned Cube Multi-way Array Algorithm Algorithm of Rows Algorithm 203074 406148 609222 812296 1015367 16.8 34.66 52.97 70.16 88.44 9.61 19.74 30.04 39.34 49.03 17.36 35.29 54.29 73.91 93.48 made over land. I chose the land data set. This data set consists of 1,015,367 rows which I divided into five nearly equal partitions. In each experiment I used one, two, three, four or all of the five partitions. There are about 20 different fields, among which the following were chosen as data cube dimensions, with cardinalities listed in parentheses: day(30), hour(24), sky brightness(2), latitude(180) and longitude(360). In each experiment I chose either the first four, or all the five attributes as cube dimensions. The aggregation attribute is a measure (between 0 and 8) of cloud coverage. Each row of the input data set was originally 56 character long, but for an implementation reason each row was expanded to 61 characters, so the expanded data file is about 62 MB. Tables 3.2 and 3.3 show the results of the experiments. Figures 3.18 and 3.19 show the same results in pictorial format. As tables 3.2, 3.3 show, Partitioned-Cube outperforms PipeSort on computing 4-dimensional cube, but they both perform almost the same on 5-dimensional cube computation. The reason may be the preprocessing that PipeSort performs on the input data to determine cardinalities of cube dimensions. PipeSort needs these cardinalities to estimate each GROUP BVs size and hence to assign a pair of costs to each edge of the search lattice. Partitioned Cube does not need to do that so 45 Table 3.3: Computing a 5-dimensional (30*24*2*158*360) Cube Number Pipe Sort Partitioned Cube Multi-way Array Algorithm Algorithm of Rows Algorithm 76.73 89.37 203074 68.79 139.52 155.06 169.05 406148 609222 236.03 247.01 212.56 288.34 328.57 317.36 812296 1015367 352.85 398.95 418.11 4-Dimensional(30*24*2*158) C u b e 100 n - Array Based 60 • 40 1 20 Pipe Sort - Partitioned Cube 203074 406148 609222 812296 1015367 Rows Figure 3.18: Computing a 4-dimensional (30*24*2*158) Cube Figure 3.19: Computing a 4-dimensional (30*24*2*158*360) Cube 46 \ it does not perform any preprocessing on the input data. As the number of cube dimensions increases, the time it takes to perform the preprocessing becomes less significant when compared to the total execution time of the PipeSort algorithm. The reason is that the time it takes to perform the preprocessing is 0(n) but the time it takes to compute GROUP BYs is 0(2 ). n Although Partitioned-Cube minimizes the number of sort operations and it takes advantage of partial sort order similarities among different GROUP BYs, it does not seem that it has a big advantage over PipeSort when the input data fits in memory and number of dimension attributes is not less than 5. PipeSort and Partitioned-Cube were not run on a data set that does not fit in memory, but in this case PipeSort has the advantage that it can handle the problem regardless of the distribution of data, but Partitioned-Cube may not be able to do so because it partitions the large data sets into smaller ones based on one of their dimension attributes. Partitioned-Cube recursively repartitions each partition until all of them fit in memory. Now the question is what happens if the distribution of values in the partitioning attribute does not allow us to partition the data set into partitions that fit into memory. For example consider gender attribute where there are only two possible values, e.g. male and female. One solution to this problem is to choose another attribute as partitioning attribute that has afinerdistribution. But what if all the dimension attributes have the same coarse distribution as gender attribute? Even if we can find an attribute with desired distribution in the first partitioning step, we may finally run into this problem when we want to choose a new attribute for future partitioning operations. PipeSort uses external sort algorithms if the input data set does not fit into memory. It does not rely on the distribution of data and so does not suffer from this problem. That means there are some cases that 47 Partitioned-Cube can not be used while PipeSort can always be used. The Multi-way Array algorithm is the only MOLAP algorithm that was implemented. This algorithm reads the input data in the same format as the two other ROLAP algorithms, but it writes its output in a different format than the other two algorithms. The output of Multi-way Array algorithm consists of a row for each valid cell of the data cube. For each cell its offset from the beginning of the chunk along with its value are stored, so the output of this MOLAP algorithm is more compressed than ROLAP algorithm outputs where both dimension attribute values and aggregation operation result are stored. Despite the advantage of producing smaller sized outputs, Multi-way Array algorithm did not perform well in any of the experiments, specially when the number of dimension attributes was higher. The reason may be that the Multi-way Array algorithm does not take advantage of pipelining, as a result increasing the number of dimension attributes has a more dramatic effect on its performance than it has on PipeSort or Partitioned-Cube algorithms. 48 1 Chapter 4 Parallel Algorithms Several parallel versions of the Partitioned-Cube and Single-pass Multi-way Array algorithms were designed and implemented. The reason that these two algorithms were chosen, was that the former had a good potential for parallelism and the latter was a good representative of MOLAP algorithms. Each of these algorithms was implemented in C++ using MPI as message passing library. They were run on a cluster of 16 Pentium 266 M H machines each with 128 MB Ram and 2 GB hard disk. 4.1 Parallel Partitioned-Cube Algorithm In this algorithm one of the nodes of the network is chosen as coordinator node. The input to and output of the algorithm are stored at this node. Usually data warehouses are centralized and so must be the result of any cube calculations. That is why having a coordinator node is a reasonable assumption. The input to Parallel Partitioned-Cube algorithm is a set of tuples i?, which may be partitioned in horizontal fragments, the dimension attributes {B\,B }, m 49 aggregation attribute A and the aggregate function G(.). The output of the algorithm is the data cube result for R over {B\,B }. m The output is returned in two fragments F and D. Fragment F contains the GROUP BY tuples at finest granularity level and D contains the remaining tuples. Like the sequential version, Parallel Partitioned-Cube algorithm chooses an attribute Bj among dimension attributes {Bi,B } m of the data cube. This at- tribute is used to partition input relation R into sets of tuples {Ri,R }- The n number of these fragments, n, is bounded by both number of available network nodes and the domain cardinality of the attribute Bj. Unlike the sequential version each fragment is sent to one of the nodes by coordinator node and then algorithm Memory-Cube is applied on all of the fragments concurrently. The result of MemoryCube algorithm at each node is sent back to the coordinator node. The union of FiS obtained by applying algorithm Memory-Cube on each individual fragment, results in F, the finest granularity GROUP BY tuples of relation R. Then algorithm Parallel Partitioned-Cube is applied on F using all the dimension attributes except Bj, the attribute that was used to partition relation R, and the result will be stored in F' and D'. Like the sequential version the union of F', D' and tuples of ZVs that include attribute Bj results in D, the remaining tuples of the final output. It is assumed that we can always find an attribute Bj that can be used to partition the input relation into fragments that fit into memory. The same assumptions were made in the sequential version of the algorithm by Ross et. al. in [10]. Three different versions of the Parallel Partitioned-Cube algorithm were implemented using the following three different methods: Method 1 : The coordinator node only partitions the input relation if it does not fit into memory, otherwise it applies the Memory-Cube algorithm and returns 50 the results. The coordinator node stores fragments locally on disk while performing partitioning process. Then it reads fragments back from the disk and sends each of them to one of the network nodes. Method 2 : The coordinator node partitions the input relation even if it does fit into memory. The coordinator node stores fragments locally on disk while performing partitioning process. Then it reads fragments back from the disk and sends each of them to one of the network nodes. Method 3 : The coordinator node only partitions the input relation if it does not fit into memory, otherwise it applies the Memory-Cube algorithm and returns the results. The coordinator node does not store fragments locally on disk. It maintains a buffer in memory for each fragment and sends the contents of the buffer to corresponding node when it becomes full and reuses the buffer for the rest of the partitioning process. The difference between methods 1 and 2 is in the criteria for partitioning the input relation. In method 1, the input relation is partitioned only if it does not fit into memory. In method 2, whether the input relation fits into memory or doesn't, it is partitioned and its fragments are sent to different nodes. The rationales for these two methods are as follows. After algorithm Parallel Partitioned-Cube is applied on the input relation R once, thefinestgranularity GROUP BY, F, is used as the input relation to a recursive call of the algorithm. F is likely to be significantly smaller than R, therefore it may fit in memory. Method 1 does not partition F if it fits in memory. This method is designed based on the assumption that the overhead of partitioning a fairly small relation and sending out its fragments to different nodes and then collecting back the results from these nodes dominates the 51 benefit of parallel processing of the relation. Method 2 is designed based on the opposite assumption. Method 3 uses the same criteria as method 1 for partitioning the input relation, but it behaves differently while it is partitioning the relation. In method 1 fragments of the input relation are stored locally during the partitioning phase and only after this phase is complete they are sent to other nodes of the network. Method 3 does not store fragments locally. It keeps a piece of each fragment in a buffer. When this buffer becomes full, its content is sent to the corresponding node of the network. The rational for this method is the fact that storing fragments on disk and then reading them back into memory can cause a significant overhead. This overhead can be avoided if we use method 3. In order to compare the relative performance of these algorithms experiments were performed and the total execution time of each algorithm was measured. To measure the execution time of an algorithm a timer class was used that worked like a stop watch. The timer class has methods to reset, start, stop and resume the timer. This timer was started at the very beginning of the program and stopped at the very end. Since the tested algorithms were all parallel, the maximum reported value among all the nodes was considered as the total execution time. Separate timers were used for CPU, I/O, communication and total execution time when it was necessary to break down the total execution time into its components. Figures 4.1 through 4.2 compare the execution time of each of these methods under different circumstances. There is not a significant difference between the performance of Methods 1 and 2. This suggests that it does not matter whether we partition and distribute small F relations that are resulted from the union of JFJ'S obtained by applying 52 g 200 | 150 .H=p- wm ,§ 100 rrr 1 r 609222 •-j— I 812296 — QMelhod 1 • - I i • Method 2 j • Method 3 1015367 Figure 4.1: Computing a 5-dimensional (30*24*2*158*360) Cube on two nodes using three different methods Figure 4.2: Computing a 5-dimensional (30*24*2*158*360) Cube on four nodes using three different methods algorithm Memory-Cube on each individual fragment, or process these small F relations in their entirety locally. Method 3 outperformed both methods 1 and 2 in all the experiments. This is due to the savings in I/O time that is made in this method. Writing fragments on the disk and reading them back into memory is a time consuming task. It may take up to 50% of the total execution time. The only disadvantage of this method over the other methods is its need to have a buffer in main memory for each fragment. But this memory overhead is well justified by the better performance of this method. This method was chosen to perform more experiments. Figures 4.3 and 4.4 show how this method scales up with respect to number of the nodes of the network. Method 3 shows a speed up close to linear from 1 to 2 and 2 to 4 nodes, but speed up decreases as number of nodes increase. The primary reason for this problem is the way the input relation R is partitioned. This relation 53 60 o <1> 4> E 40 20 1 0 • LZ l 203074 406148 1: ;|H L 812296 609222 L1' 1 ID 1 Node • 2 Nodes • 4 Nodes • 8 Nodes • 16 Nodes 1015367 Rows Figure 4.3: Computing a 4-dimensional (30*24*2*158) Cube on different number of nodes using method 3 500 400 300 200 100 0 BUT— • 203074 406148 o • • • • — rn 609222 EtL 812296 1 Node 2 Nodes 4 Nodes 8 Nodes 16 Nodes 1015367 Rows Figure 4.4: Computing a 5-dimensional (30*24*2*158*360) Cube on different number of nodes using method 3 is partitioned into almost equal fragments among 2 and 4 nodes, but in 8 and 16 node cases it is not partitioned evenly among all of the nodes and that is because of the limitation of the hash function that was used and the domain cardinality of the partitioning attribute. This suggests that using a good hash function and choosing an attribute with large enough cardinality is critical in achieving a good speed up. Tables 4.1 and 4.2 show how the total execution time breaks down into CPU, I/O and Communication times when method 3 is used on two nodes. In both 4-dimensional and 5-dimensional cases CPU takes between 55% to 54 Table 4.1: Computing a 4-dimensional (30*24*2*158) Cube on 2 nodes using method 3 ' Number CPU Time I/O Time Communication Time of Rows Percentage Percentage Percentage . 203074 406148 609222 812296 1015367 59.78 59.08 60.32 60.45 59.39 38.70 39.34 38.00 38.11 39.00 1.52 1.58 1.68 1.44 1.61 Table 4.2: Computing a 5-dimensional (30*24*2*158*360) Cube on 2 nodes using method 3 Number CPU Time I/O Time Communication Time of Rows Percentage Percentage Percentage 203074 406148 609222 812296 1015367 59.48 55.67 58.46 57.58 56.93 39.59 42.95 40.45 41.06 41.60 55 0.93 1.38 1.09 1.36 1.47 61%, I/O between 38% to 43% and communication less than 2% of the total execution time. This suggests that method 3 is more CPU bound than I/O bound and its communication overhead is minimal. 4.2 Parallel Single-pass Multi-way Array Algorithm In this algorithm one of the network nodes works as coordinator node. The input and output of the algorithm are stored at this node. In the beginning of the algorithm this coordinator node reads the input data into memory and distributes it evenly among all the nodes of the network. This phase could be eliminated if the input relation was already partitioned and distributed among the nodes. For ease of implementation it is assumed that each node has enough free memory to receive and hold one fragment of the input data in main memory. Unlike parallel Partitioned-Cube algorithm, parallel Multi-way Array algorithm does not use any of the dimension attributes for partitioning purpose. Instead it just sends a group of rows to a node when it reads as many rows as the number of rows per partition. This makes the partitioning process faster and creates more equal partitions than in the parallel Partitioned-Cube algorithm. When the partitioning process is complete then each node extracts unique values of each dimension attribute of the raw data fragment it has in memory and sends out the set of unique values to the coordinator node. The coordinator node places all the unique value sets for an attribute together and sorts them. Then it sends back the sorted global unique value set for that attribute to each node and each node finds the position of each of the unique values it has extracted in the global sorted list of unique values. This process is repeated for all of the dimension attributes. After this phase is completed each node computes the core cuboid 56 (cuboid resulted from computing GROUP B Y all of the dimension attributes) for the fragment of the input data it has in memory. The core cuboids are stored on disk for further processing. Once core cuboids are computed, input data fragments are not needed any more, so the memory that was allocated to them can be released. The next phase of the algorithm is to read the core cuboid and compute the other cuboids of the data cube. This is done at each node separately, therefore the resulting cuboids at different nodes need to be combined. After all the cuboids other than the core cuboid are combined and stored, chunks of the core cuboid are sent directly to the coordinator node and combined and stored there. The computation and combination of different cuboids of the data cube are implemented using four different methods. Method 1 : Cuboids are computed at different nodes independently and are combined at the end by sending them directly to the coordinator node and merging them at this node. Method 2 : Cuboids are computed at different nodes independently and are combined at the end by sending them indirectly to the coordinator node and merging them at this node. Method 3 : Cuboids are computed at different nodes and combined at coordinator node at the same time by sending each computed chunk directly to the coordinator node and merging these chunks at this node. Method 4 : Cuboids are computed at different nodes and combined at coordinator node at the same time by sending each computed chunk indirectly to the coordinator node and merging these chunks at this node. 57 0 9 00 Q- F i g u r e 4.5: H y p e r c u b e structure In methods 1 a n d 2 there is no interaction among different nodes while they are reading their own core c u b o i d a n d c o m p u t i n g the other cuboids. I n these two methods each node stores chunks of different cuboids o n disk locally a n d sends these chunks to coordinator node once all the chunks o f all the cuboids are computed. T h e difference between these two methods is i n the way the c o m p u t e d chunks are sent to the coordinator node. In m e t h o d 1 these computed chunks are sent directly from each node to the coordinator node, while i n m e t h o d 2 chunks are sent to the coordinator node i n several steps. T h e number of these steps is equal to d — log(N) where N is number of the nodes of the network. In m e t h o d 2 nodes of the network form a d-dimensional hypercube structure (see F i g u r e 4.5). A d-dimensional hypercube consists of N = 2 d node is assigned a label 0 , 2 n nodes where each — 1. T w o nodes communicate w i t h each other i f a n d only i f their corresponding binary representation of their labels differs i n exactly one bit. Therefore each node communicates w i t h d = log(N) other nodes. In the first step of sending chunks to the coordinator node, each node w i t h 1 in the first bit of its binary number sends its c o m p u t e d chunks to the neighbor node w i t h 0 i n the first bit of its binary number. T h e receiver nodes combine received chunks w i t h their own c o m p u t e d chunks. In the second step of sending chunks to 58 the coordinator node, each receiver node of the previous step w i t h 1 i n the second bit of its b i n a r y number sends its c o m p u t e d chunks to another neighbor phase 1 receiver node w i t h 0 i n the second bit of its b i n a r y number. T h e receiver nodes combine received chunks w i t h the chunk results of the previous step. T h i s process will be repeated d = log(N) times where N is number of nodes of the network. After the final step node 0, the coordinator node, will have the combined chunks of all the other nodes a n d can store t h e m as final results on disk. T h e reason that hypercube structure is used explicitly rather t h a n i m p l i c i t l y by using standard gather operation is the fact that i n the gather operation all of the nodes send their d a t a to a single node based on a hypercube structure a n d it is up to this node to combine the received d a t a to form the desired result where as here each node that receives some d a t a combines it w i t h its own d a t a before it sends it to another node, therefore the c o m b i n i n g process is distributed among all of the receiver nodes a n d is performed faster, In methods 3 a n d 4 each node sends a chunk either directly or indirectly to coordinator node as soon as that chunk is computed. Therefore no nodes other t h a n coordinator node stores chunks locally. Like the previous 2 methods the difference between these two methods is i n the way the c o m p u t e d chunks are sent to the coordinator node. In m e t h o d 3 these c o m p u t e d chunks are sent directly from each node to the coordinator node, while in m e t h o d 4 chunks are sent to the coordinator node i n the same fashion as i n m e t h o d 2. T h e rationales for these four methods are as follows. In methods 1 a n d 2 c u b o i d chunks are sent i n bulk to the coordinator node but i n methods 3 a n d 4 they are sent one at a time, therefore c o m m u n i c a t i o n overhead is lower i n methods 1 a n d 2 t h a n methods 3 a n d 4. O n the other h a n d methods 1 a n d 2 store c o m p u t e d chunks 59 350 300 - is. - I fl fl sSlMlill f" 203074 406148 609222 812296 L U H • Method 1 • Method^ n Method \ J" • Method 4 fl 1015367 Rowi Figure 4.6: Computing a 5-dimensional (30*24*2*158*360) Cube on two nodes using four different methods 203074 406148 609222 812296 1015367 Figure 4.7: Computing a 5-dimensional (30*24*2*158*360) Cube on four nodes using four different methods locally on disk while methods 3 and 4 send them to the coordinator node while computing them, therefore methods 3 and 4 do not spend I/O time for writing and reading back computed chunks at each node. Methods 1 and 3 send chunks directly to the coordinator node while methods 2 and 4 send them indirectly based on the hypercube structure. The advantage of sending chunks indirectly to the coordinator node over sending them directly is to distribute the combining task among different nodes and therefore saving in the time it takes to combine all the chunks. On the other hand sending chunks indirectly to the coordinator node causes higher communication overhead. Figures 4.6 and 4.7 compare the execution time of each of these methods under different circumstances. These four methods do not show significant differences in computation of data cube on 2 nodes. In computation of data cube on 4 nodes, method 4 outperforms the 60 100 i 5. 60 » f 40 20 0 n ! • I v.., 203074 "L l i m , 406148 Brn, 609222 iff,-• I H 812296 • 1 t • 1 Node • 2 Nodes B4 Nodes • 8 Nodes o 16 Nodes 015367 Rows Figure 4.8: Computing a 4-dimensional (30*24*2*158) Cube on different number of nodes using method 4 500 S 400 8. 300 | 200 f 100 0 H Hi i n 203074 IT—; (IiiTlf! 406148 609222 11 812296 H1 Node • 2 Nodes Q4 Nodes • 8 Nodes • 16 Nodes 1015367 Rows Figure 4.9: Computing a 5-dimensional (30*24*2*158*360) Cube on different number of nodes using method 4 other three methods in most experiments. This is mostly due to the saving in I/O time that is achieved in this method. Writing chunks on the disk and reading them back in memory can take about 10% of the total execution time. The other reason is the saving made in combination time in method 4. This method was chosen to perform more experiments. Figures 4.8 and 4.9 show how method 4 scales up with respect to number of nodes in the network. Method 4 shows some speed up from 1 to 2 and 2 to 4 nodes, but speed up decreases dramatically and even turns into slow down as the number of the nodes increases. This may partially be due to the communication overhead that is associated with this method. Tables 4.3 and 4.4 show how the total execution time breaks down into CPU, 61 Table 4.3: Computing a 4-dimensional (30*24*2*158) Cube on 2 nodes using method 4 Number CPU Time I/O Time Communication Time of Rows Percentage Percentage Percentage 203074 406148 609222 812296 1015367 66.30 65.22 63.31 63.75 63.59 29.39 30.65 32.39 31.96 32.23 4.31 4.13 4.30 4.29 4.18 Table 4.4: Computing a 5-dimensional (30*24*2*158*360) Cube on 2 nodes using method 4 Number CPU Time I/O Time Communication Time of Rows Percentage Percentage Percentage 203074 406148 609222 812296 1015367 59.74 57.49 56.26 56.47 56.77 27.81 29.25 30.35 30.04 29.46 62 12.45 13.26 13.39 13.49 13.77 I/O and Communication times when method 4 is used on two nodes. v. In 4-dimensional case C P U takes between 63% to 67%, I/O between 29% to 33% and communication about 5% of the total execution time. In 5-dimensional case C P U takes between 56% to 60%, I/O between 27% to 31% and communication about 13%of the total execution time. This suggests that the role of the communication time becomes more important as number of the cube dimensions increases. This is consistent with the poorer speed up of method 4 when it was used to compute a 5-dimensional cube than the case it was used to compute a 4-dimensional cube (see Figures 4.8 and 4.9). 63 Chapter 5 Conclusion In the previous chapters a detailed description of three different sequential algorithms for computing data cubes along with performance comparison of these algorithms were presented. Two of these algorithms, PipeSort and Partitioned-Cube, were R O L A P and the third one, Multi-way Array, was a M O L A P algorithm. Experiments split in showing the best between the two R O L A P algorithms but the M O L A P algorithm was always outperformed by the two R O L A P algorithms. Partitioned- Cube and Multi-way Array algorithms were chosen for parallel implementation. Several parallel versions of each of these two algorithms were implemented. For each algorithm the best implementation was chosen to perform more experiments. Experiments showed that Parallel Partitioned-Cube algorithm performed better than Parallel Multi-way Array algorithm and it scaled up better with the number of the nodes of the network. This suggests that Partitioned-Cube algorithm is a more suitable candidate for incorporating parallel processing although in order to achieve the best results of this algorithm, it should be made sure that in the distribution phase the input relation is evenly distributed among the nodes of the 64 network. T h i s can be done by choosing a n appropriate partitioning attribute a n d a p p l y i n g a hash function w i t h enough discriminating capability. 65 Bibliography [1] E. F. Codd: Providing OLAP to user-analysts: An IT mandate. Technical Report, E. F. Codd and Associates: 1993 [2] Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and SubTotal. ICDE 1996: 152-159 [3] Sunita Sarawagi, Rakesh Agrawal, Ashish Gupta: On Computing the Data Cube. Research Report, IBM Almaden Research Center [4] Prasad Deshpande, Sameet Agarwal, Jeffrey F. Naughton, Raghu Ramakrishnan: Computation of Multidimensional aggregates. Technical Report-1314, University of Wisconsin-Madison, 1996 [5] Kenneth A. Ross, Divesh Srivastava: Fast Computation of Sparse Datacubes. VLDB 1997 [6] Yihong Zhao, Prasad Deshpande, Jeffrey F. Naughton: An Array-Based Al- gorithm for Simultaneous Multidimensional Aggregates. SIGMOD Conference 1997: 159-170 66 [7] Sanjay Goil, Alok N. Choudhary: High Performance Data Mining Using Data Cubes On Parallel Computers. IPPS/SPDP: 1998 [8] Ching-Tien Ho, Rakesh Agrawal, Nimrod Megiddo, Ramakrishnan Srikant: Range Queries in OLAP Data Cubes. SIGMOD Conference 1997: 73-88 [9] Ching-Tien Ho, Jehoshua Bruck, Rakesh Agrawal: Partial-Sum Queries in OLAP Data Cubes Using Covering Codes. IEEE Transactions on Computers 47(12): 1326-1340 (1998) [10] Kenneth A. Ross, Divesh Srivastava, Damianos Chatziantoniou: Complex Aggregation at Multiple Granularities. EDBT 1998: 263-277 [11] Sunita Sarawagi, Rakesh Agrawal, Nimrod Megiddo: Discovery-Driven Exploration of OLAP Data Cubes. EDBT 1998: 168-182 [12] Jiawei Han: Towards On-Line Analytical Mining in Large Databases, newblock SIGMOD Record 27(1): 97-107 (1998) [13] Nick Roussopoulos, Yannis Kotidis, Mema Roussopoulos: Cubetree: Organization of and Bulk Updates on the Data Cube, newblock SIGMOD Conference 1997: 89-99 [14] Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Implementing Data Cubes Efficiently. SIGMOD Conf. 1996: 205-216 [15] C. J. Hahn, reports from S. G. Warren, ships and land J. London: stations over http://cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html, 67 Edited synoptic cloud the 1994. globe, 1982-1991. [16] Prasad Deshpande, Karthikeyan Ramasamy, Amit Shukla, Jeffrey F. Naughton: Caching Multidimensional Queries Using Chunks. SIGMOD Conference 1998: 259-270 [17] Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish. Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi: On the Computation of Multidimensional Aggregates. VLDB 1996: 506-521 [18] Rakesh Agrawal, Ashish Gupta, Sunita Sarawagi: Modeling Multidimensional Databases. ICDE 1997: 232-243 [19] Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Index Selection for OLAP. ICDE 1997: 208-219 [20] Jeffrey D. Ullman: Efficient Implementation of Data Cubes Via Materialized Views. KDD 1996: 386-388 [21] Amit Shukla, Prasad Deshpande, Jeffrey F. Naughton, Karthikeyan Ra- masamy: Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. VLDB 1996: 522-531 [22] Goetz Graefe: Query Evaluation Techniques for Large Databases. Computing Surveys 25(2): 73-170 (1993) [23] Damianos Chatziantoniou, Kenneth A. Ross: Querying Multiple Features of Groups in Relational Databases. VLDB 1996: 295-306 68
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Parallel computation of data cubes
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Parallel computation of data cubes Momen-Pour, Soroush 1999
pdf
Page Metadata
Item Metadata
Title | Parallel computation of data cubes |
Creator |
Momen-Pour, Soroush |
Date Issued | 1999 |
Description | Since its proposal, data cube has attracted a great deal of attention in both academic and industry research communities. Many research papers have been published about different issues related to data cubes and many commercial OLAP (On-Line Analytical Processing) systems have been released to market with data cube operations as their core functions. Several algorithms have been proposed to compute data cubes more efficiently. PIPESORT and PIPEHASH algorithms proposed by Agrawal et. al., OVERLAP algorithm proposed by Naughton et. al., Partitioned-Cube algorithm proposed by Ross et. al. and the Multi-way Array algorithm proposed by Naughton et. al. are the most significant ones. All of these algorithms are designed for implementation on sequential machines, however computing a data cube can be an expensive task. For some organizations it may take a very powerful computer working around the clock for a week to compute all the data cubes they may want to use. Application of parallel processing can speed up this process. Despite the popularity and importance of data cubes, very little research has been carried out on the parallel computation of them. The only parallel algorithm for computation of data cubes, which I am aware of, is the algorithm proposed by Goil et. al.. Their algorithm works for cases where the data set fits in main memory, however, real world data sets rarely fit in main memory. The wide spread availability of inexpensive cluster machines makes it possible to use parallel processing for computation of data cubes, even in small size firms and as a result there could be a real demand for efficient parallel data cube construction algorithms. I have designed and implemented two parallel data cube computation algorithms (Parallel Partitioned-Cube algorithm and Parallel Single-pass Multi-way Array algorithm) based on sequential algorithms proposed in literature. The former algorithm is classified as a ROLAP (Relational OLAP) algorithm and the second one is considered as a MOLAP (Multi-dimensional OLAP) algorithm. |
Extent | 4798937 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-06-26 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051488 |
URI | http://hdl.handle.net/2429/9746 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1999-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1999-0570.pdf [ 4.58MB ]
- Metadata
- JSON: 831-1.0051488.json
- JSON-LD: 831-1.0051488-ld.json
- RDF/XML (Pretty): 831-1.0051488-rdf.xml
- RDF/JSON: 831-1.0051488-rdf.json
- Turtle: 831-1.0051488-turtle.txt
- N-Triples: 831-1.0051488-rdf-ntriples.txt
- Original Record: 831-1.0051488-source.json
- Full Text
- 831-1.0051488-fulltext.txt
- Citation
- 831-1.0051488.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051488/manifest