The Summarization of Hierarchical Data with Exceptions

by

Shaofeng Bu

M.Eng., Xi'an Jiaotong University, P.R. China, 2001
B.Eng., Xi'an Jiaotong University, P.R. China, 1998

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia

October 2004

© Shaofeng Bu, 2004

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

In many applications of OLAP or data warehouses, users need to query data of interest, such as a set of data that satisfies specific properties. A normal answer to such a query simply enumerates all the interesting cells. This is the most accurate but not the most informative method. Summarization is needed in order to return more concise descriptions of these interesting cells to the users.
The MDL approach has been applied to hierarchical data to obtain concise descriptions. However, in many cases the descriptions are not concise enough for users. Another method, GMDL, can generate much shorter descriptions, but GMDL descriptions are not truly pure. The motivation of our research is to overcome the disadvantages of both methods.

In this thesis, we propose a method that generates summarizations with exceptions for hierarchical data. We extend the MDL approach to include some exceptions in the description, where the exceptions are uninteresting cells. The resulting description with exceptions is pure, which means that the description covers only "interesting cells". We call this new approach MDLE, i.e. MDL with exceptions. Our new approach aims to find the shortest description with exceptions that covers all "interesting cells". First, we study two simple cases that can be solved in polynomial time and we give the algorithms. Second, we prove that MDL with exceptions is an NP-Hard problem in the general case and we propose three heuristics. Finally, we report experiments comparing MDLE with MDL and GMDL. The results show that MDLE generates more concise descriptions than MDL, and that MDLE produces shorter descriptions than GMDL when the white ratio is low or when there are some red cells.

Contents

Abstract
Contents
List of Figures
Acknowledgements

1 Introduction
  1.1 Motivating Example and Problem Statement
  1.2 Contributions
  1.3 Outline of Thesis

2 Background and Related Work
  2.1 Rectangle Covering Problem
  2.2 MDL and GMDL for Hierarchical Data
    2.2.1 MDL-Tree Algorithm
    2.2.2 GMDL-Tree Algorithm
  2.3 Automatic Subspace Clustering
  2.4 Concise Descriptions

3 Complexity Analysis
  3.1 One-Dimensional Hierarchy: A Tractable Case
    3.1.1 Basic Definitions
    3.1.2 Problem Statements
    3.1.3 Algorithm for MDL with Exceptions
  3.2 NP-Completeness for 2-Dimensional Hierarchy
    3.2.1 Basic Definitions
    3.2.2 Problem Statements
    3.2.3 NP-Completeness Proof Part I: CWE Bipartite Graph
    3.2.4 NP-Completeness Proof Part II: MDLE

4 Heuristics and Experiments
  4.1 A Tractable Case: No Shared Holes
    4.1.1 Preliminaries
    4.1.2 Algorithm
    4.1.3 Multi-Level Hierarchy Also Hard
  4.2 Heuristics
    4.2.1 Basic Definitions
    4.2.2 A Greedy Heuristic
    4.2.3 Dynamic Programming
    4.2.4 Quadratic Programming
  4.3 Experimental Results
    4.3.1 MDL vs MDL with Exceptions
    4.3.2 GMDL vs MDL with Exceptions
    4.3.3 Comparison in 3-Dimensional Case
    4.3.4 Hole Ratio

5 Conclusions and Future Work
  5.1 Summary and Conclusion
  5.2 Future Work
    5.2.1 Summarization of Holes
    5.2.2 Representation of Holes
    5.2.3 Approximation Rate

Bibliography

List of Figures

1.1 Motivating Example
3.1 An Example for 1-Dimensional Hierarchy
3.2 Example for Bottom-Up Algorithm
3.3 An Example for 2-Dimensional 2-Level Hierarchy
3.4 Complete Edge-Weighted Bipartite Graph
3.5 Reduction from Clique
3.6 CEW Bipartite Graph: G = (V, E)
3.7 Correspondent Data Cube
4.1 Data Region without Shared Holes
4.2 Multi-Level Hierarchical Data Cube
4.3 Example for Heuristics
4.4 Example for Quadratic Programming
4.5 MDL vs. MDL with Exceptions
4.6 GMDL vs. MDL with Exceptions
4.7 GMDL with Red Cells
4.8 Experiments of 3-Dimensional Hierarchy
4.9 Hole Ratio
Example of Summarization on Holes

Acknowledgements

I would like to express sincere gratitude to my supervisor, Dr. Raymond T. Ng, for his helpful suggestions and constant encouragement throughout the course of my research work.
Without his guidance and inspiration, this research work would never have been possible.

I highly appreciate the help from Dr. Laks V.S. Lakshmanan. Without the fruitful discussions with him and his excellent advice, this research work would not have been done in time.

I would also like to thank Dr. Will Evans, who was a member of my supervisory committee, for his kindness and valuable instructions. I would like to thank Dr. George K. Tsiknis for reviewing my thesis. His comments and suggestions were very helpful and valuable.

I would like to thank all the members of the database systems lab, especially Dr. Ramesh Ganesh and Xiaodong Zhou, who gave me lots of good advice in my research work. I am very grateful to Dr. Ming Huo, who gave me many good suggestions for my thesis writing.

Finally, I would like to express my deep gratitude to my parents. Without their endless support, understanding, patience and suggestions, I could not have overcome the difficulties in my life and study.

SHAOFENG BU

The University of British Columbia
October 2004

Chapter 1

Introduction

With the increase of data accumulated in application databases and the demand to analyze transaction data for decision support, data warehousing and on-line analytical processing (OLAP) have been studied and widely used in the last decade. In contrast to on-line transaction processing (OLTP) applications, data warehousing and OLAP systems deal with historical, summarized and consolidated data, which are more meaningful and useful than detailed, individual records [18][3]. Data analysis applications generally aggregate data across several different dimensions in order to find useful or meaningful, unusual or hidden trends or patterns [6], and the data in such applications is stored under the multidimensional model.
For example, in a hospital data warehouse, the data for patients might include birthday, sex, occupation and birthplace as interesting dimensions. Data warehousing and OLAP applications present the interesting data to users through a multidimensional model, which is a data cube. Dimensions can often be represented in a hierarchical structure [3]. Birthplace can be a city-province-country hierarchy. Birthday is a day-month-year hierarchy. In a hierarchy, each ancestor or upper-level node is the summarization of its descendant nodes or lower-level nodes.

Given a multidimensional hierarchical data cube, how to represent an interesting data set in this cube by a more concise and meaningful method is a good research problem, and it is helpful for finding unusual or hidden patterns in the data set. For example, consider a 2-dimensional data cube in the form of a 10 x 10 matrix: there are 10 leaf nodes in each dimension and the data cube contains 100 data cells. A user's query needs to select all the data cells that satisfy the user-defined property p. Suppose the query returns 80 single data cells to the user. The problem is how to express all these interesting data cells. Shall we enumerate them or do some summarization? The summarization of the data has two advantages. First, the length of the expression is much shorter. Second, the expression contains the hierarchy information and is more meaningful to the user.

1.1 Motivating Example and Problem Statement

Data summarization has been studied in [1][11][19][14]. The goal of summarization is to use a concise description to express a set of data. In a multidimensional data cube, a data cell is defined as a tuple that contains a leaf node in each dimension. A data region is defined as a tuple that contains some internal non-leaf nodes in some dimensions. Both data cells and data regions are axis-parallel rectangular regions.
When users apply any query on the data cube, the system returns all the data cells that satisfy the properties defined by the users. All these data cells are interesting cells to the user and are marked as blue cells, while the rest of the cells are non-blue cells.

A summary is a non-redundant covering of interesting cells consisting of maximal axis-parallel rectangular regions [1]. We also call such a covering a description for all interesting cells. In a hierarchical data cube, a description contains data cells and data regions. The length of a description is defined as the cardinality of this description, which is the number of the included data cells and data regions.

Minimum Description Length (MDL) was first proposed in [8]. MDL is to find the most concise description for a set of data cells. An algorithm for the MDL approach in the hierarchical data cube, the MDL-Tree algorithm, has been proposed in [11]. The MDL-Tree algorithm finds the unique description with the minimum length for a set of interesting data cells in a pure hierarchical data cube. The authors also generalized the MDL approach to GMDL to decrease the length of the description. In the GMDL approach, data cells are grouped into three categories: "interesting cells", "don't care cells" and "undesirable cells". In contrast to the MDL description, which only includes "interesting cells", a GMDL description can include some "don't care cells" to cover all the "interesting cells". The GMDL description has a shorter length than the MDL description.

Before we present the statements of the problems with which this thesis is concerned, let us study an example.

Example 1.1.1. In order to see how we can summarize interesting data cells in a multidimensional hierarchical data cube, we begin with an example from a sales department. The hierarchical structure is the same as in [11]. Clothes and location are the two dimensions.
The measure is still the sales dollar amount. Each dimension has the hierarchical structure shown in Figure 1.1. For hierarchical data, a data cell corresponds to a tuple of leaf-level values in each dimension [11], such as (Vancouver, skirts) and (Chicago, jackets). A data region is a tuple that contains internal non-leaf nodes in some dimensions. (Vancouver, men's) and (northwest, jackets) are data regions. All the data cells that satisfy the user-defined property are interesting cells; we call them "blue cells" and mark them with '○' in the figure. All unmarked cells are uninteresting cells and are called non-blue cells.

Figure 1.1: Motivating Example

Suppose the company manager still wants to summarize all "hot" items this year. The manager still defines a cell as blue if the sales amount in this year is at least 2 times that of last year. The data of 2004 is different from the data of 2002 in [11]. We find that (Vancouver, skirts), (Boston, ties) and (New York, dress shirts) are not blue any more, which means these three items are not "hot" items this year and do not satisfy the property defined by the manager. So we cannot get R1, R3, R9 in Figure 1.1 for 2004.

The MDL approach merges children regions into their parent region if all these children regions contain only blue cells. For example, (Minneapolis, ties) and (Chicago, ties) are merged into their parent region, i.e. (midwest, ties). For the data of this year, the MDL approach generates a covering containing 10 regions, i.e. R2, R4, R5, R6, R7, R8, R10, R11, R12, R13, and 8 single blue cells that are not covered by these 10 regions. This covering contains 10 regions and 8 cells, and we say the length of this covering is 18. Now the question here is whether we can do better. Can we generate a shorter covering?
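The length arithmetic of this example can be sketched in a few lines of Python. This is an illustrative sketch only: `covering_length` and the region lists are mine, echoing the names used in the example, and the 12-length covering with exceptions is the one constructed in the remainder of this section.

```python
# Length of a covering = number of regions/cells used to cover the blue
# cells, plus the number of excepted non-blue cells (holes).

def covering_length(parts, holes=()):
    return len(parts) + len(holes)

# MDL covering of Example 1.1.1: 10 regions plus 8 single blue cells.
mdl_parts = ["R2", "R4", "R5", "R6", "R7", "R8",
             "R10", "R11", "R12", "R13"] + ["cell"] * 8
print(covering_length(mdl_parts))  # 18

# Covering with exceptions: regions R1..R9 minus 3 non-blue holes.
mdle_parts = [f"R{i}" for i in range(1, 10)]
holes = [("Vancouver", "skirts"), ("Boston", "ties"),
         ("New York", "dress shirts")]
print(covering_length(mdle_parts, holes))  # 12
```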
We know that the reason R1, R3 and R9 are not in the MDL covering is that (Vancouver, skirts), (Boston, ties) and (New York, dress shirts) are not blue this year. We define {R1 − (Vancouver, skirts)} as all the data cells in region R1 except (Vancouver, skirts), and we similarly define {R3 − (Vancouver, skirts)} and {R6 − (Boston, ties) − (New York, dress shirts)}. Therefore we can generate a new covering containing 9 regions, i.e. R1, R2, R3, R4, R5, R6, R7, R8, R9, with the exceptions of (Vancouver, skirts), (Boston, ties) and (New York, dress shirts). Now the new covering includes 9 regions and 3 cells (as exceptions). The length of the new covering is 12.

From this example we know that if we include some non-blue cells in the covering, then we can generate a shorter covering for the blue cells. At the same time, this kind of covering is more informative and meaningful to users. For instance, in Figure 1.1, we use {R1 − (Vancouver, skirts)} to express the blue cells instead of enumerating all of them. Therefore users know that all the data cells in R1 satisfy the query except (Vancouver, skirts). This is much more intuitive and meaningful than only enumerating all blue cells.

In this thesis, we focus on the multidimensional data cube with a hierarchy in each dimension. We call this kind of data cube the multidimensional hierarchical data cube. The hierarchy associated with the data cube is a tree hierarchical structure, which is the common hierarchy in many real application systems. A description with some exceptions covers all the blue cells while including some non-blue cells: we cover all blue cells by a union of regions and blue cells with the exception of some non-blue cells. This description is still a covering for the blue cells. The included non-blue cells are also called holes in the covering. Given a description with exceptions, the length of this description is defined as the cardinality of the description, i.e.
the number of the contained regions and blue cells plus the number of included non-blue cells. MDL with exceptions (MDLE) is to find the shortest description with exceptions.

1.2 Contributions

This thesis studies the problem of how to do summarization with exceptions on pure hierarchical data, which we call MDL with exceptions for pure hierarchical data. MDL summarization has been studied and the MDL-Tree algorithm has been proposed in [11]. This thesis extends the MDL covering to include holes in order to shorten the length of the covering. The major contributions of this thesis are:

• We prove that MDL with exceptions is an NP-Hard problem. Even though MDL summarization can be solved in time linear in the size of the hierarchies [11][19], we can see that as soon as we include data cells as exceptions in the description, the problem becomes NP-Hard.

• Two tractable cases are studied and algorithms are given. We show two simple cases that are tractable and solvable. We study these two cases and analyze why they can be solved in polynomial time. We also propose algorithms for each case.

• We propose three heuristics. We show some examples to illustrate why this problem is hard to solve. We propose three heuristics, and experimental results are shown to compare them with the MDL-Tree algorithm and the GMDL approach [11]. These three heuristics generate much shorter descriptions than the MDL-Tree algorithm. They also work better than GMDL when the white ratio is not too high, and work much better than GMDL when there is a fraction of red cells.

1.3 Outline of Thesis

The rest of this thesis is organized as follows:

• Chapter 2 - Background and Related Work.

• Chapter 3 - Complexity Analysis. In this chapter we discuss the complexity of the MDLE problem and prove that it is an NP-Hard problem. We also study the 1-dimensional case, which is tractable.

• Chapter 4 - Heuristics and Experiments.
We study more about this NP-Hard problem and propose three heuristics: Greedy, Dynamic Programming, and Quadratic Programming. We also show a comparison among these three heuristics, the MDL-Tree algorithm and the GMDL-Tree algorithm through experimental results.

• Chapter 5 - Conclusions and Future Work. We summarize this thesis and propose some ideas for further work.

Chapter 2

Background and Related Work

2.1 Rectangle Covering Problem

The MDL covering problem is similar to the rectangle covering problem, which is to find the minimum number of axis-parallel rectangles to cover a 2-dimensional rectilinear polygon. The rectangle covering problem has been studied by many researchers and is shown to be NP-Hard in [16]. The rectangle covering of polygons without holes has been studied in [4]. In [10] the authors propose an O(√log n)-factor approximation algorithm for covering a rectilinear polygon with holes using axis-parallel rectangles.

Our problem is different from the rectangle covering problem. This thesis studies hierarchical data. Therefore, the "rectangle regions" in MDL with exceptions can only be tuples that contain a node from the hierarchy in each dimension, rather than arbitrary regions. The other difference is that we deal with multidimensional data, not only 2-dimensional data.

2.2 MDL and GMDL for Hierarchical Data

In [11], the authors proposed a generalized MDL approach for summarization, called GMDL. In many cases, it is hard to define strict criteria for query properties. The user may define what kind of cells are definitely "interesting" and what kind of cells are definitely "undesirable". All other cells are called "don't care" cells. The interesting cells are named "blue cells". The undesirable cells are called "red cells". The "don't care" cells are "white cells".
GMDL tries to shorten the length of the covering by allowing some white cells in the covering. By adding white cells, some regions in the MDL covering can be merged into big regions, and then the total number of regions decreases.

First, the authors of [11] applied the GMDL approach to spatial data and proposed four GMDL algorithms to summarize spatial data with white cells. Second, that paper studied hierarchical data and presented an MDL-Tree algorithm whose running time is linear in the sizes of the hierarchies. Then, based on the MDL-Tree algorithm, the paper proposed the GMDL-Tree algorithm for hierarchical data. From the experiments we know that the GMDL approach generates a more concise covering for blue cells.

In this thesis we focus on MDL with exceptions on pure hierarchical data. Now let us discuss the MDL-Tree algorithm and the GMDL-Tree algorithm for hierarchical data as they were presented in [11].

2.2.1 MDL-Tree Algorithm

The MDL-Tree algorithm finds the shortest MDL covering for a set of blue cells. It visits the nodes in each hierarchy. Regions in a lower level are then merged into a region in a higher level if all children regions of a node contain only blue cells.

From Example 1.1.1 we already know that the covering generated by the MDL-Tree algorithm is still not short enough. In this thesis we extend the MDL-Tree algorithm to allow some holes in a covering. For example, suppose a node has 10 children cells and 9 of those children cells are blue. The MDL-Tree algorithm cannot merge those 9 children into a bigger region because one of the children is not a blue cell. Therefore the covering for those 9 blue cells returned by the MDL-Tree algorithm enumerates them, and the length of this covering is 9. But MDL with exceptions (MDLE) can generate a shorter covering by including that non-blue child as a hole in the covering. The covering is in the form of this
The covering is in the form of this node excepting its non-blue child cell and the length of this covering is 2. 2 .2 .2 G M D L - T r e e A l g o r i t h m In [11], the authors proposed G M D L approach to generate more concise covering than M D L approach. G M D L approach allows the covering for blue cells to contain some white cells and then more regions can be merged into bigger regions. It is good to get shorter coverings for users, but the covering becomes not pure or not precise by including some white cells. M D L E is different from The G M D L approach. M D L E also allows non-blue cells in the covering, but the non-blue cells are in the exception part of the covering. So the user knows clearly about which cells are included in the covering and the covering is pure and precise. O n the other hand, the non-blue cells i n M D L E covering can be white cells or red cells. Therefore, if a node has 10 children cells, which contains 9 blue cells and 1 red cell, then the covering generated by G M D L - T r e e algorithm enumerates those 9 blue cells. M D L with exceptions ( M D L E ) generates a shorter covering by including that red cell as a hole in the covering. The covering is in the form of this 11 node except its red child cell. The length of this covering is 2. 2.3 Automatic subspace Clustering The problem of automatic subspace clustering has been studied in [1]. The au-thors proposed an algorithm, C L I Q U E , to find clusters embedded in subspaces of high dimensional data. C L I Q U E also applied MDL-based pruning to decide which subspaces are interesting. The cluster descriptions in the form of D N F expressions returned by C L I Q U E have been minimized for easy understanding. The work in [1] is related to our work. The difference is that we study the pure hierarchical data. 2.4 Concise Descriptions In [13], the authors formalized the M D L approach to generate the concise descrip-tions for subsets of structured sets. 
They argued that the MDL problem for simple hierarchies (1-dimensional hierarchies) can be solved in polynomial time, and they proposed an algorithm for this case. Then they proved that the MDL problem for multidimensional hierarchies is NP-Hard. Our problem differs from [13] in two important aspects.

First, the length of a description in [13] is defined as the number of nodes used in the description. For example, in a 2-dimensional hierarchical data cube, 'a' is a node in one dimension and '1' is a node in the other dimension. The description to cover (a, 1) in [13] is S1 = a · 1. The length of S1 is 2 because S1 contains two nodes. In the user's view, the important thing is how many data cells are included in the description. We need only use (a, 1) to express this data cell, and the length is 1. In this thesis the length of a description is defined as the number of axis-parallel rectangular regions, which are data cells or data regions in a data cube, contained in the description.

Second, product expressions are allowed in [13]. For example, in a 2-dimensional hierarchical data cube, {a, b, c, d} is from one dimension and d is the parent of {a, b, c}, while {1, 2, 3} is from the other dimension and 3 is the parent of {1, 2}. In [13], S = (a + b) × 2 is a valid expression to cover the two cells (a, 2) and (b, 2). The length of S in [13] is 3 because S uses three nodes, i.e. a, b and 2. But in the user's view, an expression including (a + b) is not pure hierarchical data any more, because (a + b) treats the data as spatial data again, which has been studied in [1]. In this thesis, we study pure hierarchical data and we do not allow product expressions in the descriptions.

Chapter 3

Complexity Analysis

In this chapter we give the formal definitions of MDL with exceptions for pure hierarchical data. We first study the 1-dimensional hierarchy and give an algorithm to obtain the optimal description in this case.
Then we prove that MDL with exceptions is an NP-Hard problem in a 2-dimensional hierarchical data cube.

3.1 One-Dimensional Hierarchy: A Tractable Case

3.1.1 Basic Definitions

Definition 3.1.1. An l-level hierarchy is of the form h = (h_{1,1}, h_{2,1}, ..., h_{2,n_2}, ..., h_{l,1}, ..., h_{l,n_l}) where:

• h_{1,1} is the root;
• for any 1 ≤ i ≤ l, n_i is the number of nodes in level i;
• for any 2 ≤ i ≤ l, h_{i,j} has one parent h_{i-1,c} in level i − 1.

Let us use Cld(h_{i,j}) to denote the children of h_{i,j} and Des(h_{i,j}) to denote all the descendants of h_{i,j}.

Definition 3.1.2. A node h_{i,j}, 1 ≤ i ≤ l, 1 ≤ j ≤ n_i, is a leaf iff Cld(h_{i,j}) = ∅.

For any leaf node h_{i,j} we have Cld(h_{i,j}) = Des(h_{i,j}) = ∅. Without loss of generality, let us assume that only the nodes in level l do not have children. This is also a very common hierarchical structure in many real applications. In fact, if a node in levels 1 to l − 1, for example h_{i,j}, has no children, we can add a new node as its child and repeat this adding procedure till all nodes in levels 1 to l − 1 have at least one child. The newly added nodes will not affect the hierarchical structure. Under this assumption, all and only the nodes in level l are leaves. Let us use Leaves(h) to denote the set of leaves of h; then

Leaves(h) = {h_{i,j} | 1 ≤ i ≤ l, 1 ≤ j ≤ n_i, Cld(h_{i,j}) = ∅} = {h_{l,j} | 1 ≤ j ≤ n_l}

The data in a data warehouse or an OLAP system is generally modelled in a multidimensional structure [3][6], commonly called a data cube. In this thesis we only study data cubes with a hierarchical structure in each dimension.

Definition 3.1.3. In a 1-dimensional hierarchical data cube with hierarchy h:

• a Data Cell is a leaf node, such as h_{l,k}, for some 1 ≤ k ≤ n_l;
• a Data Region is a non-leaf node, such as h_{i,j}, for some 1 ≤ i ≤ l − 1, 1 ≤ j ≤ n_i.

Definition 3.1.4. Given a set of data regions and data cells, A, we have the following definitions:

1. The length of A is defined as the number of regions and cells in A, that is |A| = |{h_{i,j} | h_{i,j} ∈ A}|

2.
Cell(A) is the set of all data cells covered by A:

Cell(A) = {h_{l,k} | h_{l,k} ∈ A or ∃ h_{i,j} ∈ A, h_{l,k} ∈ Des(h_{i,j})}

Now we define the operations between two sets of data regions and data cells. We use '+' to represent the union operation and '−' to represent the difference operation.

Definition 3.1.5. Given two sets of data regions and data cells, A, B, we have the following definitions:

1. A = B iff Cell(A) = Cell(B)
2. Cell(A + B) = Cell(A) + Cell(B) = {h_{l,k} | h_{l,k} ∈ Cell(A) ∪ Cell(B)}
3. Cell(A − B) = Cell(A) − Cell(B) = {h_{l,k} | h_{l,k} ∈ Cell(A), h_{l,k} ∉ Cell(B)}

Blue cells in a data cube are all those cells that satisfy the properties defined by users; we also call them interesting cells. All other data cells are called non-blue cells. When we are given an interesting set D, we always call the data cells in D blue cells and the data cells not in D non-blue cells. We use a property 'color' to represent whether a cell is blue. For example, h_{l,k}.color = blue means cell h_{l,k} is a blue cell.

Now let us discuss more about a data region h_{i,j}. We can cover the blue cells in h_{i,j} by:

1. enumerating all the blue cells, which is in the form of

S1 = {h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color = blue}

and the length of S1 is |S1| = |{h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color = blue}|; or

2. using h_{i,j} with exceptions of all non-blue cells in Cell(h_{i,j}), which is in the form of

S2 = h_{i,j} − {h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color ≠ blue}

and the length of S2 is |S2| = 1 + |{h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color ≠ blue}|

The difference between the length of S1 and the length of S2, i.e. |S1| − |S2|, is the number of regions and cells that can be saved in covering the blue cells by S2. We call this difference the benefit of h_{i,j}.

Definition 3.1.6. Given a data region h_{i,j}, 1 ≤ i ≤ l − 1, 1 ≤ j ≤ n_i, its benefit is defined as:

Benefit(h_{i,j}) = |{h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color = blue}| − (|{h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color ≠ blue}| + 1)
= |{h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color = blue}| − |{h_{l,k} | h_{l,k} ∈ Cell(h_{i,j}), h_{l,k}.color ≠ blue}| − 1

Example 3.1.1.
The example in Figure 3.1 is a 1-dimensional hierarchy with three levels. The root is h_{1,1} = Canada, which has three children, BC, Alberta and Ontario, in the second level. This hierarchy has ten leaves in the third level. The leaves are data cells, such as Vancouver, Victoria and Toronto. The non-leaf nodes are data regions, such as Canada and Alberta.

Given a set A = {Victoria, Alberta, Toronto}, the length of A is 3, and we have

Cell(A) = {Victoria, Edmonton, Calgary, Toronto}

If B = {Vancouver, Edmonton, Waterloo} then

Cell(A + B) = {Victoria, Vancouver, Edmonton, Calgary, Toronto, Waterloo}
Cell(A − B) = {Victoria, Calgary, Toronto}

Given a subset of leaves, D, with seven elements,

D = {Vancouver, Richmond, Victoria, Edmonton, Toronto, Waterloo, London}

we mark these seven cells as blue and the other three cells as non-blue. So BC has three children which are blue. In order to cover these three blue cells in BC, we can

• enumerate all blue cells as {Vancouver, Richmond, Victoria}, with length 3, or
• use a description with exceptions, {BC − Burnaby}, with length 2.

So we only need 2 regions/cells in {BC − Burnaby} to cover those 3 blue cells, and we can save 1 region/cell in the covering. Therefore Benefit(BC) = 3 − (1 + 1) = 1.

Figure 3.1: An Example for 1-Dimensional Hierarchy

3.1.2 Problem Statements

Definition 3.1.7. A description with exceptions for an interesting data set D in a hierarchical data cube h is in the form of S = U1 − U2, where U1 is a set of data regions and data cells, U2 is a set of data cells, and S covers exactly the data cells in D, i.e. Cell(S) = D.

From the definition of the description with exceptions in Definition 3.1.7, U2 can only contain data cells. A description for D is a covering of all data cells in D. The data cells in U2 are called holes in this covering of D. We only enumerate all the holes in a covering instead of doing any summarization on holes.
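The set operations and the benefit computation of Example 3.1.1 can be checked with a short Python sketch. This is illustrative only: the exact leaf sets of Figure 3.1 are an assumption reconstructed from the counts given in the example (ten leaves, three of BC's four children blue), and the function names are mine.

```python
# Hierarchy of Figure 3.1 as parent -> children; the leaf assignment is an
# assumption consistent with the counts in Example 3.1.1.
children = {
    "Canada":  ["BC", "Alberta", "Ontario"],
    "BC":      ["Vancouver", "Richmond", "Burnaby", "Victoria"],
    "Alberta": ["Edmonton", "Calgary"],
    "Ontario": ["Toronto", "Waterloo", "London", "Hamilton"],
}

def cell(nodes):
    """Cell(A): the set of leaf cells covered by a set of regions/cells."""
    out = set()
    for n in nodes:
        kids = children.get(n)
        out |= cell(kids) if kids else {n}
    return out

A = ["Victoria", "Alberta", "Toronto"]
B = ["Vancouver", "Edmonton", "Waterloo"]
print(cell(A) == {"Victoria", "Edmonton", "Calgary", "Toronto"})  # True
print(cell(A) - cell(B) == {"Victoria", "Calgary", "Toronto"})    # True

blue = {"Vancouver", "Richmond", "Victoria", "Edmonton",
        "Toronto", "Waterloo", "London"}

def benefit(region):
    """Definition 3.1.6: #blue - #non-blue - 1 over the region's cells."""
    cells = cell([region])
    return len(cells & blue) - len(cells - blue) - 1

print(benefit("BC"))  # 1
```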
Definition 3.1.8. For a description S = U1 − U2 for D in h, its length is defined as |S| = |U1| + |U2|, and its benefit is defined as Benefit(S) = |D| − |S|.

|D| is the number of interesting cells. The length of S is the number of regions and cells in S used to cover the blue cells in D. So the benefit of S is the number of regions and cells that are saved in covering the interesting cells D by the description S.

Let us use S_opt to denote the description with the minimum length amongst all the descriptions with exceptions for D, and we say that S_opt is an optimal description for D in h. For a given instance, |D| is constant, so based on Definition 3.1.8, S_opt has the maximum benefit too.

Definition 3.1.9 (MDL with Exceptions: MDLE). MDL with exceptions is to find an optimal description S_opt, which has the minimum length and the maximum benefit, for D in h.

Example 3.1.2. Let us still use the example in Figure 3.1. The interesting data set D is

D = {Vancouver, Richmond, Victoria, Edmonton, Toronto, Waterloo, London}

In order to cover D, we can use a description such as

S1 = {(BC − Burnaby) + Edmonton + (Ontario − Hamilton)} = {BC + Edmonton + Ontario} − {Burnaby + Hamilton}

The length of S1 is |S1| = 3 + 2 = 5, so Benefit(S1) = |D| − |S1| = 7 − 5 = 2. Or we can use another description,

S2 = Canada − {Burnaby + Calgary + Hamilton}

The length of S2 is |S2| = 1 + 3 = 4, so Benefit(S2) = |D| − |S2| = 7 − 4 = 3. S2 is the description with the minimum length for D in h.

3.1.3 Algorithm for MDL with Exceptions

Before we go to the algorithm for MDL with exceptions, let us see how to compute the benefit of a region based on its children's benefits. Any data region h_{i,j}, for some 1 ≤ i ≤ l − 2, is a node in levels 1 to l − 2. So h_{i,j}'s children are in some level from 2 to l − 1, and none of its children are leaves.
From Definition 3.1.6 we know

Benefit(hij) = |{hlk | hlk ∈ Cell(hij), hlk.color = blue}|
             − |{hlk | hlk ∈ Cell(hij), hlk.color ≠ blue}| − 1
           = Σ_{h(i+1,c) ∈ Cld(hij)} |{hlk | hlk ∈ Cell(h(i+1,c)), hlk.color = blue}|
             − Σ_{h(i+1,c) ∈ Cld(hij)} |{hlk | hlk ∈ Cell(h(i+1,c)), hlk.color ≠ blue}| − 1
           = Σ_{h(i+1,c) ∈ Cld(hij)} (|{hlk | hlk ∈ Cell(h(i+1,c)), hlk.color = blue}|
             − |{hlk | hlk ∈ Cell(h(i+1,c)), hlk.color ≠ blue}| − 1) + |Cld(hij)| − 1
           = |Cld(hij)| − 1 + Σ_{h(i+1,c) ∈ Cld(hij)} Benefit(h(i+1,c))     (3.1)

From Equation 3.1 we can see that the benefit of a node in levels 1 to l − 2 can be computed from its children's benefits. The number of hij's children is |Cld(hij)|. After we aggregate its children regions to hij, all those children are covered by hij, so the newly gained benefit is |Cld(hij)| − 1. The sum of this newly gained benefit and all benefits from the children regions is the benefit of hij, which is exactly what Equation 3.1 expresses. From Definition 3.1.6 and Equation 3.1 we have the following bottom-up algorithm for the MDLE problem in a 1-dimensional hierarchy.

Algorithm 3.1.1. Bottom-Up Algorithm for MDL with exceptions in a 1-dimensional hierarchy.
Input: a hierarchy h and a data set D;
Output: the optimal description for D;

1. Initialize two stacks T and V;
   /* T is the stack of descriptions and V is the stack of benefits of descriptions associated with nodes. We use T(hij) to denote the description and V(hij) to denote the benefit of the description associated with node hij. */
2. Traverse the hierarchy h in post-order;
   (a) For each node h(l−1,j), 1 ≤ j ≤ n(l−1):
       i. Compute the benefit based on Definition 3.1.6 and push Benefit(h(l−1,j)) onto stack V;
       ii. If Benefit(h(l−1,j)) > 0,
              push S(l−1,j) = h(l−1,j) − {hlk | hlk ∈ Cell(h(l−1,j)), hlk.color ≠ blue} onto stack T;
           Else
              push S(l−1,j) = {hlk | hlk ∈ Cell(h(l−1,j)), hlk.color = blue} onto stack T;
           End If
   (b) For each node hij, 1 ≤ i ≤ l − 2, 1 ≤ j ≤ ni:
       i. Compute the benefit by Equation 3.1:
          Benefit(hij) = |Cld(hij)| − 1 + Σ_{h(i+1,c) ∈ Cld(hij)} Benefit(h(i+1,c))
       ii.
If Benefit(hij) > Σ_{h(i+1,c) ∈ Cld(hij), V(h(i+1,c)) > 0} V(h(i+1,c)),
              push Benefit(hij) onto stack V;
              push Sij = hij − {hlk | hlk ∈ Cell(hij), hlk.color ≠ blue} onto stack T;
           Else
              push Σ_{h(i+1,c) ∈ Cld(hij), V(h(i+1,c)) > 0} V(h(i+1,c)) onto stack V;
              push Sij = ∪_{h(i+1,c) ∈ Cld(hij)} T(h(i+1,c)) onto stack T;
           End If
       iii. Pop hij's children's descriptions and values from T and V;
3. Return T(h11). •

Example 3.1.3. The example in Figure 3.2 is a 4-level hierarchy. All blue cells are marked by ○. The following table shows the procedure that builds the optimal description.

region   benefit                   value in V   description in T
s        3 − 1 − 1 = 1             1            s − d
t        1 − 1 − 1 = −1            −1           e
u        3 − 1 − 1 = 1             1            u − j
x        1 + (−1) + 1 + 3 − 1 = 3  3            x − d − f − j
v        3 − 1 − 1 = 1             1            v − n
w        1 − 3 − 1 = −3            −3           o
y        1 + (−3) + 2 − 1 = −1     1            (v − n) + o
z        3 + (−1) + 2 − 1 = 3      4            (x − d − f − j) + (v − n) + o

Figure 3.2: Example for Bottom-Up Algorithm

• node s: Benefit(s) = 1; we can use (s − d) to cover the three blue cells {a, b, c}. So T(s) = s − d and V(s) = 1;
• node t: Benefit(t) = −1; if we use (t − f) to cover the blue cell {e}, the length is 2; if we just use e, the length is 1. So T(t) = e;
• node x: Benefit(x) = 3; this is the benefit of the description S1 = x − d − f − j. The sum of its children's positive values in V is 2, which is the benefit of the description S2 = (s − d) + e + (u − j). Therefore we gain more benefit with S1, which aggregates the children regions to x. It is clear that the optimal description for x is (x − d − f − j);
• node y: Benefit(y) = −1, and the sum of its children's positive values in V is 1. This means we gain more benefit if we just use the optimal descriptions of its children to express y instead of aggregating the children regions to y. So the optimal description for y is the union of its children's optimal descriptions, that is, (v − n) + o;
• node z.
Benefit(z) = 3, and the sum of its children's positive values in V is 4, so the optimal description for z is the union of its children's optimal descriptions, that is, (x − d − f − j) + (v − n) + o. This is also the optimal description for z with respect to the set of blue cells in Figure 3.2. •

Example 3.1.3 indicates that the bottom-up algorithm in Algorithm 3.1.1 generates an optimal description for each non-leaf node. We have the following theorem about the optimality of Algorithm 3.1.1.

Theorem 3.1.1. Algorithm 3.1.1 generates an optimal description for data set D. •

Proof. We show that the description generated in Algorithm 3.1.1 is optimal for each node.

• For a node in level l − 1, i.e. h(l−1,j), 1 ≤ j ≤ n(l−1): h(l−1,j)'s children are all leaves. We can use two methods to cover the blue cells in Cell(h(l−1,j)):
1. enumerate the blue cells, from which we get no benefit;
2. use h(l−1,j) with the exception of all non-blue cells, which has the form h(l−1,j) − {hlk | hlk ∈ Cell(h(l−1,j)), hlk.color ≠ blue} and benefit Benefit(h(l−1,j)).
Step 2(a) of Algorithm 3.1.1 stores the description with more benefit. The description pushed onto T is the optimal description for h(l−1,j).

• For a node in levels 1 to l − 2, i.e. hij, 1 ≤ i ≤ l − 2, 1 ≤ j ≤ ni: we have already generated optimal descriptions for its children. We can likewise use two methods to cover the blue cells in Cell(hij): one includes hij in the description, the other does not.
1. Including hij: we only permit data cells in U2 (Definition 3.1.7), so the description is S = hij − {hlk | hlk ∈ Cell(hij), hlk.color ≠ blue} and the benefit is Benefit(hij).
2. Not including hij: we do not aggregate the children regions to hij. From Definition 3.1.1 we know the hierarchy is a tree structure, and any two nodes in the same level have no overlapping children sets. So we can deal with hij's children regions independently and use the union of the children's descriptions to express hij.
The description in stack T is the optimal description for each child. S = ∪_{h(i+1,c) ∈ Cld(hij)} T(h(i+1,c)) is the description for this case and has benefit Σ_{h(i+1,c) ∈ Cld(hij), V(h(i+1,c)) > 0} V(h(i+1,c)).
Step 2(b) of Algorithm 3.1.1 stores the description with the greater benefit between case 1 and case 2. The description pushed onto T is the optimal description for hij.

Therefore, Algorithm 3.1.1 creates an optimal description for each node, and the output for h11 is an optimal description for D in h. •

Theorem 3.1.2. The running time of MDL with exceptions for a 1-dimensional hierarchy h is O(|h|). •

Proof. In Algorithm 3.1.1, for each node in level l − 1 we need to find its children to compute the benefit, so all leaves are visited once. Each node in levels l − 1 to 1 has two 'push' and 'pop' operations on stacks T and V. Therefore Algorithm 3.1.1 runs in O(|h|) time. This means the time complexity of MDLE is linear in the number of nodes in h. •

3.2 NP-Completeness for a 2-Dimensional Hierarchy

We have seen that MDLE in a 1-dimensional hierarchy is tractable, and we proposed an algorithm to get a description with the minimum length. In this section we discuss the case where the data cube has a 2-dimensional hierarchical structure.

3.2.1 Basic Definitions

We have given the definition of a 1-dimensional hierarchy in Section 3.1.1. The related concepts for a 2-dimensional hierarchy are defined similarly in this section.

Definition 3.2.1. A 2-dimensional hierarchical data cube is of the form H = <H1, H2>, where both H1 and H2 are hierarchies as defined in Definition 3.1.1. •

Definition 3.2.2. In a 2-dimensional hierarchical data cube H = <H1, H2>, for some x1 ∈ H1, x2 ∈ H2:
• (x1, x2) is a data cell iff x1 ∈ Leaves(H1) and x2 ∈ Leaves(H2);
• (x1, x2) is a data region iff x1 ∉ Leaves(H1) or x2 ∉ Leaves(H2). •

Definition 3.2.3.
Given a data region (x1, x2) in H = <H1, H2>, Cell(x1, x2) is the set of all data cells covered by (x1, x2):

Cell(x1, x2) = {(x1', x2') | x1' ∈ Leaves(H1), x2' ∈ Leaves(H2), x1' ∈ Des(x1), x2' ∈ Des(x2)} •

In order to prove the NP-Completeness of MDLE for a 2-dimensional hierarchy, we only consider the 2-level hierarchy. We can represent a data cube with a 2-dimensional 2-level hierarchical structure by a matrix, calling one dimension rows and the other columns. We define the 2-dimensional 2-level hierarchical data cube in a simpler way in Definition 3.2.4.

Definition 3.2.4. Given a 2-dimensional 2-level hierarchical data cube H = R × C, R = {r1, r2, ..., rnr}, where r0 is the parent of ri, 1 ≤ i ≤ nr, and C = {c1, c2, ..., cnc}, where c0 is the parent of cj, 1 ≤ j ≤ nc. For some ri ∈ R, cj ∈ C:
• dij = (ri, cj) is a data cell if ri ∈ R, cj ∈ C;
• dij = (ri, cj) is a data region if ri = r0 or cj = c0. •

In a 2-dimensional 2-level hierarchical data cube, only r0 and c0 are not leaves. All nodes in R and C are leaves. The nodes in R are called rows; the nodes in C are called columns. The benefit of a row in H = R × C is defined similarly to Definition 3.1.6.

Definition 3.2.5. A row region ri ∈ R is defined as ri = (ri, c0) = {(ri, cj) | cj ∈ C}, and the benefit of ri is:

Benefit(ri) = |{dij | dij ∈ Cell(ri, c0), dij.color = blue}| − |{dij | dij ∈ Cell(ri, c0), dij.color ≠ blue}| − 1
            = |{dij | dij.color = blue, 1 ≤ j ≤ nc}| − |{dij | dij.color ≠ blue, 1 ≤ j ≤ nc}| − 1 •

The benefit of a column is defined similarly to the benefit of a row in Definition 3.2.5.

Example 3.2.1. The example in Figure 3.3 is a 7 × 7 2-dimensional 2-level hierarchical data cube. The hierarchical structure is shown in Figure 3.3. The data cells marked with '○' are blue cells. This data cube has 49 data cells, and the interesting data set D has 28 cells. The rows are R = {a, b, c, d, e, f, g}. The columns are C = {1, 2, 3, 4, 5, 6, 7}. r0 is 'i', which is the parent of all nodes in R.
c0 is '8', which is the parent of all nodes in C. (a, 2) is a data cell; (i, 2) and (b, 8) are data regions. (i, 1) contains 5 blue cells and 2 non-blue cells, so Benefit(i, 1) = 5 − 2 − 1 = 2. •

Figure 3.3: An Example for 2-dimensional 2-level hierarchy

3.2.2 Problem Statements

Problem 1: MDL with Exceptions

The description for a data set D in a 2-dimensional 2-level hierarchical data cube is defined similarly to the description in a 1-dimensional hierarchy given in Definition 3.1.7. The only difference is that the data cells and data regions have the forms defined in Definition 3.2.4.

Definition 3.2.6. Given a 2-dimensional 2-level hierarchical data cube H and an interesting data set D, a description S for D has the form

S = U1 − U2 = ∪ (ri, cj) − ∪ (rk, cl),   0 ≤ i ≤ nr, 0 ≤ j ≤ nc, 1 ≤ k ≤ nr, 1 ≤ l ≤ nc

and S covers exactly the data cells in D, i.e. Cell(S) = D. •

The length and benefit of a description are defined as in Definition 3.1.8. The problem MDLE, MDL with exceptions, is defined in the same way as Definition 3.1.9.

Example 3.2.2. Let us still use the example in Figure 3.3. First we compute the benefits of each row and column, shown in the following table.

rows                    length   benefit      columns                              length   benefit
(f,8), (g,8)            3        2            (i,1)                                3        2
(c,8), (d,8), (e,8)     4        0            (i,2), (i,3), (i,4), (i,6), (i,7)    4        0
(a,8), (b,8)            5        −2           (i,5)                                5        −2

The optimal description with the minimum length is:

Sopt = {(c,8) + (d,8) + (e,8) + (i,2) + (i,3) + (i,4)}
     − {(c,2) + (c,3) + (c,4) + (d,2) + (d,3) + (d,4) + (e,2) + (e,3) + (e,4)}
     + (f,1) + (g,1) + (f,6) + (g,7)

The length of Sopt is 19.
So the benefit of Sopt is

Benefit(Sopt) = |D| − |Sopt| = 28 − 19 = 9

Sopt is the description with the minimum length and the maximum benefit. •

From this example we can see that (f,8) and (g,8) have the largest benefits among all rows and columns, but they are not in the optimal description. Even though rows (c,8), (d,8), (e,8) and columns (i,2), (i,3), (i,4) do not have positive benefit (their benefits are 0), we can gain more benefit by selecting all of them. Therefore a row/column with positive benefit may not be included in the optimal description, and a row/column with non-positive benefit may be included in the optimal description.

In a 1-dimensional hierarchy, when we generate the description for a region from its children regions, we either aggregate all its children (Step 2(a) in Algorithm 3.1.1) or take the union of the optimal descriptions of the children regions with positive benefits (Step 2(b) in Algorithm 3.1.1). Either way preserves the optimality established in the children regions. But in a 2-dimensional 2-level hierarchical data cube, the intersections between rows and columns make it impossible to preserve the optimality from lower-level regions. This is what makes the MDLE problem NP-Complete in a 2-dimensional hierarchical data cube. We show the proof in the next section.

3.2.3 NP-Completeness Proof Part I: CEW Bipartite Graph

In this part we show the proof that the MDLE problem is NP-Complete. We first prove that it is NP-Complete to find a maximum induced subgraph in a complete edge-weighted (CEW) bipartite graph, which has 2-value weights on its edges. We prove the NP-Completeness of CEW from a known NP-Complete problem, CLIQUE. Then we prove the MDLE problem is NP-Complete from CEW.

Problem 2: CEW Bipartite Graph

Definition 3.2.7.
The complete edge-weighted (CEW) bipartite graph with 2-value weights is of the form Ge = (V, E), V = V1 ∪ V2, where for any vi ∈ V1, vj ∈ V2 there is an edge e = (vi, vj) ∈ E; the weights on edges take two values, i.e. wt(e) ∈ {w1, w2}, and at least one of w1, w2 is negative. •

Figure 3.4: Complete Edge-Weighted Bipartite Graph

Weights on edges:
     d    e    f    g
a   −1    1   −1    1
b   −1    1   −1   −1
c   −1   −1    1   −1

Definition 3.2.8. For any subgraph Ge' = (V', E') of Ge induced by V' ⊆ V, the weight of Ge' is defined as: wt(Ge') = Σ_{e ∈ E'} wt(e). •

Definition 3.2.9. The maximum induced subgraph problem in a CEW bipartite graph is to find a subgraph Ge' = (V', E') induced by V' ⊆ V such that wt(Ge') + |V'| is maximized. •

Example 3.2.3. Figure 3.4 is an example of a CEW bipartite graph: Ge = (V, E), V = V1 ∪ V2, V1 = {a, b, c}, V2 = {d, e, f, g}, with the edge weights shown in the figure. If V' = {a, b, e}, then the weight of the induced subgraph Ge' is wt(Ge') = 2, and wt(Ge') + |V'| = 2 + 3 = 5. If V'' = {a, b, e, g}, then the weight of the induced subgraph Ge'' is wt(Ge'') = 2, and wt(Ge'') + |V''| = 2 + 4 = 6. •

After checking all the induced subgraphs of Ge, we find that Ge'' is the maximum induced subgraph of Ge. In fact, because of the interactions between the two edge-weight values and the cardinality of the vertex set, it is hard to find a maximum induced subgraph in a CEW bipartite graph.

Now we prove that the induced subgraph problem in a CEW bipartite graph (Definition 3.2.9), i.e. finding a subgraph Ge' induced by V' in Ge such that wt(Ge') + |V'| ≥ K, is an NP-Complete problem. To prove this we make a reduction, inspired by the reduction in the proof of the BEB problem in [7][12], from the clique problem, which is a well-known NP-Complete problem.

Theorem 3.2.1. Clique is a known NP-Complete problem [16][5].
Instance: Graph G = (V, E), positive integer K ≤ |V|.
Question: Does G contain a clique of size K or more, i.e., a subset V' ⊆ V with |V'| ≥ K such that every two vertices in V' are joined by an edge in E? •

Lemma 3.2.1. The induced subgraph problem in a CEW bipartite graph is an NP-Complete problem.
Instance: A complete edge-weighted (CEW) bipartite graph Ge = (V1, V2, E), wt(e) ∈ {w1, w2}, with at least one of w1, w2 negative; an integer K.
Question: Does Ge contain an induced subgraph Ge' = (V', E') such that |V'| + wt(Ge') ≥ K? •

Proof. The reduction is from the clique problem. Given a graph G = (V, E) and a positive integer K ≤ |V|, construct a complete bipartite graph Ge = (V, V', E') with V' = V and the set of edges E' = {(v, u) | v ∈ V, u ∈ V'}. The weight function wt is defined as follows. For any e = (v, u) ∈ E':

wt(e) = 0 if v = u;
wt(e) = 0 if (v, u) ∈ E;
wt(e) = −1 if (v, u) ∉ E.

We now show that G contains a clique of size K or more if and only if Ge contains an induced subgraph Ge' = (V'', E'') such that |V''| + wt(Ge') ≥ 2K.

First, suppose Ge contains an induced subgraph Ge' = (V'', E'') = (Vb, Vb, E'') such that |V''| + wt(Ge') ≥ 2K. (From the construction of Ge we know the two vertex sets V and V' have the same properties, so if any v ∈ V is contained in Ge', the corresponding vertex v' ∈ V' is included in Ge' too; that is why V'' = Vb ∪ Vb.) The weights on edges in Ge' fall into two cases:

1. ∀e ∈ E'', wt(e) = 0:
The subgraph induced by Vb in G is a clique: because no edge in Ge' has weight −1, all nodes in Vb are connected in G by the definition of the weight function wt(e). |V''| + wt(Ge') = 2|Vb| ≥ 2K, so the clique induced by Vb in G has size K or more.

2. ∃e ∈ E'', wt(e) = −1:
If e' = (v', u') ∈ E'' with wt(e') = −1, then v' and u' are not connected in G. Let us delete v' from Vb and the three edges (v', v'), (v', u'), (u', v') from E''; wt(v', v') = 0 and wt(v', u') = wt(u', v') = −1. At the same time, we delete all other edges incident on v', and those edges have weight 0 or −1. Therefore the total weight of all deleted edges is −2 or less.
So the remaining biclique still satisfies |V''| + wt(E'') ≥ (|V''| − 2) + (wt(E'') − (−2)) ≥ 2K. After we delete all the edges with weight −1, the remaining edges all have weight 0, and we are back to case 1. Therefore, if we can find an induced subgraph Ge' = (V'', E'') in Ge such that |V''| + wt(Ge') ≥ 2K, we can also find a clique in G of size K or more.

On the other hand, suppose G contains a clique of size K or more. Let Vc denote the set of vertices of this clique, so |Vc| ≥ K. The induced subgraph in Ge is Ge' = (V', E') = (Vc, Vc', E'), Vc' = Vc, E' = {(v, u) | v ∈ Vc, u ∈ Vc'}. Because Vc induces a clique in G, any two nodes in Vc are connected, so all edges in E' have weight 0. So for Ge' we have

|V'| + wt(Ge') = 2 × |Vc| + Σ_{v ∈ Vc, u ∈ Vc', e=(v,u) ∈ E'} wt(e)
             = 2 × |Vc| + Σ_{v ∈ Vc, u ∈ Vc', e=(v,u) ∈ E'} (0)
             = 2 × |Vc| ≥ 2K

Clique is an NP-Complete problem, so it is NP-Hard to find a subgraph Ge' induced by V'' in Ge such that |V''| + wt(Ge') ≥ K. Given an induced subgraph Ge' = (V'', E'') in Ge, there exists a polynomial-time algorithm that checks whether |V''| + wt(Ge') ≥ 2K, so this problem is in NP. Therefore the problem is NP-Complete. •

Example 3.2.4. An example of the proof of Lemma 3.2.1 is given in Figure 3.5. Graph G = (V, E), V = {1, 2, 3, 4}, E = {(1,2), (1,4), (2,4), (2,3), (3,4)}. We construct a complete bipartite graph Ge = (V, V, E'), where only edges (1,3) and (3,1) have weight −1 and all other edges have weight 0.

Figure 3.5: Reduction From Clique (edges (3,1) and (1,3) have weight −1; all other edges have weight 0)

In G, vertices {1, 2, 4} induce a clique of size 3. Then in graph Ge, the subgraph Ge' induced by V' = {1, 1, 2, 2, 4, 4} has six vertices and all edges have weight 0. So for Ge', |V'| + wt(Ge') = 6 = 2 × 3.
On the other hand, if we find an induced subgraph Ge' = (V', E') in Ge such that |V'| + wt(Ge') ≥ 2K, there are two cases:

• ∀e ∈ E', wt(e) = 0. For example, Ge' is induced by V' = {2, 3, 4, 2, 3, 4}: |V'| = 6, and all edges in E' have weight 0, so wt(Ge') = 0 and |V'| + wt(Ge') = 6. This means that in G there exists a clique of size 3, containing vertices {2, 3, 4}.

• ∃e ∈ E', wt(e) = −1. For example, Ge' contains all vertices of Ge: |V'| + wt(Ge') = 8 + (−2) = 6. The weights of edge (1,3) and edge (3,1) are −1. Now let us delete vertices {3, 3} from Ge', along with all the edges incident on {3, 3}, which are (3,3), (3,1), (3,2), (3,4) and (1,3), (2,3), (4,3). The total weight of the deleted edges is −2. The remaining subgraph is Ge'' = (V'', E''), with V'' = V' − {3, 3} = {1, 2, 4, 1, 2, 4} and wt(Ge'') = wt(Ge') − (−2) = −2 − (−2) = 0, so |V''| + wt(Ge'') = 6 + 0 = 6. Now Ge'' contains only {1, 2, 4, 1, 2, 4}, so {1, 2, 4} induces a clique of size 3 in graph G. •

3.2.4 NP-Completeness Proof Part II: MDLE

Before showing the proof of the NP-Completeness of MDLE, let us first discuss the form of a description for a data cube. From Definition 3.2.6 we know a description for the data set D in a 2-dimensional 2-level hierarchical data cube H has the form:

S = U1 − U2 = ∪ (ri, cj) − ∪ (rk, cl),   0 ≤ i ≤ nr, 0 ≤ j ≤ nc, 1 ≤ k ≤ nr, 1 ≤ l ≤ nc

Based on the data regions in U1, descriptions can be classified into two cases:

1. U1 = {(r0, c0)} and S = (r0, c0) − {dij | dij ∈ Cell(r0, c0), dij.color ≠ blue}.
The whole region (r0, c0), with the exception of all the non-blue cells, gives exactly the blue cells in D. For a given instance it is straightforward to obtain this description, and its benefit is constant:

|S| = 1 + |{dij | dij ∈ Cell(r0, c0), dij.color ≠ blue}|
Benefit(S) = |D| − |S| = |{dij | dij ∈ Cell(r0, c0), dij.color = blue}| − |{dij | dij ∈ Cell(r0, c0), dij.color ≠ blue}| − 1

2.
U1 = {(ri, cj) | 1 ≤ i ≤ nr, 1 ≤ j ≤ nc} and the description has the form

S = U1 − U2
  = ∪_{ri ∈ R1 ⊆ R} ri + ∪_{cj ∈ C1 ⊆ C} cj
    − {dij | dij ∈ Cell(R1) ∪ Cell(C1), dij.color ≠ blue}
    + {dij | dij ∉ Cell(R1) ∪ Cell(C1), dij.color = blue}     (3.2)

S has three parts. First, S contains all the rows in R1 and columns in C1. Second, S excepts the non-blue cells contained in R1 and C1; now S covers all and only the blue cells contained in R1 and C1. Finally, by adding the blue cells not contained in R1 and C1, S covers all the blue cells in D. The intersections between rows and columns make the selection of rows and columns complicated (as shown in Example 3.2.2). For a given instance it is hard to find a description of this form with the most benefit directly. In the remainder of Section 3.2.4 we only consider descriptions of the form of Equation 3.2.

Based on Lemma 3.2.1, we prove that the MDLE problem is NP-Hard by a reduction from the CEW subgraph problem.

Theorem 3.2.2. The following problem is NP-Complete:
Instance: A data cube H = R × C, an interesting data set D, and a positive integer K.
Question: Does there exist a description S of the form

S = ∪ ri + ∪ cj − {dij | dij ∈ Cell(R1) ∪ Cell(C1), dij.color ≠ blue} + {dij | dij ∉ Cell(R1) ∪ Cell(C1), dij.color = blue}

which has benefit K or more? •

Proof. Given an instance of the complete edge-weighted (CEW) bipartite graph problem, Ge = (V, E), V = V1 ∪ V2, e ∈ E, wt(e) = ±1, and a positive integer K', we know from Lemma 3.2.1 that it is NP-Complete to find a subgraph Ge' = (V', E') induced by V' in Ge such that |V'| + wt(Ge') ≥ K'. We construct a data cube H' based on the graph Ge:

1. H' = V1 × V2; H' is a |V1| × |V2| data cube: it has |V1| rows and |V2| columns;
2. for e = (vi, vj), vi ∈ V1, vj ∈ V2, the weight function wt(e) determines the coloring: if wt(e) = 1, mark dij = (vi, vj) in H' as a non-blue cell; if wt(e) = −1, mark dij = (vi, vj) in H' as a blue cell.
From the weight function wt(e) we know that a blue cell in H' corresponds to an edge with weight −1 in Ge, and a non-blue cell in H' corresponds to an edge with weight 1 in Ge. For a row/column v ∈ V1 ∪ V2 of the data cube H', the benefit of v can be computed as follows:

Benefit(v) = |{dij | dij ∈ Cell(v), dij.color = blue}| − |{dij | dij ∈ Cell(v), dij.color ≠ blue}| − 1
           = |{e = (v, u) | e ∈ E, wt(e) = −1}| − |{e = (v, u) | e ∈ E, wt(e) = 1}| − 1
           = Σ_{e=(v,u) ∈ E, wt(e)=−1} {−wt(e)} − Σ_{e=(v,u) ∈ E, wt(e)=1} {wt(e)} − 1
           = −1 − Σ_{e=(v,u) ∈ E} wt(e)     (3.3)

For any vi ∈ V1, vj ∈ V2, the description with exceptions that covers the blue cells in vi and vj is

Sij = vi + vj − {dkl | dkl ∈ Cell(vi) ∪ Cell(vj), dkl.color ≠ blue}

The total benefit of vi & vj is the number of blue cells in Cell(vi) ∪ Cell(vj) minus the length of Sij:

Benefit(vi & vj) = |{dkl | dkl ∈ Cell(vi) ∪ Cell(vj), dkl.color = blue}|
                 − (|{dkl | dkl ∈ Cell(vi) ∪ Cell(vj), dkl.color ≠ blue}| + 2)
               = (|{dil | dil ∈ Cell(vi), dil.color = blue}| − |{dil | dil ∈ Cell(vi), dil.color ≠ blue}| − 1)
                 + (|{dkj | dkj ∈ Cell(vj), dkj.color = blue}| − |{dkj | dkj ∈ Cell(vj), dkj.color ≠ blue}| − 1)
                 + { −1 if dij ∈ Cell(vi) ∩ Cell(vj), dij.color = blue
                      1 if dij ∈ Cell(vi) ∩ Cell(vj), dij.color ≠ blue }
               = Benefit(vi) + Benefit(vj) + { −1 if dij.color = blue
                                                1 if dij.color ≠ blue }     (3.4)

From Equation 3.4 we know that, given a row vi ∈ V1 and a column vj ∈ V2, the intersecting cell dij is counted twice in Benefit(vi) + Benefit(vj). So the total benefit of vi and vj can be computed in two ways based on the color of dij:

• if dij.color = blue, then dij appears twice in the positive part of Benefit(vi) and Benefit(vj), so the total benefit is Benefit(vi) + Benefit(vj) − 1;
• if dij.color ≠ blue, then dij appears twice in the negative part of Benefit(vi) and Benefit(vj), so the total benefit is Benefit(vi) + Benefit(vj) + 1.
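The ±1 correction of Equation 3.4 can be checked mechanically on a tiny cube. The sketch below is our own illustration (not code from the thesis): it assumes a 2-dimensional 2-level cube stored as a boolean matrix with True for blue cells, computes row/column benefits per Definition 3.2.5, and verifies the shared-cell rule for every (row, column) pair.

```python
# Hypothetical 2x3 cube: True = blue cell, False = non-blue cell.
blue = [
    [True,  True,  False],   # row 0
    [False, True,  True],    # row 1
]

def row_benefit(i):
    b = sum(blue[i])                    # blue cells in row i
    h = len(blue[i]) - b                # non-blue cells (holes)
    return b - h - 1                    # Definition 3.2.5

def col_benefit(j):
    b = sum(row[j] for row in blue)
    h = len(blue) - b
    return b - h - 1

def pair_benefit(i, j):
    """Benefit of covering row i and column j together: blue cells in the
    union minus (holes in the union + 2 regions)."""
    cells = ({(i, c) for c in range(len(blue[i]))}
             | {(r, j) for r in range(len(blue))})
    b = sum(blue[r][c] for r, c in cells)
    h = len(cells) - b
    return b - (h + 2)

# Equation 3.4: the shared cell contributes -1 if blue, +1 if non-blue.
for i in range(len(blue)):
    for j in range(len(blue[0])):
        correction = -1 if blue[i][j] else 1
        assert pair_benefit(i, j) == row_benefit(i) + col_benefit(j) + correction
```

Running the loop confirms that the double-counted intersection is exactly what separates Benefit(vi & vj) from Benefit(vi) + Benefit(vj).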
Given a description S' generated from R1', C1', where R1' ⊆ V1 and C1' ⊆ V2, S' has the form

S' = ∪_{ri ∈ R1'} ri + ∪_{cj ∈ C1'} cj − {dij | dij ∈ Cell(R1') ∪ Cell(C1'), dij.color ≠ blue} + {dij | dij ∉ Cell(R1') ∪ Cell(C1'), dij.color = blue}

The benefit of S' comes from using the rows in R1' and columns in C1' to cover the blue cells in those rows and columns. So based on Equation 3.3, Equation 3.4, and the construction of H', the benefit of S' can be rewritten as follows (let V' = R1' ∪ C1', E' = {(v, u) | v, u ∈ V'}, V'' = V − V', E'' = {(v, u) | v, u ∈ V''}):

Benefit(S') = Σ_{ri ∈ R1'} Benefit(ri) + Σ_{cj ∈ C1'} Benefit(cj)
              + Σ_{dij ∈ Cell(ri) ∩ Cell(cj), dij.color = blue} (−1) + Σ_{dij ∈ Cell(ri) ∩ Cell(cj), dij.color ≠ blue} (1)
           = Σ_{v ∈ V'} Benefit(v) + Σ_{v,u ∈ V', e=(v,u), wt(e)=−1} wt(e) + Σ_{v,u ∈ V', e=(v,u), wt(e)=1} wt(e)
           = Σ_{v ∈ V'} (−1 − Σ_{e=(v,u) ∈ E} wt(e)) + Σ_{e=(v,u) ∈ E'} wt(e)
           = −|V'| − {2 Σ_{e=(v,u) ∈ E'} wt(e) + Σ_{v ∈ V', u ∈ V−V', e=(v,u) ∈ E} wt(e)} + Σ_{e=(v,u) ∈ E'} wt(e)
           = −|V'| − {Σ_{e=(v,u) ∈ E'} wt(e) + Σ_{v ∈ V', u ∈ V−V', e=(v,u) ∈ E} wt(e)}
           = −|V'| − Σ_{v ∈ V', u ∈ V, e=(v,u) ∈ E} wt(e)
           = −(|V| − |V''|) − {Σ_{v,u ∈ V, e=(v,u) ∈ E} wt(e) − Σ_{v,u ∈ V'', e=(v,u) ∈ E} wt(e)}
           = −{|V| + Σ_{v,u ∈ V, e=(v,u) ∈ E} wt(e)} + |V''| + Σ_{v,u ∈ V'', e=(v,u) ∈ E''} wt(e)     (3.5)

From Definition 3.2.8, the total weight of Ge = (V, E) is:

wt(Ge) = Σ_{e=(v,u), e ∈ E} wt(e)     (3.6)

The total weight of a subgraph Ge'' = (V'', E'') of Ge is:

wt(Ge'') = Σ_{e=(v,u), e ∈ E''} wt(e)     (3.7)

Applying Equation 3.6 and Equation 3.7 to Equation 3.5, we easily get:

Benefit(S') = −{|V| + wt(Ge)} + {|V''| + wt(Ge'')}     (3.8)

Therefore if we can find R1', C1' that generate a description with benefit K'' = −{|V| + wt(Ge)} + K' or more for D, then we can easily build the subgraph Ge'' = (V'', E'') of Ge induced by V'' = V − (R1' ∪ C1') such that |V''| + wt(Ge'') ≥ K'.
On the other hand, if we can find a subgraph Ge'' = (V'', E'') of Ge induced by V'' = V − (R1' ∪ C1') such that |V''| + wt(Ge'') ≥ K', then from Equation 3.8 we can build a description based on R1', C1' that has benefit K'' or more, of the form:

S = ∪_{ri ∈ R1'} ri + ∪_{cj ∈ C1'} cj − {dij | dij ∈ Cell(R1') ∪ Cell(C1'), dij.color ≠ blue} + {dij | dij ∉ Cell(R1') ∪ Cell(C1'), dij.color = blue}

Because it is NP-Complete to find an induced subgraph Ge'' of Ge such that |V''| + wt(Ge'') ≥ K', it is NP-Hard to find a description with benefit K or more for the data set D in data cube H.

Now we show that the MDLE problem is in NP. Given a data cube H = R × C, an interesting data set D, and a positive integer K, for any description S for D the algorithm first checks whether the description covers all the blue cells and does not cover any non-blue cell. Then the benefit of S is computed, and the algorithm checks whether the benefit is greater than K. This can be performed in polynomial time, so this problem is also in NP. Therefore this is an NP-Complete problem. •

Example 3.2.5. An example is given in Figure 3.6 and Figure 3.7. In Figure 3.6, the complete bipartite graph G = (V, E) has weights on its edges; |V| = 9 and wt(G) = −4. Let K = 9, that is, we want to find an induced subgraph G' = (V', E') such that |V'| + wt(G') ≥ 9. We can construct a 2-dimensional 2-level data region D as in Figure 3.7. The data cells marked by '○' are blue cells and correspond to the edges with weight −1 in graph G. The unmarked data cells are non-blue cells and correspond to the edges with weight 1 in graph G. K' = −(|V| + wt(G)) + K = −(9 − 4) + 9 = 4; that is, we want to find a description with benefit 4. We can find a description with benefit 4 for D:

S = a + d + 1 + 3 + (b, 5) − {(a,1) + (a,3) + (d,3)}
Benefit(S) = |D| − |S| = 12 − 8 = 4

S contains rows R1 = {a, d} and columns C1 = {1, 3}.
Based on this description, we can construct the induced subgraph G' = (V', E'), V' = V − (R1 ∪ C1) = {b, c, 2, 4, 5}, wt(G') = 5 − 1 = 4, so |V'| + wt(G') = 5 + 4 = 9. •

The problem in Theorem 3.2.2 is the decision problem. The optimization version of this problem is to find a description with the maximum benefit (equivalently, with the minimum length) for a data set.

Figure 3.7: Correspondent Data Cube (data cells marked by ○ are blue cells and correspond to the edges with weight −1; unmarked cells are non-blue cells and correspond to the edges with weight 1)

Corollary 3.2.1. MDLE is an NP-Hard problem, i.e., it is NP-Hard to find a description with exceptions of the maximum benefit (equivalently, of the minimum length) for a 2-dimensional 2-level hierarchical data cube. •

We have shown that the MDLE problem is already NP-Hard in a 2-dimensional 2-level data cube. So in the general case, a multi-dimensional hierarchical data cube, this problem is also NP-Hard.

Corollary 3.2.2. MDLE is an NP-Hard problem in a multidimensional hierarchical data cube. •

Chapter 4

Heuristics and Experiments

In the previous chapter we analyzed the complexity of the MDLE problem and proved that it is NP-Hard. In this chapter we study one tractable case and then propose three heuristics for the MDLE problem. We also compare these three heuristics for MDLE with the MDL-Tree algorithm and the GMDL method.

4.1 A Tractable Case: No Shared Holes

Before we go to heuristics for MDLE, let us study a tractable case.

4.1.1 Preliminaries

In a 2-dimensional 2-level hierarchical data cube, given a row ri and a column cj, the intersecting cell is the data cell covered by both ri and cj, i.e. dij = (ri, cj).
From Equation 3.4 in the proof of Theorem 3.2.2, we know that:

• if dij is a blue cell, then the total benefit of ri and cj is Benefit(ri) + Benefit(cj) − 1;
• if dij is a non-blue cell, then the total benefit of ri and cj is Benefit(ri) + Benefit(cj) + 1.

We call the intersecting cell of a row and a column the shared cell of this row and this column. For example, dij = (ri, cj) is the shared cell of row ri and column cj. A shared cell can be a blue cell or a non-blue cell, which is the reason that the selections between rows and columns are hard. If a shared cell is a non-blue cell, we call it a shared hole.

Given an interesting data set D in a 2-dimensional 2-level hierarchical data cube H = R × C, a description S for D is (R1 ⊆ R, C1 ⊆ C):

S = ∪ ri + ∪ cj − {dij | dij ∈ Cell(R1) ∪ Cell(C1), dij.color ≠ blue} + {dij | dij ∉ Cell(R1) ∪ Cell(C1), dij.color = blue}

A row ri is in S iff ri ∈ R1. S covers the rows in R1 ⊆ R and the columns in C1 ⊆ C. If the rows in R1 and the columns in C1 do not share any non-blue cells, i.e. ∀dij ∈ Cell(R1) ∩ Cell(C1), dij.color = blue, then we say S is a description without shared holes. If the optimal description for D in H is a description without shared holes, then we say that D does not contain shared holes. In the remainder of Section 4.1 we only discuss the case without shared holes. This means that for any ri ∈ R1 and cj ∈ C1, dij = (ri, cj) is a blue cell, and the total benefit of ri and cj is

Benefit(ri & cj) = Benefit(ri) + Benefit(cj) − 1     (4.1)

From Equation 4.1 we know that, in a data cube without shared holes, selecting a row and a column together decreases the sum of the benefit of this row and the benefit of this column by 1. Therefore we get the following theorem:

Theorem 4.1.1. In a 2-dimensional 2-level data cube without shared holes, the rows and columns contained in Sopt have non-negative benefits. •

Proof.
From Equation 3.5 we know the benefit of S_opt is (R_opt is the set of rows covered by S_opt and C_opt is the set of columns covered by S_opt):

Benefit(S_opt) = Σ_{r_i ∈ R_opt} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j)
              + Σ_{d_ij ∈ Cell(R_opt) ∩ Cell(C_opt), d_ij.color = blue} (−1)
            = Σ_{r_i ∈ R_opt} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j)
              − |R_opt| × |C_opt|    (4.2)

Suppose there is a row with negative benefit in S_opt, say r_0 ∈ R_opt with Benefit(r_0) < 0. Let R'_1 = R_opt − r_0. The new description S' generated from R'_1 and C_opt is

S' = ∪_{r_i ∈ R'_1} r_i + ∪_{c_j ∈ C_opt} c_j
   − {d_ij | d_ij ∈ Cell(R'_1) ∪ Cell(C_opt), d_ij.color ≠ blue}
   + {d_ij | d_ij ∉ Cell(R'_1) ∪ Cell(C_opt), d_ij.color = blue}

The benefit of S' is

Benefit(S') = Σ_{r_i ∈ R'_1} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j) − |R'_1| × |C_opt|
            = Σ_{r_i ∈ R_opt} Benefit(r_i) − Benefit(r_0) + Σ_{c_j ∈ C_opt} Benefit(c_j)
              − (|R_opt| − 1) × |C_opt|
            = Benefit(S_opt) − Benefit(r_0) + |C_opt|    (4.3)

Because Benefit(r_0) < 0, we have −Benefit(r_0) + |C_opt| > 0, and therefore Benefit(S') > Benefit(S_opt). This contradicts the definition of the optimal description, namely that S_opt is the description with the maximum benefit. Therefore the assumption that S_opt contains a row with negative benefit is not true, so S_opt does not contain rows with negative benefits. For the same reason, S_opt does not contain columns with negative benefits. □

In a 2-dimensional 2-level hierarchical data cube, any two rows intersect with the same set of columns. Therefore we can treat two rows with the same benefit alike. From Equation 4.2 in the proof of Theorem 4.1.1, only the rows and columns with non-negative benefits can be selected into the optimal description. So we have the following lemma:

Lemma 4.1.1. If two rows r_1, r_2 have the same non-negative benefit, then we can merge r_1, r_2 into a group. Both or neither of them are selected into the optimal description, i.e., r_1 ∈ R_opt ⟺ r_2 ∈ R_opt. (The similar conclusion holds for columns.)

Proof.

1.
r_1 ∈ R_opt ⟹ r_2 ∈ R_opt: Let R'_1 = R_opt − r_1. For the optimal description S_opt, we have

Benefit(S_opt) = Σ_{r_i ∈ R_opt} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j) − |R_opt| × |C_opt|
              = Σ_{r_i ∈ R'_1} Benefit(r_i) + Benefit(r_1) + Σ_{c_j ∈ C_opt} Benefit(c_j)
                − (|R'_1| + 1) × |C_opt|
              = Σ_{r_i ∈ R'_1} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j) − |R'_1| × |C_opt|
                + Benefit(r_1) − |C_opt|    (4.4)

Let us use S' to denote the description generated from R'_1 and C_opt; then

Benefit(S') = Σ_{r_i ∈ R'_1} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j) − |R'_1| × |C_opt|    (4.5)

Applying Equation 4.5 to Equation 4.4, we have

Benefit(S_opt) = Benefit(S') + Benefit(r_1) − |C_opt|

Because S_opt is the optimal description, Benefit(S_opt) ≥ Benefit(S'). Therefore we have

Benefit(r_1) − |C_opt| ≥ 0    (4.6)

Benefit(r_2) = Benefit(r_1), so Benefit(r_2) − |C_opt| ≥ 0. Now suppose r_2 is not in the optimal description. Let R''_1 = R_opt + r_2; then the benefit of the description S'' generated from R''_1 and C_opt is

Benefit(S'') = Σ_{r_i ∈ R''_1} Benefit(r_i) + Σ_{c_j ∈ C_opt} Benefit(c_j) − |R''_1| × |C_opt|
             = Σ_{r_i ∈ R_opt} Benefit(r_i) + Benefit(r_2) + Σ_{c_j ∈ C_opt} Benefit(c_j)
               − (|R_opt| + 1) × |C_opt|
             = Benefit(S_opt) + (Benefit(r_2) − |C_opt|)
             ≥ Benefit(S_opt)    (4.7)

We need to discuss two cases of Equation 4.7 here:

- Benefit(S'') = Benefit(S_opt): S'' has the same benefit as S_opt. Whether or not r_2 is included, the benefit stays the same, so r_2 can also be included in the optimal description;
- Benefit(S'') > Benefit(S_opt): S'' has more benefit than S_opt, which contradicts the definition of the optimal description. Therefore the assumption that r_2 is not in the optimal description is not true, so r_2 is in the optimal description.

2. r_2 ∈ R_opt ⟹ r_1 ∈ R_opt can be proved similarly. □

From Lemma 4.1.1, in a 2-dimensional 2-level hierarchical data cube we can collect rows with the same non-negative benefit into a group. We have the following definitions for a group.

Definition 4.1.1. For a group of rows g (without loss of generality, we only consider groups of rows here), we have the following definitions:

1.
The cardinality of g is defined as the number of rows it contains:

|g| = |{r_k | r_k ∈ g}|

2. The benefit of g is defined as the benefit of any one row in g:

Benefit(g) = Benefit(r_k), r_k ∈ g

3. The total benefit of g is the sum of the benefits of all rows in g:

TotalBenefit(g) = Σ_{r_k ∈ g} Benefit(r_k) = |g| × Benefit(g)

After we collect all rows into groups based on their benefits, we need a method to select groups into the optimal description. The following theorem gives a selection rule that depends on the benefits of the groups.

Theorem 4.1.2. Given two groups of rows g_1, g_2 with Benefit(g_1) ≥ Benefit(g_2): if the rows contained in g_2 are covered by S_opt, then the rows contained in g_1 are covered by S_opt too; if the rows contained in g_1 are not covered by S_opt, then neither are the rows contained in g_2:

g_2 ⊆ R_opt ⟹ g_1 ⊆ R_opt
g_1 ⊄ R_opt ⟹ g_2 ⊄ R_opt

Proof. Suppose r_1 ∈ g_1, r_2 ∈ g_2. From Definition 4.1.1 we know Benefit(r_1) ≥ Benefit(r_2), so Benefit(r_1) − |C_opt| ≥ Benefit(r_2) − |C_opt|.

- g_2 ⊆ R_opt, so r_2 ∈ R_opt. From Equation 4.6 we have Benefit(r_2) − |C_opt| ≥ 0, so Benefit(r_1) − |C_opt| ≥ 0 too. Based on Equation 4.7, r_1 is included in R_opt. From Lemma 4.1.1, we know g_1 ⊆ R_opt;
- g_1 ⊄ R_opt: from Equation 4.6, we have 0 > Benefit(r_1) − |C_opt| ≥ Benefit(r_2) − |C_opt|. Therefore r_2 is not included in the optimal description, and neither is g_2. □

4.1.2 Algorithm

Based on Lemma 4.1.1, Theorem 4.1.2 and Definition 4.1.1, we have the following algorithm to find optimal descriptions for a 2-dimensional 2-level hierarchical data cube:

Algorithm 4.1.1. Matrix Filling Algorithm

Input: a 2-dimensional 2-level data cube H = R × C without shared holes, and an interesting data set D.

1. Compute the benefit of each row/column;

2. ∀r_i, r_j ∈ R, if Benefit(r_i) = Benefit(r_j) ≥ 0, then put r_i, r_j into one group (group columns in the same way);

3. Order the row groups in descending order of benefit; p is the number of row groups.
GR = {g_ri | 1 ≤ i ≤ p, ∀1 ≤ k < l ≤ p, Benefit(g_rk) > Benefit(g_rl)}

Column groups are ordered in the same way; q is the number of column groups.

GC = {g_cj | 1 ≤ j ≤ q, ∀1 ≤ k < l ≤ q, Benefit(g_ck) > Benefit(g_cl)}

4. Build a (p + 1) × (q + 1) matrix M as:

M_ij = Σ(total benefit of row groups from 1 to i) + Σ(total benefit of column groups from 1 to j)
     − |blue cells covered both by row groups and column groups|
     = Σ_{1≤k≤i} TotalBenefit(g_rk) + Σ_{1≤l≤j} TotalBenefit(g_cl)
       − (Σ_{1≤k≤i} |g_rk|) × (Σ_{1≤l≤j} |g_cl|)

which is filled by the recurrence

M_ij = 0                                                        if i, j = 0
M_ij = M_{i−1,0} + TotalBenefit(g_ri)                           if j = 0
M_ij = M_{0,j−1} + TotalBenefit(g_cj)                           if i = 0
M_ij = M_{i−1,j} + M_{i,j−1} − M_{i−1,j−1} − |g_ri| × |g_cj|    if i × j ≠ 0    (4.8)

Output: M_{imax,jmax} = max{M_ij, 0 ≤ i ≤ p, 0 ≤ j ≤ q} is the benefit of the optimal description. The optimal description covers the rows in R_opt = {r_k | r_k ∈ g_ri, 1 ≤ i ≤ imax} and the columns in C_opt = {c_l | c_l ∈ g_cj, 1 ≤ j ≤ jmax}:

S_opt = ∪_{r_i ∈ R_opt} r_i + ∪_{c_j ∈ C_opt} c_j
      − {d_kl | d_kl ∈ Cell(R_opt) ∪ Cell(C_opt), d_kl.color ≠ blue}
      + {d_kl | d_kl ∉ Cell(R_opt) ∪ Cell(C_opt), d_kl.color = blue}

Example 4.1.1. The example in Figure 4.1 (Figure 4.1: Data Region without Shared Holes) is a 2-dimensional 2-level hierarchical data cube without shared holes; all interesting cells are marked as O. We compute the benefits of the rows and columns, and then order and group them by benefit:

group | rows        | length | benefit      group | columns     | length | benefit
g_r1  | (g,9),(h,9) | 2      | 5            g_c1  | (i,1)       | 2      | 5
g_r2  | (f,9)       | 3      | 3            g_c2  | (i,2)       | 3      | 3
g_r3  | (e,9)       | 4      | 1            g_c3  | (i,3),(i,4) | 4      | 1

This example has 3 row groups and 3 column groups. We fill a 4 × 4 matrix M to get the optimal description:

M = |  0  5  8 10 |
    | 10 13 14 12 |
    | 13 15 15 11 |
    | 14 15 14  8 |

- M_{0,0} = 0: we do not select any row or column and only enumerate all blue cells, so R_1 = C_1 = ∅ and the benefit is 0;
- M_{0,1} = 5: we select the columns in g_c1, i.e., (i,1), and the benefit is 5;
- M_{0,2} = 8: we select the columns in g_c1 and g_c2, i.e.,
{(i,1), (i,2)}, and the benefit is M_{0,1} + Benefit(i,2) = 5 + 3 = 8;

- M_{0,3} = 10: we select the columns in g_c1, g_c2 and g_c3, i.e., {(i,1), (i,2), (i,3), (i,4)}, and the benefit is M_{0,2} + Benefit(i,3) + Benefit(i,4) = M_{0,2} + TotalBenefit(g_c3) = 8 + 1 × 2 = 10;

- M_{2,1} = 15: we select the rows in g_r1, g_r2 and the columns in g_c1. R_1 = {(f,9), (g,9), (h,9)}, C_1 = {(i,1)}. Therefore the benefit of this selection is

Σ_{r_i ∈ R_1} Benefit(r_i) + Benefit(i,1) − |R_1| × |C_1| = 5 + 5 + 3 + 5 − 3 = 15

At the same time we can see that

M_{1,1} + M_{2,0} − M_{1,0} − |g_r2| × |g_c1| = 13 + 13 − 10 − 1 × 1 = 15

Therefore M_{2,1} = M_{1,1} + M_{2,0} − M_{1,0} − |g_r2| × |g_c1|.

From the matrix M, we know that the maximum value is 15, and we can construct the optimal descriptions from M. For M_max = M_{2,1} = 15, all rows in groups g_r1, g_r2 and all columns in group g_c1 are in the optimal description S_opt:

S_opt = (h,9) + (g,9) + (f,9) + (i,1)
      − {(g,7) + (h,8) + (f,7) + (f,8) + (a,1)}
      + (a,2) + (d,2) + (d,3) + (d,4) + (e,2) + (e,3) + (e,4) + (e,5)

The length of S_opt is 17. The data set D has 32 data cells, so the benefit of this description is 32 − 17 = 15 = M_{2,1}. Since M_{2,1} = M_{3,1} = M_{2,2} = 15, the optimal description is not unique: this example has three optimal descriptions, each with a benefit of 15.

Time Complexity: The running time of this algorithm is polynomial. Let |R| = n_r and |C| = n_c. Ordering and grouping all the rows and columns can be done in O(n_r lg n_r + n_c lg n_c), and filling the matrix can be done in O(n_r × n_c).

Theorem 4.1.3. Algorithm 4.1.1 generates an optimal description for a data set D in a 2-dimensional 2-level data cube H = R × C without shared holes.

Proof. Firstly, rows with the same benefit are grouped together based on Lemma 4.1.1. From Theorem 4.1.2 we know that groups with larger benefit are selected before groups with smaller benefit.
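The matrix filling of Equation 4.8 needs only the ordered (size, benefit) pairs of the groups. A minimal sketch, using the group sizes and benefits of Example 4.1.1; the returned indices count how many row and column groups the best entry selects:

```python
def matrix_filling(row_groups, col_groups):
    """Matrix Filling (Algorithm 4.1.1) for a 2-dimensional 2-level cube
    without shared holes.  Each group is a (size, benefit) pair, already
    ordered by decreasing benefit.  Returns (max benefit, #row groups
    selected, #column groups selected)."""
    p, q = len(row_groups), len(col_groups)
    M = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):                     # first column: rows only
        size, ben = row_groups[i - 1]
        M[i][0] = M[i - 1][0] + size * ben        # + TotalBenefit(g_ri)
    for j in range(1, q + 1):                     # first row: columns only
        size, ben = col_groups[j - 1]
        M[0][j] = M[0][j - 1] + size * ben        # + TotalBenefit(g_cj)
    for i in range(1, p + 1):                     # recurrence of Equation 4.8
        for j in range(1, q + 1):
            M[i][j] = (M[i - 1][j] + M[i][j - 1] - M[i - 1][j - 1]
                       - row_groups[i - 1][0] * col_groups[j - 1][0])
    return max((M[i][j], i, j)
               for i in range(p + 1) for j in range(q + 1))

# Groups from Example 4.1.1, as (size, benefit), by decreasing benefit:
rows = [(2, 5), (1, 3), (1, 1)]
cols = [(1, 5), (1, 3), (2, 1)]
print(matrix_filling(rows, cols))  # -> (15, 3, 1)
```

The maximum 15 is attained by three entries of M, mirroring the three optimal descriptions noted in the example; the sketch returns one of them.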
Therefore we order the row groups in descending order of benefit; columns are grouped and ordered in the same way. Given a group of rows g_ri, the groups of rows that have larger benefits than g_ri are {g_rk | 1 ≤ k < i}. If g_ri is selected, then all groups in {g_rk | 1 ≤ k < i} must be selected too. For the same reason, if a group of columns g_cj is selected, all groups in {g_cl | 1 ≤ l < j} must be selected too.

Let R_1 = {r_p | r_p ∈ g_rk, 1 ≤ k ≤ i} and C_1 = {c_q | c_q ∈ g_cl, 1 ≤ l ≤ j}; the description generated from R_1 and C_1 has benefit

Σ_{r_p ∈ R_1} Benefit(r_p) + Σ_{c_q ∈ C_1} Benefit(c_q) − |R_1| × |C_1|
  = Σ_{1≤k≤i} TotalBenefit(g_rk) + Σ_{1≤l≤j} TotalBenefit(g_cl)
    − (Σ_{1≤k≤i} |g_rk|) × (Σ_{1≤l≤j} |g_cl|)

In the matrix M, we use M_ij to store the benefit of the description obtained by selecting the rows in groups {g_rk | 1 ≤ k ≤ i} and the columns in groups {g_cl | 1 ≤ l ≤ j}. Therefore, once we find the maximum value among all elements of M, we can generate a description with the maximum benefit for D.

Now we show that Equation 4.8 in Algorithm 4.1.1 generates the correct benefits. Algorithm 4.1.1 has four cases for filling the matrix M:

- i, j = 0. We do not select any row or column and only enumerate the blue cells in D, so we cannot get any benefit from this description. Therefore M_ij = 0;

- j = 0. We only select rows in groups g_rk, 1 ≤ k ≤ i, so the benefit of this description is Σ_{1≤k≤i} TotalBenefit(g_rk). We also have M_{i−1,0} = Σ_{1≤k≤i−1} TotalBenefit(g_rk), so

M_ij = M_{i,0} = Σ_{1≤k≤i} TotalBenefit(g_rk) = M_{i−1,0} + TotalBenefit(g_ri)

- i = 0. This case is proved similarly to the case j = 0;

- i × j ≠ 0.
We select the rows in groups g_rk, 1 ≤ k ≤ i, and the columns in groups g_cl, 1 ≤ l ≤ j, and the benefit is

M_ij = Σ_{1≤k≤i} TotalBenefit(g_rk) + Σ_{1≤l≤j} TotalBenefit(g_cl)
     − (Σ_{1≤k≤i} |g_rk|) × (Σ_{1≤l≤j} |g_cl|)

At the same time, we have

M_{i−1,j} = Σ_{1≤k≤i−1} TotalBenefit(g_rk) + Σ_{1≤l≤j} TotalBenefit(g_cl)
          − (Σ_{1≤k≤i−1} |g_rk|) × (Σ_{1≤l≤j} |g_cl|)

M_{i,j−1} = Σ_{1≤k≤i} TotalBenefit(g_rk) + Σ_{1≤l≤j−1} TotalBenefit(g_cl)
          − (Σ_{1≤k≤i} |g_rk|) × (Σ_{1≤l≤j−1} |g_cl|)

M_{i−1,j−1} = Σ_{1≤k≤i−1} TotalBenefit(g_rk) + Σ_{1≤l≤j−1} TotalBenefit(g_cl)
            − (Σ_{1≤k≤i−1} |g_rk|) × (Σ_{1≤l≤j−1} |g_cl|)

Combining these, we get

M_{i−1,j} + M_{i,j−1} − M_{i−1,j−1} = Σ_{1≤k≤i} TotalBenefit(g_rk) + Σ_{1≤l≤j} TotalBenefit(g_cl)
                                    − (Σ_{1≤k≤i} |g_rk|) × (Σ_{1≤l≤j} |g_cl|) + |g_ri| × |g_cj|

So we have

M_ij = M_{i−1,j} + M_{i,j−1} − M_{i−1,j−1} − |g_ri| × |g_cj|

Therefore, after we find M_{imax,jmax} = max{M_ij, 0 ≤ i ≤ p, 0 ≤ j ≤ q}, we let R_opt = {r_k | r_k ∈ g_ri, 1 ≤ i ≤ imax} and C_opt = {c_l | c_l ∈ g_cj, 1 ≤ j ≤ jmax}, and we can construct a description with benefit M_{imax,jmax} as

S = ∪_{r_i ∈ R_opt} r_i + ∪_{c_j ∈ C_opt} c_j
  − {d_kl | d_kl ∈ Cell(R_opt) ∪ Cell(C_opt), d_kl.color ≠ blue}
  + {d_kl | d_kl ∉ Cell(R_opt) ∪ Cell(C_opt), d_kl.color = blue}

S is an optimal description for D in H. □

4.1.3 Multi-Level Hierarchy Also Hard

From Algorithm 4.1.1 we know that the MDLE problem can be solved in polynomial time in a 2-dimensional 2-level hierarchical data cube without shared holes. But what about a multi-level or multi-dimensional hierarchical data cube? Let us analyze an example first (Figure 4.2: Multi-Level Hierarchical Data Cube).

Example 4.1.2.
For the example in Figure 4.2, the optimal description is

S = (1,v) − {(1,f) + (1,q) + (1,r)}
  + (2,s) + (5,s) + (6,s) + (3,a) + (4,b) − {(2,f) + (5,f) + (6,e) + (6,f)}
  + (7,g) + (7,h) + (7,i) + (7,j) + (3,l) − {(3,g) + (3,h) + (3,i) + (4,j)}
  + (2,u) + (3,u) + (4,u) + (5,m) + (6,n) − {(2,r) + (3,q) + (4,q) + (4,r)}

In the proofs of Lemma 4.1.1 and Theorem 4.1.2, an important property we used is that all rows intersect with the same columns, i.e., all rows in R_opt intersect with all columns in C_opt. But in this example, row (1,v) and row (2,s) intersect with different sets of columns. At the same time, row (6,s) and row (2,t) have the same benefit, and column (7,g) and column (7,a) have the same benefit, but (6,s) and (2,t) are not in one group, and neither are (7,g) and (7,a), because the optimal description contains only (6,s) and (7,g).

The consequence of losing this property is that we can no longer order and group rows and columns by benefit. Then Lemma 4.1.1 and Theorem 4.1.2 no longer hold, and therefore Algorithm 4.1.1 does not work either. So in a multi-level or multi-dimensional hierarchical data cube, MDLE remains NP-Hard.

4.2 Heuristics

4.2.1 Basic Definitions

We already have definitions for the 1-dimensional and the 2-dimensional hierarchical data cube. In this section we present some definitions for the more general case: the multi-dimensional multi-level hierarchical data cube.

Definition 4.2.1. A multi-dimensional hierarchical data cube has the form H = <H_1, H_2, ..., H_n>, where each H_i, 1 ≤ i ≤ n, is an l_i-level hierarchy as defined in Definition 3.1.1.

Definition 4.2.2. In a data cube H = <H_1, H_2, ..., H_n>:

- a data cell is defined as (x_1, x_2, ..., x_n) if ∀x_i, x_i ∈ Leaves(H_i);
- a data region is defined as (x_1, x_2, ..., x_n) if ∃x_i, x_i ∉ Leaves(H_i).

Definition 4.2.3.
A parent region is a region that has a non-leaf node in exactly one dimension, and this non-leaf node is a parent of leaves. It can be defined as:

{(x_1, x_2, ..., x_n) | ∃i, ∀x' ∈ Cld(x_i), x' ∈ Leaves(H_i), and ∀1 ≤ j ≤ n, j ≠ i, x_j ∈ Leaves(H_j)}

4.2.2 A Greedy Heuristic

The greedy method is a very common heuristic for NP-Hard problems. From the previous chapter we know that MDLE tries to minimize the length of the description of a given data set, i.e., to maximize the benefit of the description. So the main goal of MDLE is to reduce the number of regions and cells used to cover the interesting data set. The benefit of a description is the number of regions and cells saved by using this description to cover the interesting data set.

We order the regions in descending order of benefit. Each time, the region with the largest benefit is selected, until all remaining regions have negative benefits. After the selection of regions we mark all the non-blue cells in these regions as blue, and then we apply the MDL-Tree algorithm of [11] to the data cube. The MDL-Tree algorithm generates the unique optimal MDL description for the blue cells in the data cube. This MDL description, together with the exceptions for the non-blue cells in the selected regions, is the MDLE description generated by our greedy heuristic.

Algorithm 4.2.1. Greedy Algorithm

Input: a multi-dimensional data cube H and an interesting data set D
Output: a description S for D

1. Initialize S = D, E = S' = E' = ∅; initialize an empty hash table Q;
/* S and S' are sets of blue cells; E and E' are sets of holes. S − E is a description for D. Each time we select a region x, S' contains the blue cells in S but not in x, and E' contains the holes in E and in x. S' − E' is the new description obtained by selecting x. */

2. Compute the benefit of each parent region; insert the regions into the hash table Q: for each region, the hash key is computed from its benefit and the number of holes it contains.
/* A region with more benefit always has a greater hash key; for two regions with the same benefit, the region containing fewer holes has the greater hash key. */

3. Select the region x with the greatest hash key in Q, and let

S' = S − {x' | x' ∈ x, x'.color = blue}
E' = E + {x'' | x'' ∈ x, x''.color ≠ blue}

4. If |S'| + |E'| < |S| + |E| then let S = S', E = E';

5. Delete x from Q;

6. If Q is not empty then go to Step 3;

7. For each cell in E, find the corresponding cell in H and mark it as blue. Apply the MDL-Tree algorithm to the new blue cells in H; the description generated by MDL-Tree is stored in S;

8. S − E is the MDLE description for D in H.

Complexity: Based on Definition 3.1.1, n^i_{l_i−1} is the number of nodes in level l_i − 1 of hierarchy H_i and n^j_{l_j} is the number of nodes in level l_j of hierarchy H_j. The number of parent regions in a data cube H is

Σ_{1≤i≤n} ( n^i_{l_i−1} × Π_{1≤j≤n, j≠i} n^j_{l_j} )

For any hierarchy H_i, each node in level l_i has a parent in level l_i − 1, so n^i_{l_i−1} ≤ n^i_{l_i}. Let us write N = Π_{1≤i≤n} n^i_{l_i} for the number of data cells in the data cube H. Then

Σ_{1≤i≤n} ( n^i_{l_i−1} × Π_{1≤j≤n, j≠i} n^j_{l_j} ) ≤ Σ_{1≤i≤n} ( n^i_{l_i} × Π_{1≤j≤n, j≠i} n^j_{l_j} ) = Σ_{1≤i≤n} Π_{1≤j≤n} n^j_{l_j} = n × N

We use a hash table to store the regions, where the hash key is computed from the benefit of each region; the region with more benefit has a greater hash key, and the hash table size is less than n × N. Therefore the time complexity of the greedy algorithm is O(n × N) in the worst case.

Note that we restrict the candidate regions to parents of leaves. We could also extend the candidates to grandparents, great-grandparents or higher-level regions, but they would require more work to order: that a grandparent region has more benefit than one of its children does not mean that we really should choose the grandparent before we choose its children. To compensate, we apply the MDL-Tree algorithm to the data cube after the selection finishes, which merges parent regions into upper levels.

Example 4.2.1.
For the example in Figure 4.3, first we compute the benefits of the parent regions; then we select regions in descending order of benefit.

region | length | benefit | holes        region | length | benefit | holes
(e,6)  | 1      | 3       | −            (e,3)  | 2      | 1       | (b,3)
(d,10) | 2      | 2       | (d,5)        (a,11) | 2      | 1       | (a,8)
(e,1)  | 2      | 1       | (a,1)        (e,8)  | 2      | 1       | (a,8)
(e,2)  | 2      | 1       | (b,2)        (c,10) | 3      | 0       | (c,4),(c,5)

(e,6) has the largest benefit of all regions, so we select it first. The second region is (d,10). Now the current S is:

S = (e,6) + (d,10) − (d,5)
  + (a,2) + (a,3) + (b,1) + (b,5) + (c,1) + (c,2) + (c,3)
  + (a,7) + (a,9) + (b,8) + (c,8) + (d,8)

The length of S is 15, |S| = 15. The next region is (e,1). If we select (e,1), then we can delete (b,1) and (c,1) from S, but we need to add (e,1) and the hole (a,1), so the length of the description does not decrease. Therefore we cannot select (e,1), and likewise neither (e,2) nor (e,3). After the selection of (a,11) and (e,8) the algorithm stops, and its output is

S = (d,10) − (d,5)
  + (a,2) + (a,3) + (b,1) + (b,5) + (c,1) + (c,2) + (c,3)
  + (e,6) + (a,11) + (e,8) − (a,8)

The length of S is 13 and the total benefit is 7.

4.2.3 Dynamic Programming

Dynamic programming is also a widely used method to approximately solve NP-Complete problems. The core idea of dynamic programming is that a problem can be decomposed into several sub-problems, and the answers to those sub-problems lead to a solution of the whole problem [17]. The multi-dimensional hierarchical data cube has such a structure, so we can use dynamic programming as a heuristic method (Figure 4.3: Example for Heuristics).

Given a data cube H = <H_1, H_2, ..., H_n> and an interesting data set D, for any data region x = (x_1, x_2, ..., x_n), if x_i is not a leaf, i.e.,
x_i ∉ Leaves(H_i), we can generate a description for x from the descriptions of the children regions of x_i, that is,

S(x) = ∪_{x_ik ∈ Cld(x_i)} S(x_ik)    (4.9)

and the length of S(x) can be computed by

|S(x)| = |∪_{x_ik ∈ Cld(x_i)} S(x_ik)| = Σ_{x_ik ∈ Cld(x_i)} |S(x_ik)|

S(x) is a description generated from children regions, and we call S(x) a local solution for x. For the data region x we can also use the following description to cover the blue cells in x:

S = x − {x' | x' ∈ Cell(x), x'.color ≠ blue}    (4.10)

The length of S is

|S| = 1 + |{x' | x' ∈ Cell(x), x'.color ≠ blue}|

S covers all blue cells in x by x itself, with all non-blue cells in x as exceptions. We call S the global solution for x. For each region x, we find all local solutions and the global solution, and then we use the solution with the minimum length as the description for x. This procedure is shown in the following algorithm.

Algorithm 4.2.2. Dynamic Programming

Input: H = <H_1, H_2, ..., H_n>, and an interesting data set D;

1. Build an n-dimensional matrix M of size |H_1| × |H_2| × ... × |H_n|, where |H_i| is the number of nodes in hierarchy H_i; build an n-dimensional matrix P in the same way.

2. Label the nodes in each hierarchy H_i in post-order: the first leaf is labelled '1' and the root is labelled |H_i|. We use the label of a node as the index of this node in the corresponding dimension of a matrix.
/* For example, for a region x = (x_1, x_2, ..., x_n), the value of x in M is the element M(index(x_1), index(x_2), ..., index(x_n)). To present the values in M more clearly, we simply write M(x_1, x_2, ..., x_n) for the value in M for x. */

3. For any data cell x = (x_1, x_2, ..., x_n), if x is a blue cell, then set M(x_1, x_2, ..., x_n) = 1, otherwise M(x_1, x_2, ..., x_n) = 0;

4.
For any data region x = (x_1, x_2, ..., x_n), we get the value of M(x_1, x_2, ..., x_n) by the following methods:

(a) for any 1 ≤ i ≤ n, if x_i ∉ Leaves(H_i), compute the length of the local solution (Equation 4.9) in dimension i:

Σ_{x_ik ∈ Cld(x_i)} M(x_1, x_2, ..., x_ik, ..., x_n)    (4.11)

(b) compute the length of the global solution (Equation 4.10) of x:

1 + |{x' | x' ∈ Cell(x), x'.color ≠ blue}|    (4.12)

M(x_1, x_2, ..., x_n) is the minimum over Equation 4.11 (for each non-leaf dimension i) and Equation 4.12.

5. We set the value of P(x_1, x_2, ..., x_n) based on the value of M(x_1, x_2, ..., x_n):

P(x_1, x_2, ..., x_n) = i   if M(x_1, x_2, ..., x_n) = Equation 4.11 for dimension i
P(x_1, x_2, ..., x_n) = 'g' if M(x_1, x_2, ..., x_n) = Equation 4.12

Complexity: The two matrices both have |H_1| × |H_2| × ... × |H_n| elements, so the total complexity of the dynamic programming is O(|H_1| × |H_2| × ... × |H_n|).

The matrix M stores the length of the description for each region (x_1, x_2, ..., x_n), and the matrix P stores the way to construct the description for the region. After we have computed the length of the description for D, we construct the description itself from the matrix P.

Algorithm 4.2.3. Description

Input: H, D, P, x = (x_1, x_2, ..., x_n)
Output: a description for the region x = (x_1, x_2, ..., x_n)

1. if P(x_1, x_2, ..., x_n) = 'g', the description is S = x − {x' | x' ∈ x, x'.color ≠ blue};

2. if P(x_1, x_2, ..., x_n) = i, 1 ≤ i ≤ n, the description is

S = ∪_{x_ik ∈ Cld(x_i)} Description(x_1, x_2, ..., x_ik, ..., x_n)

Complexity: This algorithm just splits the region to assemble the optimal descriptions; the complexity is O(max_i |H_i|).

Example 4.2.2. We still use the example in Figure 4.3. In the data cube H = <H_1, H_2>, we have H_1 = {a, b, c, d, e} and H_2 = {1, 2, 3, 4, 5, 10, 6, 7, 8, 9, 11, 12}. We build two 5 × 12 matrices M and P and fill their elements by the method shown in Algorithm 4.2.2. The matrices M and P are presented in the following tables.
M (columns are the nodes of H_2 in post-order, so the column index of node 10 is 6):

node | 1 | 2 | 3 | 4 | 5 | 10 | 6 | 7 | 8 | 9 | 11 | 12
a    | 0 | 1 | 1 | 0 | 0 | 2  | 1 | 1 | 0 | 1 | 2  | 4
b    | 1 | 0 | 0 | 0 | 1 | 2  | 1 | 0 | 1 | 0 | 2  | 4
c    | 1 | 1 | 1 | 0 | 0 | 3  | 1 | 0 | 1 | 0 | 2  | 5
d    | 1 | 1 | 1 | 1 | 0 | 2  | 1 | 0 | 1 | 0 | 2  | 4
e    | 2 | 2 | 2 | 1 | 1 | 8  | 1 | 1 | 2 | 1 | 5  | 13

P (entries exist only for regions; data cells have no entry):

node | 1 | 2 | 3 | 4 | 5 | 10 | 6 | 7 | 8 | 9 | 11 | 12
a    |   |   |   |   |   | 2  |   |   |   |   | g  | 2
b    |   |   |   |   |   | 2  |   |   |   |   | 2  | 2
c    |   |   |   |   |   | 2  |   |   |   |   | 2  | 2
d    |   |   |   |   |   | g  |   |   |   |   | 2  | 2
e    | g | g | g | 1 | 1 | 2  | g | 1 | g | 1 | 2  | 2

We label the nodes in each hierarchy in post-order; the label of each node is the index of the node in the corresponding dimension of the matrices M and P. For example, the index of node 10 is 6 and the index of node e is 5, so M(5,6) is the length for region (e,10). Some sample computations:

- (d,10): d is a leaf node and 10 is a non-leaf node. The local solution for node 10 is

Σ_{i ∈ Cld(10)} M(d,i) = 4

The global solution is 1 + |{x | x ∈ Cell(d,10), x.color ≠ blue}| = 1 + 1 = 2. Therefore M(d,10) = 2 and P(d,10) = 'g'.

- (e,10): both e and 10 are non-leaf nodes, so we compute two local solutions:

Σ_{i ∈ Cld(e)} M(i,10) = M(a,10) + M(b,10) + M(c,10) + M(d,10) = 9
Σ_{j ∈ Cld(10)} M(e,j) = M(e,1) + M(e,2) + M(e,3) + M(e,4) + M(e,5) = 8

The global solution is 1 + |{x | x ∈ Cell(e,10), x.color ≠ blue}| = 1 + 9 = 10. Therefore M(e,10) = 8, and since 10 is a node in H_2, P(e,10) = '2'.

- (e,12): both e and 12 are non-leaf nodes, so we compute two local solutions:

Σ_{i ∈ Cld(e)} M(i,12) = M(a,12) + M(b,12) + M(c,12) + M(d,12) = 17
Σ_{j ∈ Cld(12)} M(e,j) = M(e,10) + M(e,11) = 13

The global solution is 1 + |{x | x ∈ Cell(e,12), x.color ≠ blue}| = 1 + 16 = 17. Therefore M(e,12) = 13 and P(e,12) = '2'.

After we have created the matrix P, we can create the description for D:

- P(e,10) = '2', so S(e,10) is the union of S(e,1), S(e,2), S(e,3), S(e,4) and S(e,5).
P(e,1) = P(e,2) = P(e,3) = 'g': S(e,1) = (e,1) − (a,1), S(e,2) = (e,2) − (b,2), S(e,3) = (e,3) − (b,3);
  P(e,4) = P(e,5) = '1': S(e,4) = (d,4), S(e,5) = (b,5);
  therefore S(e,10) = (e,1) − (a,1) + (e,2) − (b,2) + (e,3) − (b,3) + (d,4) + (b,5).

- P(e,11) = '2', so S(e,11) is the union of S(e,6), S(e,7), S(e,8) and S(e,9).
  P(e,6) = P(e,8) = 'g': S(e,6) = (e,6), S(e,8) = (e,8) − (a,8);
  P(e,7) = P(e,9) = '1': S(e,7) = (a,7), S(e,9) = (a,9);
  therefore S(e,11) = (e,6) + (e,8) − (a,8) + (a,7) + (a,9).

- P(e,12) = '2', so S(e,12) is the union of S(e,10) and S(e,11), that is,

S(e,12) = (e,1) − (a,1) + (e,2) − (b,2) + (e,3) − (b,3) + (d,4) + (b,5)
        + (e,6) + (e,8) − (a,8) + (a,7) + (a,9)

From Example 4.2.1 and Example 4.2.2 we see that the greedy heuristic and the dynamic programming heuristic generate different descriptions for the data cube in Figure 4.3. The optimal description for this data cube is

S_opt = (e,1) − (a,1) + (e,2) − (b,2) + (e,3) − (b,3) + (d,4) + (b,5)
      + (e,6) + (a,11) + (e,8) − (a,8)

The length of the optimal description is 12 and its benefit is 8. Let us compare the descriptions in Example 4.2.1 and Example 4.2.2 with the optimal description:

- (e,10): the optimal description for this region is (e,1) − (a,1) + (e,2) − (b,2) + (e,3) − (b,3) + (d,4) + (b,5). The dynamic programming heuristic generated this description. In the greedy heuristic, the region selected in each loop, which has the largest benefit among the parent regions, is the best choice for the current step, but it may not be the best choice for the whole cube. In this example, the greedy heuristic selected (d,10), which has benefit 2, into the description; but if we select (e,1), (e,2) and (e,3), which each have benefit 1, the total benefit is 3. Therefore the greedy heuristic may create a suboptimal description.
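Algorithm 4.2.2 can be sketched as a direct recursion over regions. The blue-cell layout below is a reconstruction of Figure 4.3 inferred from the matrices of Example 4.2.2 (an assumption, since the original figure is not reproduced here); under that layout the sketch reproduces M(e,10) = 8, M(e,11) = 5 and M(e,12) = 13:

```python
# Hypothetical reconstruction of Figure 4.3 (an assumption): rows a-d are the
# children of root 'e'; leaf columns 1-9, with 10 = {1..5}, 11 = {6..9},
# and root 12 = {10, 11}.
children_rows = {'e': ['a', 'b', 'c', 'd']}
children_cols = {10: [1, 2, 3, 4, 5], 11: [6, 7, 8, 9], 12: [10, 11]}
blue = {('a', 2), ('a', 3), ('a', 6), ('a', 7), ('a', 9),
        ('b', 1), ('b', 5), ('b', 6), ('b', 8),
        ('c', 1), ('c', 2), ('c', 3), ('c', 6), ('c', 8),
        ('d', 1), ('d', 2), ('d', 3), ('d', 4), ('d', 6), ('d', 8)}

def leaf_set(node, children):
    """All leaves under a node of a (tiny) hierarchy."""
    if node in children:
        return [l for c in children[node] for l in leaf_set(c, children)]
    return [node]

def M(r, c):
    """Length of the shortest description Algorithm 4.2.2 finds for region (r, c)."""
    rl, cl = leaf_set(r, children_rows), leaf_set(c, children_cols)
    if len(rl) == 1 and len(cl) == 1:              # a single data cell
        return 1 if (r, c) in blue else 0
    holes = sum((a, b) not in blue for a in rl for b in cl)
    best = 1 + holes                               # global solution (Eq. 4.10)
    if r in children_rows:                         # local solution, dimension 1
        best = min(best, sum(M(a, c) for a in children_rows[r]))
    if c in children_cols:                         # local solution, dimension 2
        best = min(best, sum(M(r, b) for b in children_cols[c]))
    return best

print(M('e', 10), M('e', 11), M('e', 12))  # -> 8 5 13
```

Note that M('e', 11) = 5 even though the optimal description for (e,11) has length 4: as discussed below for this region, the dynamic program can only combine whole-region or whole-split solutions.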
- (e,11): the optimal description for this region is (e,6) + (a,11) + (e,8) − (a,8), and the greedy heuristic generated this description. The dynamic programming heuristic only compares the local solutions (Equation 4.9) with the global solution (Equation 4.10) and selects the shorter one. For (e,11) the optimal description is the combination of the row (a,11) with the columns (e,6) and (e,8), which is expressible by neither Equation 4.9 nor Equation 4.10; the dynamic program instead produces (e,6) + (e,8) − (a,8) + (a,7) + (a,9), of length 5. This is why the dynamic programming heuristic sometimes generates a description longer than the optimal one.

4.2.4 Quadratic Programming

Preliminaries

A quadratic function has the form

f(x) = (1/2) x^T H x + c x

where x is a vector of variables and H is a matrix. A function defined on a constrained space is a convex function iff

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),  0 ≤ α ≤ 1,  x, y ∈ constrained space

If f(x) is a convex quadratic function, then quadratic programming is the problem of minimizing the value of f(x) subject to linear equality and inequality constraints [9]. This kind of problem is called convex quadratic programming and has the form

Minimize f(x) = (1/2) x^T H x + c x
s.t. Ax ≥ b
     Bx = e
     lb ≤ x ≤ ub

In order to make sure that the quadratic function f has a minimum value on the constrained space, the function f should be convex. A matrix H is positive semi-definite iff for all x ∈ R^n, x^T H x ≥ 0. It has been proved that a quadratic function f(x) is convex iff H is positive semi-definite [2].

Quadratic Programming for MDLE

In order to use quadratic programming to solve the MDLE problem, we need to build the quadratic function and the linear constraints from the data cube H and the interesting data set D. Firstly let us study the 2-dimensional 2-level hierarchical data cube defined in Definition 3.2.4: H = R × C and D is an interesting data set.
A description S for D has the form

S = ∪_{r_i ∈ R_1} r_i + ∪_{c_j ∈ C_1} c_j
  − {d_ij | d_ij ∈ Cell(R_1) ∪ Cell(C_1), d_ij.color ≠ blue}
  + {d_ij | d_ij ∉ Cell(R_1) ∪ Cell(C_1), d_ij.color = blue}

The benefit of S is

Benefit(S) = Σ_{r_i ∈ R_1} Benefit(r_i) + Σ_{c_j ∈ C_1} Benefit(c_j)
           + Σ_{d_ij ∈ Cell(R_1) ∩ Cell(C_1), d_ij.color ≠ blue} (+1)
           + Σ_{d_ij ∈ Cell(R_1) ∩ Cell(C_1), d_ij.color = blue} (−1)

Let us use X_i to indicate whether r_i ∈ R_1 for any r_i ∈ R: if r_i ∈ R_1 then X_i = 1, otherwise X_i = 0. Similarly we use Y_j to indicate whether c_j ∈ C_1 for any c_j ∈ C. Then we have the following equations (n_r = |R|, n_c = |C|):

- the sum of the benefits of the rows in R_1 can be expressed as

Σ_{r_i ∈ R_1} Benefit(r_i) = Σ_{r_i ∈ R_1} 1 × Benefit(r_i) + Σ_{r_i ∉ R_1} 0 × Benefit(r_i)
                           = Σ_{1≤i≤n_r} X_i × Benefit(r_i)

- for the blue cells in Cell(R_1) ∩ Cell(C_1), i.e., d_ij = (r_i, c_j) with r_i ∈ R_1, c_j ∈ C_1, we have X_i = 1, Y_j = 1, therefore

Σ_{d_ij ∈ Cell(R_1) ∩ Cell(C_1), d_ij.color = blue} (−1)
  = Σ_{r_i ∈ R_1, c_j ∈ C_1, d_ij = (r_i,c_j), d_ij.color = blue} (−1) × X_i × Y_j

For the blue cells not in Cell(R_1) ∩ Cell(C_1), X_i × Y_j = 0, so their terms (−1) × X_i × Y_j vanish. Combining these two observations, we get

Σ_{d_ij ∈ Cell(R_1) ∩ Cell(C_1), d_ij.color = blue} (−1) = (−1) × Σ_{d_ij.color = blue} X_i × Y_j

Similarly we have the following two equations:

Σ_{c_j ∈ C_1} Benefit(c_j) = Σ_{1≤j≤n_c} Y_j × Benefit(c_j)

Σ_{d_ij ∈ Cell(R_1) ∩ Cell(C_1), d_ij.color ≠ blue} (+1) = 1 × Σ_{d_ij.color ≠ blue} X_i × Y_j

The benefit of S can thus be rewritten as

Benefit(S) = Σ_{1≤i≤n_r} X_i × Benefit(r_i) + Σ_{1≤j≤n_c} Y_j × Benefit(c_j)
           + 1 × Σ_{d_ij.color ≠ blue} X_i × Y_j + (−1) × Σ_{d_ij.color = blue} X_i × Y_j    (4.13)

The benefit function in Equation 4.13 is a quadratic function in the variables X_i, Y_j, with the constraints 0 ≤ X_i ≤ 1, 0 ≤ Y_j ≤ 1. The goal of MDLE is to maximize Equation 4.13, i.e., to minimize f = −Benefit(S). Therefore we can build the quadratic programming model as

Minimize f = −Benefit(S)
s.t. 0 ≤ X_i ≤ 1
     0 ≤ Y_j ≤ 1

Example 4.2.3.
Let us look at the example in Figure 4.4 (Figure 4.4: Example for Quadratic Programming); the benefits of the rows and columns are shown in the following table:

row   | X   | length | benefit      column | Y   | length | benefit
(a,5) | X_1 | 3      | −1           (d,1)  | Y_1 | 1      | 2
(b,5) | X_2 | 3      | −1           (d,2)  | Y_2 | 3      | −2
(c,5) | X_3 | 2      | 1            (d,3)  | Y_3 | 2      | 0
      |     |        |              (d,4)  | Y_4 | 3      | −2

Let us use X_1, X_2, X_3 to represent the rows and Y_1, Y_2, Y_3, Y_4 to represent the columns as shown in the table. We get a description with the largest benefit when we solve the following quadratic programming problem:

Minimize f = −( −X_1 − X_2 + X_3
              + 2Y_1 − 2Y_2 − 2Y_4
              − X_1Y_1 + X_1Y_2 − X_1Y_3 + X_1Y_4
              − X_2Y_1 + X_2Y_2 − X_2Y_3 + X_2Y_4
              − X_3Y_1 − X_3Y_2 + X_3Y_3 − X_3Y_4 )
s.t. 0 ≤ X_i ≤ 1
     0 ≤ Y_j ≤ 1

Transform f into the matrix form f = −((1/2) x^T H x + c x), where x is the 7 × 1 vector

x = (X_1, X_2, X_3, Y_1, Y_2, Y_3, Y_4)^T

c is the 1 × 7 matrix

c = ( −1  −1  1  2  −2  0  −2 )

and H is the 7 × 7 matrix

H = |  0  0  0 −1  1 −1  1 |
    |  0  0  0 −1  1 −1  1 |
    |  0  0  0 −1 −1  1 −1 |
    | −1 −1 −1  0  0  0  0 |
    |  1  1 −1  0  0  0  0 |
    | −1 −1  1  0  0  0  0 |
    |  1  1 −1  0  0  0  0 |

For this example, when X_1 = 0, X_2 = 0, X_3 = 1, Y_1 = 1, Y_2 = 0, Y_3 = 1, Y_4 = 0, the benefit reaches its maximal value of 3, and the corresponding description is

S = (c,5) + (d,1) + (d,3) − (c,3)

Heuristic Algorithm

From Example 4.2.3 we know that we can use quadratic programming as a heuristic for the MDLE problem. If the quadratic function is convex on the constrained space, we can reach global optimality and find the minimum value of the function; otherwise, global optimality and the minimum value cannot be guaranteed. In Example 4.2.3, f = −((1/2) x^T H x + c x) is not a convex function, and H is not a positive semi-definite matrix. So even though quadratic programming is a good model for the 2-dimensional 2-level hierarchy, global optimality cannot be reached in all cases; we can still use quadratic programming as a heuristic method for the MDLE problem.

On the other hand, from Equation 4.13 we know that X_i × Y_j represents the cell d_ij = (r_i, c_j).
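Because the variables are 0/1 indicators, Equation 4.13 can be checked by exhaustive search on a small instance. The blue-cell layout below is a reconstruction of Figure 4.4 inferred from the signs of the cross terms in Example 4.2.3 (an assumption); the exhaustive maximum matches the reported benefit of 3:

```python
from itertools import product

# Benefit coefficients from Example 4.2.3: rows X1..X3, columns Y1..Y4.
row_ben = [-1, -1, 1]
col_ben = [2, -2, 0, -2]
# blue[i][j] = 1 iff cell (row i, column j) is blue -- a reconstruction of
# Figure 4.4 from the signs of the cross terms (an assumption).
blue = [[1, 0, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 1]]

def benefit(X, Y):
    """Equation 4.13: blue shared cells contribute -Xi*Yj, non-blue +Xi*Yj."""
    total = sum(x * b for x, b in zip(X, row_ben))
    total += sum(y * b for y, b in zip(Y, col_ben))
    total += sum((-1 if blue[i][j] else 1) * X[i] * Y[j]
                 for i in range(3) for j in range(4))
    return total

# Exhaustive search over all 0/1 assignments: the model's true optimum.
best = max((benefit(X, Y), X, Y)
           for X in product((0, 1), repeat=3)
           for Y in product((0, 1), repeat=4))
print(best)  # -> (3, (0, 0, 1), (1, 0, 1, 0))
```

The maximizer selects row (c,5) and columns (d,1), (d,3), exactly the assignment quoted for the example; unlike a quadprog-style local search, the exhaustive check cannot be trapped by the non-convexity of H.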
A data cell in a 3-dimensional hierarchical data cube has the form X_i × Y_j × Z_k, and the benefit of a description is no longer a quadratic function. Therefore, in the algorithm we only apply quadratic programming on data regions x = (x_1, x_2, ..., x_n) such that x contains non-leaf nodes in exactly two dimensions. For such an x, we can express the benefit of a description as the quadratic function in Equation 4.13.

Algorithm 4.2.4. Quadratic Programming:

Input: H = {H_1, H_2, ..., H_n}, and an interesting data set D.

1. Build an n-dimensional matrix M of size |H_1| × |H_2| × ... × |H_n|, where |H_i| is the number of nodes in hierarchy H_i. Build an n-dimensional matrix P in the same way.

2. Label the nodes in each hierarchy H_i in post-order: the first leaf is labelled '1' and the root is labelled |H_i|. We use the label of a node as the index of this node in the corresponding dimension of a matrix. /* For example, for region x = (x_1, x_2, ..., x_n), the value of x in M is the element M(index(x_1), index(x_2), ..., index(x_n)). To present the values in M more clearly, we simply write M(x_1, x_2, ..., x_n) for the value in M for x. */

3. For those regions x = (x_1, x_2, ..., x_i, ..., x_j, ..., x_n) that contain two or more non-leaf nodes, we compute the value of M(x_1, x_2, ..., x_n) as follows:

(a) for any 1 ≤ i ≤ n, if x_i ∉ Leaves(H_i), compute the length of the local solution (Equation 4.9) on dimension i:

    Σ_{x_ik ∈ Cld(x_i)} M(x_1, ..., x_ik, ..., x_n)        (4.14)

(b) compute the length of the global solution (Equation 4.10) of x:

    1 + |{x′ | x′ ∈ Cell(x), x′.color ≠ blue}|        (4.15)

(c) if x contains exactly two non-leaf nodes, then we also call the function Quadratic(x). M(x_1, x_2, ..., x_n) is the minimum of Quadratic(x), Equation 4.14, and Equation 4.15.

4. We set the value of P(x_1, x_2, ..., x_n), which records how the minimum was achieved, based on the value of M(x_1, x_2, ..., x_n):

    P(x_1, x_2, ..., x_n) = q    if M(x_1, x_2, ..., x_n) = Quadratic(x)
                            i    if M(x_1, x_2, ..., x_n) = Equation 4.14
                            g    if M(x_1, x_2, ..., x_n) = Equation 4.15

Algorithm 4.2.4 calls a function Quadratic(x = (x_1, x_2, ..., x_n)). This is the function that performs the optimization of the quadratic function.

Function 4.2.1. Quadratic(x = (x_1, x_2, ..., x_n))

Input: a data region x with two non-leaf nodes. /* Suppose the two non-leaf nodes are x_k and x_l. */

1. Build a |Leaves(x_k)| × |Leaves(x_l)| matrix M based on the data region x = (x_1, x_2, ..., x_n); if a data cell is blue, set the corresponding element in M to 1, otherwise to 0;

2. Based on this matrix, compute all the arguments needed for the Matlab function quadprog;

3. Call the Matlab function quadprog to get the optimal solution;

4. Return the optimal result.

The Matlab function quadprog reaches a global optimum if the quadratic function is convex [15]. The function generated from the data region is not a convex quadratic function, so quadprog can only find one local solution in polynomial time.

4.3 Experimental Results

In order to evaluate the three heuristic methods for summarizing hierarchical data with exceptions, we use the hierarchy generator from [19] to create the tree hierarchical structures, which mimic the structures used in data warehousing benchmarks.

4.3.1 MDL vs MDL with Exceptions

Figure 4.5 shows the experimental results of the three heuristics and the MDL-Tree algorithm [11]. The test data set is a 2-dimensional hierarchical data cube; in each dimension the fanout of the tree hierarchical structure is 1-4-25, which means the hierarchy has one root node with 4 children, and each child of the root has 25 children which are leaf nodes. So the whole data size is 10^4. The x-axis is the number of cells randomly chosen to be blue; the y-axis is the length of the description representing those blue cells.

[Figure 4.5: MDL vs. MDL with exceptions]

From this figure we can see that MDL with exceptions generates shorter descriptions than the MDL algorithm.
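The gain MDLE shows over MDL in Figure 4.5 comes from the choice formalized in Equations 4.14 and 4.15: at every node, either sum the children's solutions or cover the whole region and pay one unit per non-blue hole. Restricted to one dimension, this becomes a simple bottom-up recurrence; a minimal sketch (the tree encoding, nested lists of leaf colors, and the function name are our assumptions, not the thesis's notation):

```python
def best_length(node):
    """Shortest MDLE description length for the blue leaves under `node`.

    `node` is either a leaf color string ('blue' or 'white') or a list of
    child nodes.  Returns (length, non_blue_count) for the region.
    """
    if isinstance(node, str):                      # a leaf cell
        return (1 if node == 'blue' else 0), (0 if node == 'blue' else 1)
    child = [best_length(c) for c in node]
    local = sum(l for l, _ in child)               # Equation 4.14: sum over children
    non_blue = sum(nb for _, nb in child)
    global_ = 1 + non_blue                         # Equation 4.15: one region + one hole per non-blue cell
    return min(local, global_), non_blue

# A root with two 4-leaf children: one all blue, one with a single white cell.
tree = [['blue'] * 4, ['blue', 'blue', 'blue', 'white']]
print(best_length(tree))  # (2, 1): cover the whole root region with one hole
```

Here the whole root region with one hole (length 2) beats summing the children's solutions (1 + 2 = 3), mirroring step 3(c)'s minimum without the quadratic term.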
The smallest region in this structure contains 25 nodes (each parent of leaf nodes has 25 children), so the MDL algorithm needs all 25 cells of a region to be blue in order to summarize them into one region. This is the reason that MDL cannot get any gain when there are fewer than 6400 blue cells. The three heuristics all have better results than MDL. The peak area, where the heuristics generate the longest descriptions, is around 5000-5500 blue cells. When the number of blue cells is less than 5000, the length of the description increases with the number of blue cells; when the number of blue cells is greater than 5500, the length decreases. And the distance between MDL and the heuristics at a given x value grows with the number of blue cells, that is, the more blue cells, the more gain we can get.

Amongst the three heuristics, quadratic programming always has better results than the others. When there are fewer than 6000 blue cells, greedy is a little better than dynamic programming; when there are more than 6000, dynamic programming beats greedy. The reason is that quadratic programming checks the combinations between rows and columns, while dynamic programming only checks the sums of rows or columns without the combinations. Greedy picks the region with the most benefit each time, but loses the benefit of the tree structure, which is considered well by quadratic programming and dynamic programming. Therefore, when the blue ratio is high, greedy is the worst of the three.

Running Times of the Heuristics

The running time of each heuristic is shown in the following table:

    Heuristic              | Running Time (secs)
    Greedy                 | 4
    Dynamic Programming    | 5
    Quadratic Programming  | 80

From this table we know that the greedy heuristic is the fastest of the three. At the same time, the descriptions generated by the greedy heuristic are just a little longer than the descriptions generated by the other two heuristics.
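The greedy heuristic's core loop, picking the region with the most benefit each round and listing leftover blue cells individually, can be sketched in a much-simplified form. The sketch below works on a flat 2-level grid with whole rows as the only candidate regions and our own benefit bookkeeping; it is an illustration of the greedy idea, not the thesis's exact algorithm:

```python
def greedy_describe(grid):
    """Greedy MDLE sketch on a 2-D 0/1 grid (1 = blue).

    Candidate regions are whole rows; covering a row costs 1 plus one hole
    per non-blue cell, versus listing its blue cells individually.
    Returns the total description length.
    """
    covered = [[False] * len(r) for r in grid]
    length = 0
    while True:
        # Net benefit of a row = uncovered blue cells saved
        #                        - 1 (the region) - holes for non-blue cells.
        best_row, best_gain = None, 0
        for i, row in enumerate(grid):
            blue = sum(1 for j, v in enumerate(row) if v and not covered[i][j])
            holes = sum(1 for v in row if not v)
            gain = blue - 1 - holes
            if gain > best_gain:
                best_row, best_gain = i, gain
        if best_row is None:          # no region with positive benefit left
            break
        length += 1 + sum(1 for v in grid[best_row] if not v)  # region + holes
        for j in range(len(grid[best_row])):
            covered[best_row][j] = True
    # Leftover blue cells are listed individually.
    length += sum(1 for i, row in enumerate(grid)
                  for j, v in enumerate(row) if v and not covered[i][j])
    return length

# One fully blue row (covered as a region) plus one stray blue cell.
print(greedy_describe([[1, 1, 1, 1], [0, 1, 0, 0]]))  # 2
```

The first row is covered as one region (length 1, no holes) and the stray blue cell in the second row is listed on its own, for a total length of 2.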
In the following experiments we only compare the greedy heuristic with the GMDL-Tree algorithm.

4.3.2 GMDL vs MDL with Exceptions

Figure 4.6 is the comparison between GMDL [11] and MDL with exceptions. In the GMDL-Tree algorithm, the white ratio determines how many white cells are permitted to be covered. In this figure we only compare GMDL with the greedy method for MDL with exceptions. From the figure we can see that the greedy method generates longer descriptions at the beginning, but it beats GMDL as the number of blue cells increases. When there are 6400 blue cells, the greedy method works better than GMDL when H contains 15% white cells, but is still not as good as GMDL when H contains 20% white cells.

From the definitions of GMDL and MDL with exceptions, we can see that a GMDL description is not pure for blue cells because it covers lots of white cells. MDL with exceptions covers some rows and columns containing non-blue cells, but those non-blue cells are excluded in the exception part of the description. So MDL with exceptions generates pure descriptions of the blue cells.

On the other hand, as shown in Figure 4.7, when H contains some red cells, GMDL works much worse than MDL with exceptions. Figure 4.7 compares MDLE with GMDL when H contains some red cells. In this experiment we set the red cells to 5% of the total cells, i.e., there are 500 red cells overall. When there are more than 6000 blue cells, MDL with exceptions works better than GMDL even if H contains 40% white cells, which is really high for real applications. If there are some red cells, MDL with exceptions is better than GMDL.

[Figure 4.6: GMDL vs. MDL with exceptions]

MDL with exceptions generates pure descriptions with much shorter length.

4.3.3 Comparison in the 3-Dimensional Case

So far we have only shown experiments on a 2-dimensional hierarchical data cube.
Now let us discuss the experiments on a 3-dimensional hierarchical data cube. Figure 4.8 shows the results. The tree hierarchical structure is 1-4-4-6, i.e., a 4-level tree: the hierarchy has one root node with 4 children, each of which has 4 children in turn, so the root has 16 grandchildren. Each of those grandchildren has 6 children which are leaf nodes. So for each dimension there are 96 leaves, and the data space is around 10^6.

[Figure 4.7: GMDL with red cells]

From the figure we can see that MDL also gets some benefit, because the smallest region contains only 6 nodes, which makes it easy for MDL to summarize a region. The results on the 3-dimensional hierarchical data cube are consistent with those on the 2-dimensional one. Again, quadratic programming generates the shortest descriptions amongst the heuristics. Quadratic programming and dynamic programming are better than GMDL with a white ratio of 10%.

Comparison of Running Times

The running times of the MDL-Tree algorithm, the GMDL-Tree algorithm, and the heuristics for MDLE are shown in the following table (in seconds):

[Figure 4.8: Experiments of 3-Dimensional Hierarchy (fanout 1-4-4-6)]

    Algorithm              | 2-dimensional (fanout 1-4-25) | 3-dimensional (fanout 1-4-4-6)
    MDL-Tree               | 4                             | 12
    GMDL-Tree              | 4                             | 35
    Greedy Heuristic       | 4                             | 40
    Dynamic Programming    | 5                             | 70
    Quadratic Programming  | 80                            | 30000

From this table we know that the greedy heuristic shortens the descriptions without taking much more time than the GMDL-Tree algorithm. Dynamic programming is a little slower, but still fast enough. Quadratic programming is the slowest.
The reason is that we need to call the Matlab function quadprog to find an optimal value of the quadratic function, which is the most expensive part of the quadratic programming heuristic.

4.3.4 Hole Ratio

The last figure shows the ratio of holes in the descriptions generated by MDL with exceptions. Figure 4.9 presents the results. We can see from the figure that, with an increasing number of blue cells, the description includes more and more holes.

[Figure 4.9: Hole Ratio in the Descriptions]

Chapter 5

Conclusions and Future Work

5.1 Summary and Conclusion

This thesis has studied the problem of data summarization with exceptions. The MDL-Tree algorithm can find a concise description for a pure hierarchical data set in polynomial time, but in some cases MDL-Tree cannot gain much benefit and the description is still too long. The GMDL-Tree algorithm can generate much shorter descriptions, but they contain lots of white cells, which makes the descriptions impure. In this thesis we define a new method that generates a pure description which is much shorter than MDL's and, in many cases, even shorter than GMDL's. We name this new method MDL with exceptions (MDLE); a description generated by this approach has the form of the union of a set of blue cells and a set of data regions, with the exception of the non-blue cells in these regions. We call these included non-blue cells holes.

In this thesis we proved that MDL with exceptions is an NP-Hard problem. We built a complete weighted bipartite graph that only has weights on edges (a CEW bipartite graph). First we proved that it is NP-Hard to find a subset of vertices which induces a subgraph maximizing the sum of the weight of this subgraph and the cardinality of the vertex subset. Then we showed a reduction from the CEW problem to MDL with exceptions.
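The CEW objective just described, choosing vertex subsets A and B on the two sides of a complete edge-weighted bipartite graph to maximize the induced edge weight plus the number of chosen vertices, can be checked exhaustively on tiny instances, which is useful for sanity-testing heuristics. A sketch with a made-up 2 × 2 weight matrix (the weights and function names are our assumptions):

```python
from itertools import chain, combinations

def subsets(n):
    """All subsets of range(n), as tuples."""
    return chain.from_iterable(combinations(range(n), k) for k in range(n + 1))

def cew_opt(W):
    """Brute-force CEW: maximize induced edge weight plus vertex cardinality."""
    rows, cols = len(W), len(W[0])
    return max(
        sum(W[i][j] for i in A for j in B) + len(A) + len(B)
        for A in subsets(rows) for B in subsets(cols)
    )

# Hypothetical weights; the optimum picks both vertices on each side:
# induced weight 1 - 1 - 1 + 1 = 0, cardinality 4, objective 4.
W = [[1, -1], [-1, 1]]
print(cew_opt(W))  # 4
```

The exponential enumeration is only viable for toy instances; the NP-Hardness result above is precisely why the thesis resorts to heuristics for realistic sizes.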
Knowing that the general MDL with exceptions problem is NP-Hard, we also studied some tractable cases. One of these is the 1-dimensional hierarchical data cube; for this case we proposed a bottom-up algorithm to build the optimal description. The other is the 2-dimensional 2-level hierarchical data cube without shared holes; for this tractable case we proposed a matrix-filling algorithm, and we also explained why MDLE remains NP-Hard for multidimensional or multilevel hierarchical data cubes.

We also proposed three heuristics for MDL with exceptions and compared them with the MDL-Tree and GMDL-Tree algorithms. Amongst the three heuristics, quadratic programming has the best quality and always generates the shortest descriptions. The greedy method is the fastest, and generates descriptions that are only a little longer. When the blue ratio is low, dynamic programming is not as good as greedy, but when the blue ratio is high, dynamic programming generates much shorter descriptions than greedy. The descriptions generated by these heuristics are pure, which means a description only covers blue cells. But with an increasing number of blue cells, the description includes more and more holes, which may confuse users.

5.2 Future Work

5.2.1 Summarization of Holes

From Figure 4.9 we already know that when there are more than 6000 blue cells, the hole ratio is higher than 60%. This means that if the length of the description is 3000, then more than 1800 holes are included in the description. If the blue ratio is very high, the descriptions contain too many holes. If we could do some summarization on those holes, we could get much shorter descriptions. For the simple example in Figure 5.1, the description generated by MDL with exceptions is:

    (5,a) + (5,b) + (5,c) + (5,d) + (1,g) − {(1,a) + (1,b) + (1,c) + (1,d)}

The length is 9. In this description there are 4 holes.
If we summarize the holes, we can get another expression:

    (5,a) + (5,b) + (5,c) + (5,d) + (1,e) + (1,f) − {(1,g) − (1,e) − (1,f)}
        = (5,a) + (5,b) + (5,c) + (5,d) − (1,g) + (1,e) + (1,f)

The length of this description is 7.

[Figure 5.1: Example of Summarization on Holes]

If we study this data region more closely, we can find an even shorter description:

    (5,5) − (1,5) − (5,e) − (5,f) + (1,e) + (1,f)

The length of this description is only 6. From this example we can see that we can generate shorter descriptions and gain more benefit by doing some summarization on holes. This problem is surely also a hard problem, but we can try to propose some heuristics for it.

5.2.2 Representation of Holes

The descriptions generated by MDLE contain holes as exceptions. We define the length of each hole as 1, the same as the length of a blue cell or a region. But from the user's point of view, they are different. For example, consider two descriptions S1 and S2, where S1 contains 20 blue cells/regions and S2 contains 10 blue cells/regions and 5 holes. The length of S1 is 20; the length of S2 is 15. Even though S2 is shorter than S1, users may still prefer S1. Why? Because it is much more direct, easy, and natural to read a description with only blue cells/regions. If a description contains holes, the user may need to know where these holes come from and must understand the tree hierarchical structure more clearly to interpret the description. So it is natural to assign different lengths to a hole and a blue cell: we could assign a hole a length of 1.5 or 2, which amounts to adding weights to cells. Another method is to limit the hole ratio in a description for easy understanding.

5.2.3 Approximation Rate

So far we have proposed three heuristics: Greedy, Dynamic Programming, and Quadratic Programming.
But we did not give an approximation rate for any of these methods. Further work should be done to find good approximation algorithms with suitable approximation rates. Because the graph model we used (CEW) has some 2-approximation algorithms, those methods are a good starting point.

Bibliography

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD, pages 94-105, May 1998.

[2] John C. G. Boot. Quadratic Programming: Algorithms, Anomalies, Applications. North-Holland Publishing Company, 1964.

[3] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1):65-74, 1997.

[4] Joseph C. Culberson and Robert A. Reckhow. Covering polygons is hard. Journal of Algorithms, 17:2-44, 1994.

[5] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 1999.

[6] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1):29-53, 1997.

[7] Dorit S. Hochbaum. Approximating clique and biclique problems. Journal of Algorithms, 29(1):174-220, 1998.

[8] J. Rissanen. Modelling by shortest data description. Automatica, 14:465-471, 1978.

[9] K. G. Murty. Linear Complementarity, Linear and Nonlinear Programming. Heldermann Verlag, 1988.

[10] V. S. Anil Kumar and H. Ramesh. Covering rectilinear polygons with axis-parallel rectangles. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing (STOC-99), pages 445-454, May 1999.

[11] Laks V. S. Lakshmanan, Raymond T. Ng, Christine Xing Wang, and Xiaodong Zhou. The generalized MDL approach for summarization. In Proceedings of the 28th VLDB Conference, pages 766-777, 2002.

[12] M. Dawande, P. Keskinocak, and S. Tayur. On the biclique problem in bipartite graphs. GSIA working paper 1996-04, Carnegie Mellon University, 1997.

[13] Alberto O. Mendelzon and Ken Q. Pu. Concise descriptions of subsets of structured sets. In ACM PODS, pages 123-133, June 2003.

[14] Raymond T. Ng and Christine Xing Wang. Summarization using the generalized minimum description length. Technical report, Computer Science Department, University of British Columbia, 2000.

[15] Optimization Toolbox: quadprog. http://www.mathworks.com/access/helpdesk/help/toolbox/optim/quadprog.html.

[16] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[17] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, 1990.

[18] W. H. Inmon. Building the Data Warehouse. QED Technical Publishing Group, 1992.

[19] XiaoDong Zhou. Generalized MDL approach for data summarization. University of British Columbia, Canada, 2002.