Generalized M D L Approach for Data Summarization by Xiaodong Zhou M.Sc , Tsinghua University, China, 1999 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The University of Brit ish Columbia December 2002 © Xiaodong Zhou, 2002 In p r e s e n t i n g t h i s t h e s i s i n p a r t i a l f u l f i l l m e n t of the requirements f o r an advanced degree at the U n i v e r s i t y of B r i t i s h Columbia, I agree that the L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e f o r reference and study. I f u r t h e r agree that permission f o r extensive copying of t h i s t h e s i s f o r s c h o l a r l y purposes may be granted by the head of my department or by h i s or her r e p r e s e n t a t i v e s . I t i s understood that copying or p u b l i c a t i o n of t h i s t h e s i s f o r f i n a n c i a l g a i n s h a l l not be allowed without my w r i t t e n permission. Department of The U n i v e r s i t y of B r i t i s h Columbia Vancouver, Canada Date Abstract There are many applications that identify some data of interest. Usually, such applications just return a set of records that satisfy the criteria applied. However, such results cannot provide enough information for the user. A concise description is more preferable than the individual data. Minimum Description Length (MDL) is a well-known approach to handle such problems. In this thesis, we extend the M D L principle to the Generalized M D L (GMDL) principle by including some "do not care" data. We apply the M D L and G M D L principles to solve the problem of data summarization both in the spatial case and in the hierarchical case. For the spatial case, we improve one current top-down algorithm for high-dimensional data. We also study the G M D L problem for the hierarchical case and find that there exists a unique, non-redundant, and blue-maximal M D L covering. We propose MDL-Tree and GMDL-Tree algorithms to find M D L covering and G M D L covering respectively in the hierarchical case. The experimental results show that G M D L coverings have a much shorter description than M D L covering in the hierarchical case. ii Contents Abstract ii Contents iii List of Tables vi List of Figures vii Acknowledgments ix Dedication x 1 Introduction 1 1.1 Data Warehousing and OLAP 1 1.2 Problem Definition 3 1.2.1 Problem Definition for the Spatial Case 5 1.2.2 Problem Definition for the Hierarchical Case 9 1.3 Contributions 12 1.4 Outlines of Thesis 13 2 Background and Related Work 15 iii 2.1 Iceberg Queries 15 2.1.1 Current Research on Iceberg Queries 16 2.1.2 Limitations 16 2.2 Minimum Description Length 17 2.3 GMDL Algorithms in the Spatial Case 18 2.3.1 Bottom-up Pairwise Algorithm 18 2.3.2 R-Tree Splitting Algorithm 19 2.3.3 CAS Algorithm 19 3 Algorithm CAS-Corner 21 3.1 Algorithm CAS 21 3.1.1 Idea of CAS 21 3.1.2 Algorithm CAS 22 3.1.3 Case Study for CAS Algorithm 23 3.2 Algorithm CAS-Corner 24 3.2.1 Limitations of CAS Algorithm 24 3.2.2 Algorithm CAS-Corner 26 3.3 Experiment 28 4 G M D L in the Hierarchical Case 31 4.1 MDL Problem in the Hierarchical Case 31 4.1.1 Spatial vs. Hierarchical 31 4.1.2 Uniqueness in the Hierarchical Case 35 4.1.3 MDL-Tree Algorithm 37 4.1.4 Case Study of MDL-Tree Algorithm 39 4.2 GMDL Problem in the Hierarchical Case 43 iv 4.3 Experiments 48 4.3.1 M D L vs. G M D L 48 4.3.2 Varying Dimensionality . 50 5 Conclusions and Future Work 51 5.1 Summary 51 5.2 Future Work 52 5.2.1 Parallelism 52 5.2.2 Spatial and Hierarchical 53 Bibliography 55 Appendix A Proof of Theorem 4.1 57 v List of Tables 1.1 Transactional Operations v.s. Analytical Operations 1 1.2 An Example of MDL Problem in the Spatial Case 5 4.1 An Example of occurrence, maximal gain, and cost 46 4.2 occurrence, maximal gain, and cost for All Candidates 47 4.3 GMDL with Different Dimensions 50 vi List of Figures 1.1 A Simple Example of Hierarchy 3 1.2 Example for the Spatial Case 6 1.3 Spatial Case for GMDL 9 1.4 Example for the Hierarchical Case 10 3.1 Splitting in CAS 24 3.2 Comparison of CAS in 2D and 3D 25 3.3 Splitting Strategy of CAS 26 3.4 Splitting Strategy of CAS-Corner 27 3.5 CAS-Corner vs. CAS w.r.t Compression Gain in 3D 29 3.6 CAS-Corner vs. CAS w.r.t Time in 3D 30 4.1 Example of Non-redundancy 32 4.2 Blue-maximal in the Spatial Case 33 4.3 Clustering in the Spatial Case and the Hierarchical case 35 4.4 Projection in the Hierarchical Case 36 4.5 Example of MDL-Tree Algorithm 40 4.6 MDL-Tree: Propagate in All Dimensions 41 4.7 MDL-Tree: Redundancy Check 42 vii 4.8 MDL-Tree: MDL Covering 43 4.9 Compression Gain vs. White Budget 49 A . l Uniqueness of Optimal Covering 58 viii Acknowledgments I am deeply indebted with my supervisor Dr. Raymond T . Ng, who not only gave me excellent advice but also showed me how to be a researcher. It was his guidance during the course of my work that made this thesis a possibility. I would like to thank Dr. Laks V.S. Lakshmanan for the advice he gave me in the past year and for the time he spent on reviewing my work. His suggestions are very helpful and valuable. I would like to thank all the members in database lab, especially Edwin M . Knorr, Kai Sang Leung, Ruiyao Yang, and Mack Yuen for their support in my work and life. I would like to thank my parents, Tianyi Zhou and Jinhua Chen, for their love, instructions, and encouragement. Those are the most valuable assets I possess. I owe my gratitude to my sister, Xiaoyun Zhou, her success encouraged me a lot. A special thank-you goes to my wife, Xiaoyan Yan, for her great contribution in taking care of our daughter when I was pursuing my M.Sc. degree on the other side of Pacific Ocean. I could not accomplish my work without her help. Finally, I would like to thank Buddha, Master Chin Kung, and all the people who help me to overcome the difficulties and establish the goal of my life. XIAODONG ZHOU The University of British Columbia December 2002 ix To my parents, for their endless love and patience. Chapter 1 Introduction 1.1 Data Warehousing and O L A P Data Warehousing was introduced to facilitate analytical applications such as De-cision Support Systems, Enterprise Information Systems, and On-Line Analytical Processing (OLAP). Analytical applications are quite different from the transac-tional applications. Some major differences are shown in Table 1.1. Transactional Analytical detailed summarized up-to-date historical event-driven analysis-driven predictable operations ad-hoc operations Table 1.1: Transactional Operations v.s. Analytical Operations Besides the above differences, traditional database systems are designed for pre-defined transactional operations with acceptable performance. Such operations are called On-Line Transaction Processing(OLTP)[18], e.g., banking transactions. The performance, recoverability, and consistency are of great importance for OLTP. 1 In contrast, analytical operations involve both the current data and the historical data, which may contain millions of records in several tables or even databases. Such processes will degrade the overall system performance. Therefore, it is necessary to isolate the analytical operations from the operational ones. The widely accepted definition of Data Warehouse is given by W.H.Inmon: "Data Warehouse is a subject-oriented, integrated, time-variant and non-voilatile collection of data that is used primarily in organizational decision making process" [9]. Data Warehouse is subject-oriented in the sense that the data is organized under spe-cific subjects rather than being distributed across several tables or even databases to satisfy the normal forms requirements. Data Warehouse is also integrated because the data are integrated from different sources, though they may have had different names and types before. Data in data warehouse are associated with time, and this enables time-variant analysis. The data are also Non-volatile, meaning that they will seldom be updated or deleted. With such characteristics, data warehouses build a strong foundation for analytical applications. OLAP aims for the top-level analysis. It enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information. OLAP transforms raw data to reflect the real dimensionality of the enterprise as understood by the user[4]. There are four regular operations in OLAP: slice-and-dice, roll-up, drill-down, and pivot. Those operations consider the data in the multidimensional and hierarchical perspectives. For example, for sales analysis, we need to analyze the data with regard to factors such as quantity of sales, regions, time, and type of products. Such analyses are multidimensional in essence. Moreover, they also can be hierarchical in some dimen-sions, e.g., year-quarter-month-week and country-province-city[3]. Figure 1.1 shows 2 an example of hierarchy. Vancouver Victoria Toronto Ottawa Windsor Montreal Figure 1.1: A Simple Example of Hierarchy In addition to the typical operations as listed above, there are many ap-plications in OLAP that summarize the data to increase understanding or reveal information unknown. One common need is to summarize the data to get more general information rather than individual data. This is what this thesis is about. 1.2 Problem Definition Before defining the problem formally, it is beneficial to describe it through a simple example and introduce some notions that will be used in this thesis. As stated in Section 1.1, one common application in OLAP is data summarization. For example, a sales manager may ask such questions as "which city did sell more than 1,000 T V sets per month, and what was the total sales?". We need the following query on the sales table. SELECT City, Province, Area, Country, SalesPerMonth FROM Sales WHERE QuantityPerMonth >1000 3 The result is a set of records, e.g., "Vancouver, British Columbia, West, Canada, $180,000". Each attributes in the result relation will serve as a dimen-sion, e.g., City and SalesPerMonth. All of the dimensions form a multidimensional dataset. Each tuple in the result relation corresponds to a cell in the dataset. Quan-tityPerMonth, the attribute in the WHERE clause, is considered as a property, say p for all the cells in the dataset. In effect, the problem is how to describe the subset of D, which satisfy the condition of p < c. Rather than enumerating all the satisfactory data individually, the user will prefer a concise summary covering all the interesting data and excluding uninteresting ones. Such a summary amounts to a non-redundant covering of inter-esting cells using maximal axis-parallel hyper-rectangular regionsfl]. The Minimum Description Length (MDL) is one of the most widely used approaches to provide such descriptions, where the description language is restricted to hyper-rectangular regions. Consequently, finding the MDL summary is equivalent to finding a mini-mum rectangular covering. [11] Since our mission is to describe the data where the constraint on property p holds, the nature of the dataset will greatly affect our description. Generally speaking, there are two kinds of data: numerical and categorical. Numerical data are those can be expressed as numbers, e.g., age, salary, distance. Categorical data are those whose values are strings, e.g., the name of a city and the name of a product. Numerical data are ordered numerically. However, categorical data are always organized by a hierarchy, e.g., country-province-city as shown in Figure 1.1. The following two examples, one spatial and one hierarchical, will show us the differences and lead to definitions for both cases. This thesis does not consider the hybrid case, whose dimensions are partially numerical and partially categorical. 4 1.2.1 Problem Definition for the Spatial Case Let us illustrate the problem in the spatial case through the following example. Example 1.1 Suppose an insurance company wants to change their plan strategy. They need to describe the composition of their customers based on their age and salary before they make any decisions. They need to do some research on the distribution of their current customers over age and salary. Table 1.2 shows the number of customers with respect to age and salary. o 00 0 0 2 10 15 16 12 10 8 3 o I— 0 0 3 12 18 25 36 43 31 11 >n »o 0 15 22 55 63 52 46 42 34 13 o 12 24 34 59 69 50 42 46 30 It >o m 16 35 36 51 59 48 46 51 52 40 o >r> 23 30 36 35 38 32 41 50 45 39 in 21 26 28 34 36 38 42 32 31 24 o •<t 26 23 28 36 32 29 21 19 23 16 o 28 16 18 26 19 18 12 15 12 11 o 32 36 19 10 6 4 3 4 8 16 15 20 25 30 35 40 45 Age(years) 50 55 60 Table 1.2: An Example of MDL Problem in the Spatial Case Due to the large number of customers, the manager categorizes customers into 3 directories, highly interesting(F >= 50), uninteresting(ir <= 5) and do not care (50>F>5). He only wants to describe the highly interesting part, and exclude the uninteresting ones. Thus, the highly interesting ones are those we have to describe, the uninteresting ones are those will not be included in our description, and the "do not care" are those may be included in some situation. We use three 5 colors: blue, red and white to represent them respectively. Therefore, the problem of M D L can be transformed as follows - find a minimum number of rectangles to cover all blue cells and only blue cells. Figure 1.2 is the corresponding figure to Table 1.2. Blue cells and red cells are represented by "o" and "x" respectively and all the others are white. From Figure 1.2, it is obvious that rectangles R1,R2,R3,R4 cover all the blue cells. Rl represents the group of customers with a salary ranging from $55,000 to $70,000 and an age from 30 to 40, and each unit in it has at least 50 customers. Thus, Rl is a highly interesting group. As a result, we only need 4 rectangles to describe 12 cells. o oo X X X o r » X X - - . . . \ R2 i n o X ! o o b ! R3 8 ! o o IO j ' JD 1 \ i n i n 1 o "0" 1 V O b i t o Rt U " i i n i i n R4 — > o o o t s X X X 15 20 25 30 35 40 45 50 55 60 Age(years) Figure 1.2: Example for the Spatial Case With the Example 1.1, it is obvious that M D L can provide a concrete de-scription to the multidimensional data with a good compression rate. In order to arrive at definitions for the spatial case, we state the following concepts first. Let D be a AT-dimensional numerical dataset, i.e., D has N numerical at-tributes. di,c?2,...,djv are the N ranges of for the N dimensions of D respec-tively. Each range di(i=l,2,...,N) is bounded and totally ordered. We will use 6 di(i=l,2,...,N) to refer the dimension itself in the following parts. Now let us partition each dimension d{ (i=l,2,..,N) by interval ui, i=(l,2,...,N). The range dj is divided into Ki non-overlapping units. Then, we divide dataset D into a set of non-overlapping axis-parallel rectangular cells. Each cell is an atomic unit which can be represented as a tuple (x\, X2,x^). Xi(i = 1,2, ...,N) is one part from dj. Every cell gets a value of the property p. The property p is one attribute that all the data in D have. Each cell is associated with only one color, either blue, red or white, and each color is associated with property p. We use B as the set of blue cells, R as the set of red cells and W as the set of white cells. A region R in dataset D is a iV-dimensional axis-parallel hyper-rectangle. It is composed of a set of cells. A covering is a set of regions containing the cells to describe. The cardinality of a covering is the number of regions in it. A MDL (Minimum Description Length) covering means that it covers all and only the necessary cells and the cardinality is minimized. With the above notions, we can state our problem as follows: Definition 1.1 Given a multidimensional numerical dataset D with dimensions di (i—l,2,...,m),a property p and a threshold c. For each cell in D, check whether condition p>c holds or not. If true, mark it as blue. Try to find a MDL covering G to cover all blue cells and only blue cells. The above statement can be restated as the following conditions: • D = BURUW • BHRD W= 0 7 • GDB= B • | G\ is minimal By the definition above, we can see that the M D L coverings only covers blue cells, no red cells and no white cells. This makes the M D L covering an exact description of the blue cells. However, in many cases, such a restriction is not always necessary and can be compromised. In our example, the manager do not mind to include some white cells in the covering. That is to say, we can tolerate, to some extent, some "impurities" in the blue M D L covering, if it can provide a shorter description. Thus, the cardinality can be reduced even more. As shown in Figure 1.3(a), if we include the white cell (age=40, salary=$55k) into the blue coverings, we only need 3 regctangles R1,R2,R3 for the M D L covering. Moreover, in Figure 1.3(b), if we include another 2 white cells: (age=55,salary=$50k) and (age=55,salary=$60k), we only need 2 rectangles, Rl and R4 in the M D L covering. With such impurities, the cardinality of M D L covering can be greatly decreased. This is a trade-off between accuracy and regularity. The above analysis introduces the idea of Generalized M D L ( G M D L ) description. Let us define white budget as the number of white cells that we can tolerate in the GMDL covering. We can state the G M D L problem in the spatial case as follows . Definition 1.2 Given a multidimensional numerical dataset D with dimensions di (i=l,2,...,m), a property p, thresholds c l , c2 and white budget OJ. For each cell in the lattice, check whether condition p>cl holds or not. If it holds, mark this cell as blue, otherwise check whether condition p<c2 holds or not. If this condition holds, mark it as red, otherwise mark it as white. The mission now is to find a Generalized 8 O GO X X X o oo X X X o p~ X X R l o 1^ X X R l in o X j O O O [ . R2 in X o o o 1 - - - . . . •a a s ! O O o 1 l T . C L T - -a c o VO o o o \\ o 0 a o m IT) ; O O o 111 o b I ca w 3 Q in m o o o ;; o o • .c ialary(thi o ~% o in 1 "O" 7 ~ R3 ialary(thi o in o ! a cd in v — ialary(thi Tt ialary(thi •^ t R4 o -<t o o m o m o X X X O CN X X X 15 20 25 30 35 40 45 50 55 60 15 20 25 30 35 40 45 50 55 60 Age(years) Age(years) f a ) ( b i Figure 1.3: Spatial Case for G M D L MDL covering G, which covers all the blue cells, no red cells, and the number of white cells being covered does not exceed u. The above statement can be restated as the following conditions: • D = BURL) W • BnRn w= 0 • GDB = B • G n i i = 0 • | G n w] < w • \G\ is minimal 1.2.2 Problem Definition for the Hierarchical Case If the dataset D is composed of categorical data organized by hierarchies in all dimensions, the M D L and G M D L region finding problems are quite different from 9 that of the spatial case. We can see the differences by Example 1.2. Example 1.2 In the Sales Department, the manager needs a report on the so-called "best-selling" clothes in differenct location. For example, they define the blue as those quater sales greater than $10,000, the red as those quater sales less than $1,000 and white as those quater sales between $10,000 and $1,000. Figure 1.4 shows the data in the hierarchical structure. Similar to Example 1.1, "o" stands for blue, "x" for red and the others are white. S j-2 <si JI , i 8 I • " O D- O O -a C 8 •§ I I San Jose San Francisco Minneapolis Chicago oston Summit lbany ew York 35 . Q . Q . Q : Q | Rl p! R2 R3 i°: p! R4 o 1 o1 F5 0. D . £>J D . D . D . pl bL R6 'o! R7 b! 'o! R8 O 0 o O X 1 0 0 0 Oi R9 1 iO 0 0 Oi 1 o. £)_ D_ Figure 1.4: Example for the Hierarchical Case From Figure 1.4, we can see that each dimension contains a hierarchy and only consists of independent categorical values, e.g., "shirt", "tie", "new york", etc. And all values are leaves of the hierarchical tree in this dimension. 10 Thus, one cell can be represented by the combination of leaves in each di-mension, e.g., jackets in northwest. And one region can be represented by the combination of nodes in each dimension, including leaf nodes, internal nodes and the root node. In Figure 1.4, for example, we can say that men's clothes sell well in northwest. This corresponds to region R9. However, we can not group those blue cells freely as in the spatial case, even if they are adjacent to each other. For example, cell (jackets, Minneapolis) and (jackets, San Francisco) are adjacent to each other, but they belong to different categories. So we have to group them into two different regions first, then see whether it is still possible to group the two new regions together. In Example 1.2, we use 9 rectangles (regions) R1,R2,R3,...,R9 to summarize blue cells. Thus we can state the problem as follows. Let us state the concepts for the hierarchical case that are different from those that have already been defined in Section 1.2.1 . Let D be a iV-dimensional categorial dataset. d\,cfe, are the N cate-gorical dimensions of D respectively. Each dimension di(i=l,2,...,N) has Ki values. In each dimension di(i—l,2,...,N), there is a hierarchy T; with Ki leaves. Each leaf contains one value from Ki values in dj. Ni is the total number of nodes, including leaf nodes, internal nodes and root node of Tj. A cell is a combination of one leaf node in each dimension. It can be repre-sented as (x\,X2,...,Xfj), Xi(i = 1,2, ...,N) is one of the leaf nodes in dimension dj. Each cell is associated with a color in the same way as in the spatial case. A region R in dataset D is a combination of one node in each dimension, including leaf, internal and root nodes. R can be represented as (yi,yi, •••,ym), y% is one of the Ni nodes in hierarchy Tj. 11 The property, covering and MDL covering are the same as that of the spatial case. Given a N-dimensional categorical dataset D with hierarchy Ti in each di-mension, the definition of MDL and GMDL problem can be refered to Definition 1.1 and 1.2 respectively. 1.3 Contributions This thesis is strongly related to the report of Wang et al. on GMDL problems in the spatial case[15]. In this report, they developed one bottom-up algorithm, B P algorithm, and two top-down algorithms, R T S and C A S algorithms, for GMDL region finding in the spatial case. We will discuss them in detail in Chapter 2. Based on their work, one part of this thesis focuses on the improvement of the CAS algorithm. Another part of this thesis focuses on the MDL and GMDL regions finding in the hierarchical case. The major contributions are as follows: • Analyze the algorithm CAS, and make improvement for multidimensional case. The algorithm of CAS is a top-down algorithm for GMDL region searching. The cardinality, the number of regions in the GMDL covering, is acceptable in 2-dimensional cases. However, in 3-dimenional cases, it increases greatly. The analysis to the CAS algorithm shows that this is due to the splitting stratergy of CAS. Thus, we developed another algorithm CAS-Corner to improve it. The experimental results show that CAS-Corner algorithm generates less regions and takes less time than CAS algorithm on the same dataset. • We analyze the MDL and GMDL region finding problems in the hierarchical case. Most of the traditional research on MDL focuses on the spatial case. In 12 this thesis, we point out the importance of MDL and GMDL problems in the hierarchical case. Moreover, we elaborate on the nature of MDL and GMDL problems in the hierarchical cases and show their differences from the spatial one. • We find that there exists a unique, blue-maximal and non-redundant MDL-covering for categorical data with hierarchies in each dimension. In this thesis we propose an efficient algorithm, MDL-Tree, for MDL regions finding in the hierarchical case. To the best of our knowledge, this is the first MDL region finding problem for categorical data. • A key analytic result of this thesis is to show that MDL regions finding problem on data with tree hierarchies is not NP-hard, and can be solved in time linear in the size of the hierarchies. • We applied some heuristic methods to extend the MDL-Tree algorithm to GMDL-Tree algorithm to handle the GMDL region finding problem. The experimental results show that the GMDL approach can improve the com-pression by at least an order of magnitude. .4 Out l ines of Thesis • Chapter 2 - Background and Related Work. In this chapter, we talks about three major background of this thesis: ice-berg query, MDL principle and GMDL algorithms in the spatial case. • Chapter 3 - Algorithm CAS-Corner. In this chapter, we analyze the algo-rithm CAS in detail and propose a new algorithm CAS-Corner to improve the 13 performance for high dimensions. Chapter 4 - G M D L in the Hierarchical Case. In this chapter, we first analyze the nature of M D L and G M D L problems in the hierarchical case and show that there exists a unique, blue-maximal and non-redundant M D L covering. Then, we propose two algorithms, MDL-Tree and GMDL-Tree, to solve the M D L and G M D L problems respectively. Chapter 5 - Conclusions and Future Work. In this chapter, we summarize this thesis and propose some suggestions for future work. 14 Chapter 2 Background and Related Work 2.1 Iceberg Queries In many applications, e.g., O L A P , data mining and decision support systems, we need to retrieve the data that satisfy some criteria, especially some aggregation functions such as SUM, A V E R A G E , C O U N T and so on. For example, we need to find all the customers who bought more than 20 items at a time. The result of such queries is much smaller than the original dataset in most cases. This is just like the tip of an iceberg compared to the huge body of itself. We call such queries with aggregation functions iceberg queries. The iceberg query filters the large input data according to a threshold of aggregation functions over some attributes of the dataset. The number of records in its output relation is much smaller than that of the input relation in most cases. 15 2.1.1 Current Research on Iceberg Queries Due to the huge amount of the input data, computation of iceberg query may be time consuming. For example, Park et al. argue that the iceberg query in associa-tion rule mining takes most of the time in that procedure[16]. Thus, most research on this area focuses on how to compute the iceberg queries efficiently. Beyer et al. propose B U C , a bottom-up algorithm, for Iceberg-CUBEs and sparse CUBEs computation[2]. Fang et al. developed efficient algorithms for iceberg queries com-putation and avoided massive memory usage[5]. Ng et al. applied several parallel algorithms to the iceberg-cube computation by distributing the computation to P C clusters[14]. 2.1.2 Limitations It is true that the size of the result is smaller compared with the original dataset. However, in most cases, that size is still too big to understand for users. In other words, the tip of an iceberg is still too big to grasp for a human being. In addition, the iceberg queries can only provide the end users with a set of records that satisfy the applied criteria. Viewing the individual data can not provide enough information for analysts. From the point of view of the end user, a brief description is more useful. Thus, effective summarization on the result is necessary. This thesis focuses on the description of those data with property p holds. It can be viewed as the research on the post-procedure of the iceberg-like queries. 16 2.2 Min imum Description Length Summarizing the data where property p holds is to give a model that describes the required data. The model here is the unseen rules lying behind the raw data. Obviously, we want a concise model, say description, that provides a good fit to the given data. The principle of Minimum Description Length(MDL) is a well-known approach to find such descriptions. It was first proposed by Jorma Rissanen in 1978[10], well developed in the information theory[8], and has been applied to many research areas such as data compression[20], decision tree construction[13, 17], and image segmentation[7] and so on. According to the M D L principle, the best model of data is the one that gives the shortest descriptions[10] [8]. Given data D and model M, let L(M) denote the length of description of M , and L(D\M) denote the length of description of D under M. The sum of L(D\M) and L(M) can be used as an indicator of the length of description of D. Thus, the best description is the one with the smallest sum. In addition, in most cases, we search for some acceptable results instead of the minimum one. The reason lies in the computation difficulty in searching for the minimum result in a large dataspace[12]. This is a trade-off between the accuracy and complexity. In this thesis, "model" is the regions (rectangles) in the covering. L(M) is the number of rectangles used to describe D. L(D\M) corresponds to the number of red cells and white cells in the covering. It is constant for pure M D L covering, because it covers all blue cells and only blue cells. So the sum depends monotonously on the number of rectangles used to describe D. The G M D L problem introduces a trade-off between regularity and accuracy, corresponding to L(M) and L(D\M). We use whitebudget to limit the above trade-off. 17 2.3 G M D L Algorithms in the Spatial Case This thesis is highly related to Wang et al.'s work[15]. They introduced the idea of G M D L regions finding in the spatial case. We will discuss the three algorithms they proposed for the spatial case. 2.3.1 Bottom-up Pairwise Algorithm The M D L region finding problem is well studied in the spatial case. Then it is natu-ral to borrow some ideas from them. The Bottom-up Pairwise algorithm, referred as the BP algorithm is highly related to Agrawal et al.'s work on automatic subspace clustering, which is capable of dense subspace clustering in multidimensional datas-pace using axis-parallel hyper-rectangulars[l]. The BP algorithm borrows their idea of M D L region finding and develops it to solve the G M D L problem in the spatial casefll]. The following is the main ideas of BP algorithm. The MDL covering, which is a set of regions, can be viewed as a G M D L covering with white budget equals 0, i.e., no white cell is included in the G M D L covering. Thus, we can merge the regions in the MDL covering by incorporating only white cells within the white budget. We do not include the red cells in the covering. Therefore, how to merge MDL regions into GMDL regions is of great impor-tance. Two possible strategies to do this are as follows. One is best-fit. That is to say, given a region, we need to find the best counterpart to merge into a new GMDL region, with minimum consumption of white cell. The other is first-fit. That is to say, given a region, we only need to find the first counterpart that can form a new G M D L region with it. The BP algorithm uses the first-fit strategy to do the merging. It also pro-18 vides an optimization to the first-fit by sorting the M D L regions according to their positions. Thus, the nearby regions have a higher opportunities to be merged to-gether. Based on the experiments, the best-fit strategy only provides a little higher quality than first-fit, but it costs much more time than that oi first-fit. 2.3.2 R-Tree Splitting Algorithm The BP algorithm is not typically satisfactory for two main reasons. The first reason is that it incurs unnecessary cost by finding M D L first and then merging them together for G M D L covering. Another reason is that it takes much time when there are a lot of M D L regions and a large white budget to start with. These weaknesses motivate the Top-Down algorithms. The Top-Down algo-rithms start with a region R that covers all the blue cells in the dataspace. It can also include the red and the white. Then, we split region R into smaller regions so that some of them cover all the blue cells, no red cells and reduce the consumption of white cells at the same time. As a result, we end up with a set of GMDL regions covering all blue cells, no red cells and the consumption of white cells is within the white budget. It is obvious that the splitting strategy is the key issue. The above procedure can be viewed as a typical R-Tree node splitting procedure. The algorithm of RTS applied Garcia et.al.'s dynamic, polynomial time R-Tree node splitting algorithm to accomplish the splitting issue[6]. 2.3.3 C A S Algorithm One weakness of RTS algorithm lies in its "color blindness". The RTS algorithm does not take the color of a cell into consideration when it splits. This inspires 19 another algorithm: CAS. CAS stands for Color Awareness Splitting. According to the algorithm of CAS, we first strike out all red cells together with the surrounding white cells. Secondly, we remove the pure white regions until the consumption of white cells is within the white budget. This algorithm is proved to be better than RTS. We will discuss this algorithm in detail in Chapter 3 20 Chapter 3 Algorithm CAS-Corner In Chapter 2, we briefly discussed three algorithms that deal with the GMDL region finding problem in the spatial case. In this chapter, we analyze one of the spatial algorithms, CAS, in detail. We also improve it into algorithm CAS-Corner. The experimental results show that CAS-Corner has a better performance both in time and compression gain over CAS. 3.1 Algor i thm C A S 3.1.1 Idea of C A S As stated in Section 2.3.2, in Top-Down algorithms, we begin with a big region R covering all the blue cells, together with some red and white cells, and split it to get the GMDL covering. The splitting strategy is the key issue for Top-Down algorithms. CAS, which stands for Color Awareness Splitting, is a Top-Down algorithm and follows the above splitting strategy. Moreover, CAS applies a heuristic approach in splitting as follows. 21 First, CAS strikes out all of the red cells with surrounding white cells. Be-cause the red cells can never be included in the final GMDL covering, they do not contribute to the final GMDL coverings. Moreover, a pure white region has little contribution to the description and do consume much of the white budget. Thus, it is quite natural to remove all the red cells and pure white regions as much as possible. In CAS, if a regions R has one or more red cells inside, we pick a red cell, say Cl, and expand it into a hyper-rectangle by incorporating the surrounding red and white cells. The expansion stops in each dimension only when it reaches some blue cells in that dimension. Thus, we form the maximal region R* for Cl including only red and white cells. Then we strike out region R* from R and generate a set of new regions, say hyper-rectangles. As a result, we update the covering G by replacing R with a set of its subregions. We keep removing the red cells until we do not have any red cells in the covering. In this way, all red cells are removed, and we remove white cells as much as possible at the same time. Secondly, we strike out the pure white regions to make sure that the number of white cells in our GMDL covering is within the white budget. As in the first phase, we randomly pick a white cell from each region, expand the white cell into a pure white region and remove it from the original region. Then, we keep on splitting the regions with less and less usage of white cells until the consumption of white cells is within the white budget. Consequently, we can end up with a GMDL covering satisfying the white budget. 3.1.2 Algorithm CAS The pseudocode for the CAS algorithm is shown as follows. Algorithm C A S 22 1. Build indexes if,, Ir for blue and red cells respectively. 2. Initialize covering G with region Ro that covers all the blue cells. 3. Do the following steps until there are no red cells in the covering G. (a) Choose one region R from G. Pick a red cell Cr from R and form region R * by expanding the red cell Cr in each dimension until it hits some blue cells. (b) Split R into R*,Ri,R2,...,Rm. Remove R* from R. Update covering G by replacing R with Ri(i=l,2,...,m) in the covering G. 4. Do the following until the number of white cells in the covering G is less than the white budget. For region R in the covering G, randomly pick up a white cell Cw in R and expand it to a pure white region R'. Remove R' from R and generate a set of subregions Rj{j=l,2,...,m). Replace R with Rj(j=l,2,...,m) in the covering G. 3.1.3 Case Study for C A S Algorithm We can refer to Figure 3.1 for this procedure. Figure 3.1 shows a concrete example in 2-dimensional cases. The blue cells are represented by "o", red cells by "X", and all the other cells are white. The white budget is 20. In Figure 3.1(a), we first pick the circled red cell C\. Then we expand it to region R*. R* only contains red and white cells. After striking out R*, the original R is decomposed to 4 subregions:i?i, R2, R3, and R4. In this example, R\, R2, R^,Ri only include blue and white cells and the number of white cells included is 30, which 23 R l o o o o o o cr. o o o o o o o; ;R2 o; [R3 o oi i < Cl" . *> r 1 io o oi :o o; X :o o °i io oi i X i o oi 'Q. _ • o o ;R4 O o o o io o ( v. »o o o o o :o R l o o o o o o I o o o o o o o; [R2 o; ; R * ' ;R3 o o; oi < Cl" io o Oi ;o o; X • ;o o Oi io oi X i o oi ' , Q . _ j ^ _ _j o o i i o o 0 o R7 •>\ J i io R8 o) ;R5 < 52 up R6 N ' /' .R9 o o o o ll Ca) (b) Figure 3.1: Splitting in C A S exceeds the white budget. Thus we need to strike out pure white regions to satisfy the requirement of white budget. Let us begin with R4. Suppose we pick white cell C2 and expand it to a white region R5. Then we remove R$ from R4. Consequently, Ri is divided into 4 subregions RQ,R7,Rg,andRg as shown in Figure 3.1(b). Now the covering G contains the following regions: Ri,R2, R3, Re,R7, Rs, and RQ. In effect, the coveringG contains 27 white cells, and we have to keep on removing the pure white regions from Ri in G until the number of white cells in G is not greater than the white budget. 3.2 Algorithm CAS-Corner 3.2.1 Limitations of C A S Algorithm C A S works well when the dimension is 2. However, in higher dimensions, it generates too many regions in the G M D L covering, making the result unsatisfactory. Its 24 limitation in this aspect can be shown by the following experiment. First, we need to explain some concepts. The compression gain in this context is the number of blue cells divided by the number of regions found in the G M D L covering. The less regions generated in the G M D L covering, the higher the compression gain. The budget ratio refers to the ratio of white budget to the number of blue cells. In the following experiment, we fix the number of blue cells to 50k, blue density to 75%, and red density to 5%. In this context, blue density refers to the percentage of blue cells in a cluster, and similarly is the red density. They are explained in detail in Section 3.3. We vary the white budget in both 2-dimensional and 3-dimensional cases. In Figure 3.2, we can see that in 3-dimensional cases, C A S gets a much lower compression gain than in 2-dimensional cases. 250 -, 0 -| • = = ; , , 0% 20% 40% 80% 120% 160% Budget Ratio Figure 3.2: Comparison of C A S in 2D and 3D The reason of the drops in compression gain relates to the splitting strategy 25 of CAS. In CAS, when we remove the pure white regions, we randomly pick a white cell in the region and expand it. If this white cell is in the interior of R, i.e., it is not on the edge of R, we will get a maximum 2*d subregions in one split(d is total number of dimensions). This makes the compression gain low in high dimensions as shown on Figure 3.3. This issue also exists when we strike out red cells. However, we cannot control the position of red cells. Therefore, we have to stick to the same splitting strategy as CAS when removing red cells. 1 R l I i \ i / t I ! R: R* ; i R: — r i : I i _ / ; \ i I R4 i i R5 R4! 'A _ : R2 u i ; R6 Figure 3.3: Splitting Strategy of CAS 3.2.2 Algorithm CAS-Corner Based on the analysis above, we propose algorithm CAS-Corner, an improvement on CAS. In CAS-Corner, when striking out pure white regions, we first pick up k corners (k < 2d, d is total number of dimensions.) from region R and expand those corners to the largest pure white regions. In this way, if we find a white corner, we strike out the largest one among them from R and generate maximal d subregions in one split. If we cannot find a pure white region we just revert to the original 26 splitting method in CAS algorithm. This is a greedy approach to remove white cells as much as possible and generate less subregions simultaneously. Figure 3.4 shows this splitting strategy in both 2D and 3D cases. In Figure 3.4(a), the 2-dimensional case, region R is decomposed into 2 regions Ri and R2. In Figure 3.4(b), the 3-dimensional case, region R is decomposed to 3 regions R\,R2 and R3. Compared with Figure 3.1, the splitting strategy of CAS-Corner can generate fewer regions. 1 ; Rl [ R' ! ! / -~ \ R2 \ • \ 1 ! R l \ \ \ R* R2| J \ , ^ >-'----! -1-------R3 L... -Figure 3.4: Splitting Strategy of CAS-Corner The algorithm of CAS-Corner only modifies Step 4 of CAS. Algorithm CAS-Corner Step 1 to 3 are the same as step 1 to 3 in the algorithm CAS. 4. Do the following until the number of white cells in the covering G is less than the white budget. Randomly pick k corners from 2d corners of region R. if there are at least one white corner, then expand all the white corners to the largest white regions, and find the largest one, say R'. Split R into R',Ri,R2, ...,Rm. Update covering G by replacing R with Ri,R-2, ...,Rm. 27 else randomly pick up a white cell Cw in R and expand it to a pure white region R'. Remove R' from R and generate a set of subregions Rj(j—l,2,...,m). Update covering G by replacing R with Rj(j=l,2,...,m). According to the algorithms stated above, it is quite obvious that when a region R is decomposed into a set of subregions, all its subregions are non-overlapping to each other. Thus, the algorithms of CAS and CAS-Corner produce non-overlapping GMDL covering. In other words, the results are non-redundant. 3.3 Experiment The following is how the experiment is set up. We use a data generator to generate some synthetic datasets. The data generator takes the following items as inputs: number of total dimensions, intervals in each dimension, the number of blue cells, red cells, blue density and red density etc. In our experiments, we use a 3-dimensional numerical dataset with 100 intervals in each dimension, i.e. the dataset contains 1 million cells. We randomly pick 50k blue cells in a 50k/bd cluster, where bd is the blue density. For example, if blue density equals 75%, the cluster contains 66,667 cells. The red density is defined in a similar way. We pick 10k red cells from an area of 10k/rd. The red density, rd, is the number of red cells divided by the number of cells in that area. 5% of the blue area is overlapped with the red area. The reason we restrict red cells within a small region rather than distribute them over the whole blue area is that in real life the former is more common than the latter. So we want to simulate such circumstance that there are clusters that have higher percentage of "uninteresting" cells than other areas. 28 180 -i 0% 20% 40% 80% 120% Budget Ratio Figure 3.5: CAS-Corner vs. CAS w.r.t Compression Gain in 3D Our experiments compare the performance of CAS and CAS-Corner in both compression gain and time. Figure 3.5 shows their difference with respect to com-pression gain in 3-dimensional cases. In 3-dimensional cases, we can see that the compression gain increases when the white budget increases for both CAS and CAS-Corner. In addition, the compression gain of CAS-Corner is much higher than that of CAS when the white budget is more than 20% of the blue cells. Correspondingly, Figure 3.6 shows the time used for both algorithms. We can see that CAS-Corner consumes less time than CAS in the same period of budget ratio greater than 20%. This indicates that CAS-Corner can do much less splitting than CAS with appropri-ate white budget, which saves much time and gets higher compression gain. When the white budget keeps increasing, however, such difference is not that significant. The reason is that after striking out the red cells, together with surrounding white 29 70 Budget Ratio Figure 3.6: CAS-Corner vs. CAS w.r.t Time in 3D cells, only a few splits are needed to meet the requirment of white budget for both algorithms. As an effect, the difference between CAS-Corner and CAS decreases. To sum up, in this chapter, we analyzed the CAS algorithm, found its lim-itation, and proposed a new algorithm, CAS-Corner. CAS-Corner use a heuristic approach when striking out pure white regions. Our experiments in 3-dimensional cases show that CAS-Corner has a higher compression gain and takes less time than CAS. 30 Chapter 4 G M D L in the Hierarchical Case In the previous chapter, we discussed G M D L algorithms in the spatial case. In this chapter, we will turn to the M D L and G M D L problems in the hierarchical case. Since the M D L problem in the hierarchical case has not been well studied, we will proceed to analyze the nature of M D L problem in the hierarchical case first, propose the MDL-Tree algorithm, then analyze the G M D L problem while providing the GMDL-Tree algorithm. To begin with, we look into the M D L problem in the hierarchical case. 4.1 M D L Problem in the Hierarchical Case 4.1.1 Spatial vs. Hierarchical As a starting point, we first take a look at two important characteristics of M D L and G M D L problems, non-redundancy and blue-maximal. A non-redundant cover-ing means that no region in the covering can be derived from the other regions. For example, in Figure 4.1, covering G contains three regions: R1,R2,R3. R3 is redun-31 dant because we can derive it from region Rl and R2. The non-redundant covering G ' is R1,R2. A blue-maximal region in MDL means that we cannot add more blue cells into it to form a bigger region. In other words, it cannot be covered by another blue region. In GMDL, blue-maximal means we cannot enlarge this region by adding blue or white cells without exceeding the white budget. A covering is blue-maximal if all the regions in it are blue-maximal. Can we find a blue-maximal, non-redundant MDL(GMDL) covering for the hierarchical case? R l r i i o p i i i i o iO o i I i i a JO: 'R2 Figure 4.1: Example of Non-redundancy Blue-maximal in the Spatial Case First, we take a look at the MDL problem in the spatial case. In the spatial case, a MDL covering with least cardinality is non-redundant. Otherwise we can remove the redundant regions and get a new covering with less cardinality. However, blue-maximal introduces another way to view the quality of the regions in the covering other than MDL. We cannot determine that the MDL covering is blue-maximal just because it is a minimum description. We show this in the following example. Figure 4.2 (a) and (b) show two different covering for the same dataset. Both coverings are minimum descriptions with cardinality—3. The MDL covering in (a) contains 32 R1,R2, and R3, and that in (b) contains regions R4,R5,R6. Region R2 is not blue maximal because it can be expanded to region R*. Thus the covering in (a) is not blue-maximal, but the one in (b) is. From the above example, we can tell that, in the spatial case, there may be more than one M D L covering, and not all of them are blue-maximal. V O Rl V O R4 >n P R : in _. l_ O I \ T f / 1 "0" id >. 1 tn R*" 1 1 Q 1 i Q1 1 / cn R5l 'O' o •) h LQ "I 1 'Q C N :oi T—I R3 R6 1 2 3 4 5 6 1 2 3 4 5 6 (a) (b) V O R l 1 1 R2 V O <o 1 1 m i :oi " O o , IT) R5 :o: P T f i \ • Q ' o oi T f i — i •Q1 D O l R7 cn 0' a to) cn O i o; i — v :°: C N R3 :o: R4 C N R6 :o: T—1 1 2 3 4 5 6 1 2 3 4 5 6 Cc) Cd) Figure 4.2: Blue-maximal in the Spatial Case Another question should be whether there is only one blue-maximal, non-redundant covering, and if not, whether they have the same cardinality. Let us take a look at the example shown in Figure 4.2 (c) and (d). In (c), there are 4 33 regions in the MDL covering: R1,R2,R3,R4. Every region above can not be enlarged anymore, and is blue-maximal and non-redundant. In (d) there are 3 regions in the MDL covering: R5,R6,R7. It is also blue-maximal. We can see that both coverings are non-redundant and blue-maximal coverings, but their cardinalities are different. Consequently, from this example, we can tell that there maybe more than one blue-maximal covering with different cardinalities in the spatial case. Moreover, finding the optimal blue-maximal MDL covering in the spatial case has been proved to be a NP-hard problem[19], and so follows the GMDL problem. In this context, optimal refers to least cardinality. In addition, the spatial algorithms discussed in the previous chapters are all heuristic algorithms. To be specific, those three top-down algorithms, RTS, CAS and CAS-Corner, use splitting strategies. As an effect, they will easily split the neighboring blue cells, and can hardly make the GMDL covering blue-maximal. BP algorithm uses the first-fit strategy and can not guaranteed blue-maximal either. To sum up, it is difficult to find a blue-maximal non-redundant MDL covering in the spatial case. Blue-maximal in the Hierarchical Case As stated in Section 1.2.2, the GMDL problem in the hierarchical case is quite different from that of the spatial case. In the spatial case, if two blue cells are adjacent to each other, it is possible to cluster them together. Suppose they are the only blue cells, a blue-maximal covering has to cluster them together as one region. However, in the hierarchical case, if two blue cells have different parents. We cannot cluster them together, even if they are adjacent to each other. That is to say, clustering can only happen with siblings. Shown in Figure 4.3, cell Cl and C2 are neighbors. They can be clustered together in (a), but cannot be clustered 34 1 2 3 4 5 1 2 3 4 5 ( a) spatial (b ) hierarchical Figure 4.3: Clustering in the Spatial Case and the Hierarchical case together in (b). The hierarchical case has a more strict structure than the spatial one. 4.1.2 Uniqueness in the Hierarchical Case Based on the above observation, in the hierarchical case, a region R have to be a combination of one node Ki in each dimension. As stated in Section 1.2.2, one region in the hierarchical case can be expressed as a tuple (a; 1,0:2, . . . , £ # ) , in which Xi(i = l ,2, . . . , iV) is one of the node in the dimension i, and iV is the number of dimensions. In other words, R4, the projection of R on the dimension i (i=l,2,...,N), is exactly one node in the hierarchy Tj. That is to say, R4 is composed of all the leaves rooted at node Ki in hierarchy Tj. This can be shown in Figure 4.4. The projection of region Rg on dimension location is northeast, i.e., all the leaves rooted at northeast: New York, Albany, Summit, and Boston. According to the above analysis, we have the following property for regions in the hierarchical case. Property 4.1 Let R and S be two regions (xi,X2, . . . , £ J V ) and (2/1,2/2» •••IVN)- For 35 * a s a a il -w "> «> .1, f 5 I cS i -a -S -a :an jose "san francisco l^^^^-minneapolis ~ .Chicago .bostdn summit1 .albany ,' iew yprk . Q _Q . Q -p! R l p! Pi R2 R3 Pi Pi R4 bi o1 o1 R") 0. D_ ft D . D . D . D . fl' R6 b! R7 'oi bi R8 0 0 o 0' X l (0 o o 0' R9 l o 0 o Oi 1 CL £>_ sy Figure 4.4: Projection in the Hierarchical Case each dimension i(i=l,2,...,N), let Ri and Si be the projection of R and S on it respectively. Ri and Si are the collections of leaves that rooted at node Xi and yi respectively. For all 1 <i < N, either R4 fl Si — 0 or Ri C S%, or Si C R^. This can be understood in the following way. As stated before, R4 is the collections of leaves that rooted at X{, and Si is that of yi. Xi and yi are two nodes in the hierarchy Tj. Let us take a look at the relationship between Xi and yi. If Xi is the ancestor of yi, it is obvious that Si C R4. If Xi is the descendent of yi, it is obvious that R4 C Si. Otherwise, i.e., there is no ancestor-descendent relationship between X{ and yi, let us assume Rif) Si ^ 0. Then there must be some value V{, Vi € Ri and Vi € 5 ; . As there is no ancestor-descendent relationship between Xi and yi, Vi has two different parents under Xi and j/j. This is a contradiction to the hierarchical tree, where each node only have one parent. Thus, we can be sure the 36 Property 4.1 is correct. Let us compare this with the spatial examples in Figure 4.2 (b). The M D L covering in (b) contains 3 regions: R^,Re,Rj. The projections of region R$ and R-j on the vertical dimension are 3,4,5 and 2,3,4 respectively, which overlap partially with each other. That is to say, regions of the optimal M D L covering in the spatial case can have overlapping projections in a dimension. This makes things complicated in the spatial case. The above comparison shows us that the strict structure of the hierarchy makes the problem not as complicated as the spatial case. The following theorem tells that there exists a unique, blue-maximal and non-redundant M D L covering in the hierarchical case. The proof in Appendix A is taken from Lakshmanan et al.'s workfll]. Theorem 4.1 [11] Consider N categorical dimensions, each of which is organized in a tree hierarchy Tj(l < i < N). There is a unique non-redundant blue-maximal MDL covering. 4.1.3 MDL-Tree Algorithm Based on Theorem 4.1, we propose algorithm MDL-Tree, a per-hierarchy approach, to find the unique, non-redundant, and blue-maximal M D L covering. First, we find the blue-maximal region in the following way. Each node d in hierarchy Tj contains a list list[d], which is used to contain all regions that are associated with node d. Re-gion R in d is represented as a tuple (xi,x^, Z j - i , d, £ j+i , ...,x^). R can be "gener-ated" if every child node dj of node d contains the tuple (xi, X2,Xi-\,dj, Xi+\, ...,x^). This is the key step to generate the blue-maximal M D L covering. In the algorithm MDL-Tree, we traverse all hierarchies and generate all possible blue regions. 37 However, while searching for all possible blue-maximal blue regions, we gen-erate many redundant regions. As we want to find out all blue-maximal regions, we have to keep those redundant regions until we find all of the blue-maximal regions. We discuss this in Example 4.1 in Section 4.1.4. As an effect, in the last hierarchy, there are two kind of redundancy for region R in Ustfd], The first is that R is covered by its ancestors in the last hierarchy. In other words, by some region existing in the list of d's ancestors in the last hierarchy. The second is that R is covered by another region within the same list list[d]. That is to say, such redundancy is generated in the previous hierarchies. After eliminating the above two kind of redundancy, we get the unique, blue-maximal, non-redundant M D L covering. The algorithm MDL-Tree is stated as follows. Algorithm MDL-Tree /*Phase 1, generating blue-maximal regions*/ 1. Initialize C with all the blue cells of the form c\, c<i,cjv, initialize all hier-archy Tj (1 < % < N), with an empty list for each node. 2. For 1 < i < N, repeat the following steps: (a) for each item (xi,X2, ...,X{-i,d,Xi+i, ...,XN) in C, add it into list[d] inTj . (b) Repeat the following steps in a post-order traversal of the nodes in T i . if every child dj of d contains (xi, X2, a J j - i , dj, X j + i , X N ) , we insert (xi,X2, ~;Xi-i,d,Xi+i, ...,XN) into list[d] and C /*Phase 2, redundancy check-get non-redundant blue-maximal M D L cover-ing.*/ 3. In a post-order traversal of the node d in T/v, delete region R from Ustfd], 38 if region (a;i, X2,JCi-i, e, Xi+i,XAT) is contained in listfej. Node e is the parent of node d. 4. In a pre-order traversal of the node d in TN, regions R is deleted if R is a descendent of another region S in listfdj. This is determined on hierarchies Ti ,T2 , . . . ,T /v - i -From the above analysis, we can see that in MDL-Tree algorithm, we traverse all the hierarchies once in the propagating phase and once in the redundancy check phase. Thus the MDL-Tree finding algorithms can be done in time linear to the size of the hierarchies. Data Structures Notice that there are many searching operations in the algorithm MDL-Tree, e.g., in step 2(a) we need to search in listfdj] for region (xi,x^, ...,Xi-\,dj,Xj+\, . . . , X N ) , and similarly in both step 3 and 4. When there are many blue cells, the list in each node can be very long, which results in much time for searching. Thus, we implement hash table to accelerate the searching. For all nodes in all hierarchies, we build the same number of buckets, each of which contains a list listfh]. We design a hash function f(x\,X2, . . . ,X j_ i , a ; j+ i , ...,XN) to make sure that the children have the same value as their parent in one hierarchy. For example, if (l,g) is in bucket b in node g, (l,a),(l,b), and (l,c), if exist, are in bucket b of node a,b, and c respectively. This greatly improves the time performance. 4.1.4 Case Study of MDL-Tree Algorithm In order to illustrate the algorithm of MDL-Tree, we use the following example to illustrate how MDL-Tree works. 39 Example 4.1 Figure 4-5 shows us a 2-dimensional categorical dataspace with tree hierarchy Ti(i=l,2) in dimension i. Each node in dimension 1 is represented by an integer, and each node in dimension 2 is represented by an alphabet. So a region can be represented by a tuple (x,y), e.g., (l,a), (5,e). ts c ' ( / : C -I '•3 60 : o c3 o o o o o o o o o X o o o o o o o o o o o o X Figure 4.5: Example of MDL-Tree Algorithm Phase 1: generating blue-maximal regions First, as we see in Figure 4.6 (a), we first initialize tree hierarchy in dimension 1. Notice that in Figure 4.6, we simplify region (x,y) as xy by omitting the bracket and comma. Each round rectangle, together with the blue regions inside it, represents the list of a node. Al l the 21 blue cells are in the leaf nodes of T\. Initially, the list of node 7,8, and 9 are empty. During the post-order traverse of T\, regions (7,a) is generated because (l,a) and (2,a) are in listfl] and list[2] respectively. As a result, (7,a) can be generated and inserted into list[7]. Similarly, (7,b),(7,c),(7,e) are inserted into list[7], 40 ( a) propergate in dimension 1 (b ) propergate in dimension 2 Figure 4.6: MDL-Tree: Propagate in Al l Dimensions and (8,e) is inserted into list[8]. As (7,e) and (8,e) exist respectively on node 7 and 8, the only child nodes of node 9, regions (9,e) is generated and inserted into list[9]. Up to now, the covering C gets 6 new blue regions: (7,a),(7,b),(7,c),(7,e),(8,e),(9,e), and the total number of regions reaches 27. After we finish propagation in T i , we initialize all the leaf nodes of tree hierarchy T2 with all of the blue regions in covering C, as shown in Figure 4.6(b). We repeat the same propagation procedure as in T\, and get 4 more blue regions: (l,g),(2,g),(7,g) in listfg] and (4,h) in listfh]. By far, covering C contains all the possible blue regions. Phase 2: redundancy check We show the process of redundancy check in Figure 4.7. The data in Figure 4.7(a) initially contains all the possible blue regions. According to step 3, we first delete the redundant regions with the ancestor-descendent relationship. For example, region 41 (a) redundancy check with ancestor-decendant (b ) redundancy check within the same list Figure 4.7: MDL-Tree: Redundancy Check (l,g) in node g is generated from regions (l,a),(l,b),and (l,c). That is to say, (l,g) can cover the above 3 regions. Thus (l,a), (l,b), and (l,c) are struke out from their list and covering C. In Figure 4.7(a), the regions being crossed out are all those that are removed from C. After we finish step 3, we need to strike out all the redundant regions within one list. Take node g as an example, listfg] contains 3 regions: (l,g), (2,g), (7,g). According to the tree hierarchy T\. (l,g) and (2,g) are covered by (7,g), and shall be removed from list[g] and covering C. Finally, the remaining regions in covering C are (7,g),(4,h),(4,a),(5,a),(5,b),(6,b),(6,d),(9,e),(2,f),(3,f). The M D L covering is shown on Figure 4.8. In the MDL-Tree algorithm, while propagating in each dimension, there are many regions can be considered "redundant". However, we only apply the redun-dancy check after completing propagation in all dimensions. The reason is that we want to find all blue-maximal regions. In order to do this, we have to keep all the 42 7 A 1 2 8 3 4 5 6 p] •p o o !o. o o; X T 1 •0 1 i !o o; .Q 6' 'P_ i _9< 0 Q X Figure 4.8: MDL-Tree: M D L Covering possible blue regions during propagation. Let us take a look at Figure 4.6(a). After generating region (8,e), regions (3,e),(4,e),(5,e),(6,e) is covered by (8,e), and can be considered "redundant". If we delete them from covering C immediately, we cannot get region (3,e), (4,e), (5,e), (6,e) in Figure 4.6(b). Consequently, we cannot get region (4,h) when propagating in hierarchy T<i- That is to say, covering C is not blue-maximal any more. Therefore, we choose to do the redundancy check after generating all possible blue-maximal regions. 4.2 G M D L Problem in the Hierarchical Case Before we study the G M D L problem in the hierarchical case, it is beneficial to think about the essential idea of G M D L . The reason we propose Generalized M D L is due to our tolerance of "impurity" with an improvement on compression, i.e., we can get shorter descriptions. In other words, our sacrifice in accuracy can lead to higher 43 compression. This principle provides us a guide to think of the G M D L problem in the hierarchical case. Let us take a look at Example 4.1 in Section 4.1.4. In Figure 4.5, blue cells (5,a) and (5,b) can not be propagated to (5,g), because (5,c) is a white cell, not blue cell. If we turn (5,c) from white to blue, (5,a) and (5,b) can propagate to (5,g). Thus, we just need (5,g) in stead of (5,a) and (5,b) in the covering C. In general, the reason that a region can not be propagated to its parents lies in that it lacks of siblings. This provides us an idea of G M D L region finding in the hierarchical case-turn white siblings into blue when necessary. However, we have to control the degree of inaccuracy by the whitebudget w. That is to say, we have to choose the white cells that needs to be included. In order to do this, we introduce the following concepts: candidate, occurrence, maximal gain, and cost. A candidate is a region R, represented as (xi,Xi-i,d, X j + i , x ^ ) , that potentially can be generated as a blue region. Suppose node d has L child nodes, and regions R\,...,Rk are the k regions that can be used to generate R in the L child nodes of d in hierarchy T j . Each candidate may contain a set of white cells. Thus, our mission is to choose white cells according to which candidates we want to change to blue. The following three concepts, occurrence, maximal gain, and cost provide us a way to choose the candidates. occurrence is the number of existing blue regions(cells) that can be used to generate R in the child nodes of d, i.e., k. The maximal gain is the maximal reduce in the number of regions if we use R instead of its children Ri,...,Rk. maximal-gain equals to occurrence-1, if it is greater than zero. Otherwise, it is set to zero, cost is the number of white cells that needs to be included, in order to get region R. That is to say, cost is the number of cells covered by the set of regions -Rjt+i, . . . , R L -44 So far, we have not talked about red cells. If any one of the region in Rk+i, -Rfc+2; •••) R L contains a red cell, R can not be generated as blue. In this case, the cost is set to "X", which means this candidate needs to be struke out. Theoretically, any region R (x\, X2,• x^) can be a candidate, if at least one coordinate X{ in x\,X2, ...,XN is an internal node of Tj and there is no red cell inside. However, the tree hierarchies also exist among the candidates. If one candidate is chosen, it means all of its child candidates are chosen as well. Thus, it does not make sense to sort all candidates at the same time. To simplify, in this thesis, we confine the candidates to regions satisfying the following conditions: there are exactly N-l coordinates,say (x2, £ 3 , ...,XN) of (x\, X2, • x^), are in leaf nodes, and the remaining one, say x\, is a parent of leaves in the remaining hierarchy T\. As an effect, we can choose candidates in a per-hierarchy way. In this thesis, we apply a heuristic approach to sort candidates- candidates with higher maximal-gain and lower cost have a higher priority to be included. First, we sort all the candidates with maximal gain descendently. For candidates having the same maximal gain, we sort them with increasing cost. We build a white list WL to store all the included white cells. Then, we traverse candidates from the beginning to the end of the candidate list. When a candidate is chosen, we add the corresponding white cells into WL, if they do not exist there. We keep traversing until we run out of white budget w. Let us use Figure 4.5 as an example. Table 4.1 shows a list of regions together with occurrence, maximal gain, and cost. Take (5,g) as an example, there are two blue cells, (5,a) and (5,b), under (5,g). So the occurrence of (5,g) is 2, and the maximal gain is 1. We need to include one white cell (5,c) to propagate to (5,g), which means the cost is 1. For (l,g) and (2,g) the cost is 0 means that (l,g) is 45 already a blue region, and we do not need to include any white cell. For (3,g), the occurrence is 0 and maximal gain is 0, which means (3,g) is a pure white region. As (6,g) covers a red cell (6,a), we mark its cost as "X", and do not need to sort it. Thus, after striking out pure blue candidates and the candidate covering at least one red cell, we sort the remaining candidates decendently on their maximal gain, and then on increasing cost. In this example, we build Table 4.2 for all the candidates. The ranking can be (5,g), (2,h), (6,h), (8,b), etc. With the above ranking, we can find the white cells to be included. Then we input the chosen white cells together with original blue cells into MDL-Tree algorithm. Here (5,c) is put into MDL-Tree algorithm due to candidate (5,g). Thus, in covering C, we can use (5,g) instead of (5,a) and (5,b) separately. The algorithm GMDL-Tree is stated as follows. candidate (l,g) (2,g) (3,g) (4,g) (5,g) (6,g) occurrence 3 3 0 1 2 1 maximal gain 2 2 -1 0 1 0 cost 0 0 3 2 1 X Table 4.1: A n Example of occurrence, maximal gain, and cost Algorithm GMDL-Tree 1. Initialize candidates list CL, white cells list WL as empty. Each item in CL is associated with four fields: candidate, occurrence, maximal gain, and cost. 2. For each tree hierarchy Tj, do the following: (a) Initialize the leaves with all blue cells. (b) For each leaves d, suppose node e is cTs parent. For each blue cell b:(x\, ...,Xi-\,d, Xj+i, ...,XN) in d, generate 6's parent c:(x\, ...,Xi-\,e, i j + i , 46 candidate (l,g) (2,g) (3,g) (4,g) (5,g) (6,g) occurrence 3 3 0 1 2 1 maximal gain 2 2 -1 0 1 0 cost 0 0 3 2 1 X candidate (l,h) (2,h) (3,h) (4,h) (5,h) (6,h) occurrence 1 2 2 3 1 2 maximal gain 0 1 1 2 0 1 cost 2 1 X 0 2 1 candidate (7,a) (7,b) (7,c) (7,d) (7,e) (7,f) occurrence 2 2 2 0 2 1 maximal gain 1 1 1 -1 1 0 cost 0 0 0 2 0 1 candidate (8,a) (8,b) (8,c) (8,d) (8,e) (8,f) occurrence 2 2 0 2 4 2 maximal gain 1 1 -1 1 3 1 cost X 2 4 2 0 2 Table 4.2: occurrence, maximal gain, and cost for Al l Candidates in e. If c is in CL increase its occurrence by one. Otherwise, add c into CL with occurance=l. (c) Finally, for each candidate c in parent e, set its maximal gain to the value of occurrence-1, cost to the number of children of e minus occurrence. (d) If a candidate is the parent of a red cell or its cost is negative, remove it from CL. 3. Sort candidates in CL with a descendent maximal gain. For candidates with the same maximal gain, sort them with increasing cost 4. For each candidate in the list CL, add the corresponding white cells to WL if they do not exist there. 5. Run MDL-Tree algorithm with the original blue cells and first w white cells 47 in WX(treat them as blue). 4.3 Experiments 4.3.1 M D L vs. G M D L To evaluate the effectiveness of the MDL-Tree and GMDL-Tree algorithms, we de-veloped a hierarchy generator to generate tree hierarchies with different fanouts to mimic the hierarchies in data warehousing benchmarks. For example, the tree hier-archy with fanout 1x4x4x12 means that the root node has 4 children, each of which have 4 children, and each parent of leaf nodes has 12 children. Figure 4.9 shows us the experimental result of 3-dimensional cases. In Figure 4.9, we generate a 1x4x4x6x6 hierarchy in each dimension, and randomly pick 1 million blue cells from a cluster of 110 leaves in each dimension. This is approximate 75% in blue density. In addition, we randomly pick some red cells with in the same cluster. The x-axis is called budget ratio which is the white budget divided by the number of blue cells. The Y-axis is the compression gain, which is the number of blue cells divided by the number of G M D L regions found. Figure 4.9 shows how the compression gain varies when the white budget changes from 0% to 60%. The results with budget ratio equals 0% is actually equivalent to MDL-Tree. From this figure, it is clear that the compression gain increases as budget ratio, i.e., white budget, increases. When white budget equals 0, the compression gain is only around 2. The reason lies in that, according to the M D L principle, if any child"of a region is not blue, this region can not be blue. In this experiment, the fanout of the parent of leaves is 6, i.e., all of the 6 leaves of a parent have to be blue 48 120r Budget Ratio Figure 4.9: Compression Gain vs. White Budget in order to propagate to their parent. As we pick the blue cells in a random way, it is not quite easy to propagate. However, in G M D L , when we include some white cells according to the maximal gain and cost, more propagation can take place. For example, with budget ratio equals 60% and no red cell, the compression gain can be more than 100, i.e., as much as 50 order of magnitude of M D L . This clearly shows that G M D L approach get a much shorter description than M D L . In addition, we randomly pick some red cells from the same cluster to that of the blue cells. One red cell prohibitates all of its ancestors in each dimension from being generated. Figure 4.9 clearly shows that the compression gain drops when we increase the number of red cells. 49 4.3.2 Varying Dimensionality In Table 4.3, we show the compression gain and the time used for different dimen-sions. In this experiment, we set the blue cells to 1 million, red cells to 1000, and the budget ratio to 100%. Prom our experiments, we can see that G M D L have a higher compression gain than M D L even in the higher dimensions. Even in 5d, it is more than an order of magnitude. The reason of low compression gain in 5d is that in order to compare the results in different dimensions, we keep the number of blue cells constant to 1 million and blue density to 75%. Thus, there are only around 17 leaves for the blue cluster in each hierarchy, and there are not many chances for the blue cells to propagate to their parents. 3-d 4-d 5-d compression gain 39.2 26.4 5.9 run time (sees.) 159 514 1573 Table 4.3: G M D L with Different Dimensions 50 Chapter 5 Conclusions and Future Work 5.1 Summary In this thesis, we have studied the data summarization problems both in the spatial case and in the hierarchical case. We discussed the MDL principle and presented how to apply it to data summarization. In addition, we extended the MDL principle to the GMDL principle by including some tolerable impurities. The GMDL approach is effective in providing a shorter description in both the spatial and the hierarchical cases. We first studied three current GMDL algorithms, BP, RTS and CAS, for the spatial case, compared the performance of CAS in 2-dimensional and 3-dimensional cases, and analyzed the reason of its drops in compression gain from 2-dimensional to 3-dimensional cases. As a result, we proposed a new algorithm CAS-Corner, which has a better performance in both time and compression gain over CAS. We also studied the nature of MDL and GMDL problems in the hierar-chical case, and found that they are quite different from the spatial case. After 51 that, we gave a theorem of the existence and uniqueness of the blue-maximal, non-redundant M D L covering. Moreover, we proposed the algorithm MDL-Tree to find this covering. Algorithm MDL-Tree visits each node at most twice. With MDL-Tree algorithm, we can find M D L covering in polynomial time in the size of hierarchies. In addition, we proposed algorithm GMDL-Tree, a heuristic approach, to find the G M D L covering in the hierarchical case. It uses the white budget based on the contribution from each candidate. The contribution was determined by its maximal gain and cost. The experimental results showed that the GMDL-Tree algorithm can result in a higher compression gain, i.e., a shorter description, than the MDL-Tree algorithm. 5.2 Future Work 5.2.1 Parallelism In practice, applications often involve large datasets and high dimensions, e.g., O L A P . In such cases, MDL-Tree and GMDL-Tree algorithms need large memory and take a long time to complete. One possible direction of the future work is to apply parallelism to han-dle these problems. A simple idea for parallelism is a per-hierarchy style. Each computer contains one hierarchy out of the N hierarchies. We first distribute the dataset to each computer, and apply MDL-Tree algorithm on every hierarchy si-multaneously. Then, we move the newly generated blue regions in hierarchy Tj to hierarchy Tj+i (i=l,2,...,N-l), and do propagation again. We repeat this step until i=N-l. Consequently, we shrink all the possible blue regions in hierarchy T/v- That is the same result as in Step 2 of algorithm MDL-Tree. Furthermore, we can split 52 one hierarchy into several parts and split the dataset correspondingly. Then, we dis-tribute each part of hierarchy and corresponding dataset into different computers. This will accelerate the overall procedure and reduce the usage of memory in one computer. For example, for a 2-dimensional dataset, there are two hierarchies T\ and T2 in each dimension respectively and 1000 blue cells initially. Let us assume that after propagation in hierarchy T\, we get 400 new blue regions. Let T\(numBlue) denote the time used on the propagation of numBlue blue cells in hierarchy T\. As an effect, in this example, the time spent on Step 2 of MDL-Tree algorithm is T i (1000) + T2(1400). If we do it parallel with T x and T 2 in two different computers, the total time used is only max{Ti(1000),T2(1000)) +T2(400). This is less than that of the original MDL-Tree algorithm. Furthermore, in parallel, the computer only need to process at most 1000 cells/regions at a time, which is less than the 1400 cells/regions in the MDL-Tree algorithm. In result, it needs less memory. Those benefits are more significant in higher dimensions. 5.2.2 Spatial and Hierarchical In this thesis, we only studied the pure spatial and pure hierarchical cases. In reality, however, there exist such cases that some dimensions of a dataset are numerical while the others are categorical, which are organized by hierarchies. For example, in a dataset there are 3 dimensions, age, salary, and location. The first 2 dimensions, age and salary, are numerical and the third dimension, location, is categorical with hierarchy country-province-city. We call such condition hybrid cases. The problems of summarizing data in such cases are non-trivial and deserved for further research. One simple idea to solve this problem is to use the algorithms discussed in this 53 thesis, and combine them together to solve the problems in the hybrid case. 54 Bibliography [1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD, pages 94-105, 1998. [2] Kevin Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In ACM SIGMOD Conference, pages 359-370, 1999. [3] E . F . Code. Providing O L A P (on-line analytical processing) to user- analysts: A n IT mandate, http://www.hyperion.com/. [4] O L A P Council. O L A P Council White Paper, http://www.olapcouncil.org. [5] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In VLDB, pages 299-310, 1998. [6] Yvan J . Garcia, Mario A. Lopez, and Scott Leutenegger. On optimal node splitting for r-trees. In VLDB, pages 334-344, 1998. [7] Haisong Gu, Yoshiaki Shirai, and Minoru Asada. Mdl-based segmentation and motion modeling in a long image sequence of scene with multiple independently moving objects. In IEEE Transactions on Pattern Analysis and Machine Learn-ing, volume 18(1), pages 58-64, 1996. [8] Mark H. Hansen and Bin Yu. Model selection and the principle of minimum description length. In Journal of American Statistical Association, volume 96, pages 746-774, 2001. [9] W . H . Inmon. Building the Data Warehouse. John Wiley, 1992. [10] J.Rissanen. Modeling by shortest data description. In Automatica, volume 14, pages 465-471, 1978. 55 [11] Laks V.S. Lakshmanan, Raymond T. Ng, Christine Xing Wang, Xiaodong Zhou, and Theodore J. Johnson. The generalized mdl approach for summa-rization. In VLDB, Hong Kong,China, 2002. [12] M. Li and P. Vitanyi. Ideal mdl and its relation to bayesianism. In Proceedings of Information, Statistic and Induction in Science World Scientific, pages 282-291, Singapore, 1996. [13] Manish Mehta, Jorma Rissanen, and Rakesh Agrawal. Mdl-based decision tree pruning. In KDD, pages 216-221, 1995. [14] Raymond T. Ng, Alan Wagner, and Yu Yin. Iceberg-cube computation with pc clusters. In ACM SIGMOD, pages 25-36, 2001. [15] Raymond T. Ng and Christine Xing Wang. Summerization using the gener-alized minimum description length. In Technical Report, Computer Science Department, Univ. of British Columbia, 2000. [16] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash based algorithm for mining association rules. In ACM SIGMOD, pages 175-186, May 1995. [17] J. Ross Quinlan and Ronald L. Rivest. Inferrning decision trees using the min-imum description lenght principle. In Information and Computation, volume 80(3), pages 227-248, 1989. [18] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. McGraw Hill, 2003. [19] R. A. Reckhow and J. Culberson. Covering simple orthogonal polygon with a minimum number of orthogonally convex polygons. In ACM Annual Compu-tational Geometry Conference, pages 268-277, 1987. [20] Eric S. Ristad and Robert G. Thomas. Context Models in the MDL Framework. In Data Compression Conference, pages 62-71, 1995. 56 Appendix A Proof of Theorem 4.1 Theorem 4.1 Consider k categorical dimensions, each of which is organized in a tree hier-archy T{ (1 < i < k). There is a unique non-redundant blue-maximal MDL covering. Proof: Let us consider the case in detail for the 2-dimensional case, i.e., k = 2. Suppose there are two distinct non-redundant coverings C\ and C 2 . Then there must exist a blue cell (c i ,c 2 ) such that it is covered differently in the two coverings. That is, the blue cell is covered by R = (x\,a:2) in C\ but not in C 2 , and is covered by S = (2/1,2/2) hi C 2 but not in C 2 . Because of Property 4.1, it must be the case that either R\ contains 5i , or vice versa. (They cannot be disjoint because of the blue cell (ci,c2).) This is equivalent to saying that x\ is an ancestor of y\, or vice versa. Without loss of generality, let say that x\ is an ancestor or y\. Similarly, considering the other dimension, either i? 2 contains S 2 or vice versa. Now that x\ is an ancestor of y\, it must be the case here that y 2 is an ancestor of a;2-(Otherwise, R would be a strict superset of S, and C 2 cannot be blue-maximal, which is a contradiction.) Figure A . l depicts the situation, with the rectangular area showing 57 Areas I. II. Ill mustbe all blue Figure A . l : Uniqueness of Optimal Covering the corresponding blue cells. Note that because (#1,0:2) is a blue region, all cells in area I and II must be blue. Similarly, because (2/1,2/2) is a blue region, all cells in area I and III must be blue. Then there are two cases. Case 1: not all cells in area IV are blue. In this case, there must be non-redundant regions in C2 covering the blue cells in area II. Thus, C2 is not blue-maximal, because it misses the blue-maximal region consisting of area I and II combined. This is a contradiction. Case 2: all cells in area IV are blue. If it is correct for the blue-maximal region consisting of area I and II combined not to appear in C2, it must be the case that it is redundant. This is the situation when all cells in area IV are blue, in which case area I and III form one blue-maximal region, and area II and IV form another blue-maximal region in C2. But if this 58 is the case, the region (£1,2/2) is a blue region, and then both C\ and C2 are not blue-maximal, which is a contradiction. The above argument can be easily generalized. Specifically, given the blue cell (ci , . . . ,Cfe), and the two distinct regions (xi,...,Xk) £ C\ and (yi,.--,yk) £ C2, the k dimensions can be divided into three subgroups: (i) those Vs (1 < i < k) such that Xi — yf, (ii) those z's such that X{ is an ancestor of yi\ and (iii) those such that yi is an ancestor of Because the latter two subgroups must be non-empty, we can essentially use the same argument as above to derive a contradiction. 59
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Generalized MDL approach for data summarization
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Generalized MDL approach for data summarization Zhou, Xiaodong 2002
pdf
Page Metadata
Item Metadata
Title | Generalized MDL approach for data summarization |
Creator |
Zhou, Xiaodong |
Date Issued | 2002 |
Description | There are many applications that identify some data of interest. Usually, such applications just return a set of records that satisfy the criteria applied. However, such results cannot provide enough information for the user. A concise description is more preferable than the individual data. Minimum Description Length (MDL) is a well-known approach to handle such problems. In this thesis, we extend the MDL principle to the Generalized MDL (GMDL) principle by including some "do not care" data. We apply the MDL and GMDL principles to solve the problem of data summarization both in the spatial case and in the hierarchical case. For the spatial case, we improve one current top-down algorithm for high-dimensional data. We also study the GMDL problem for the hierarchical case and find that there exists a unique, non-redundant, and bluemaximal MDL covering. We propose MDL-Tree and GMDL-Tree algorithms to find MDL covering and GMDL covering respectively in the hierarchical case. The experimental results show that GMDL coverings have a much shorter description than MDL covering in the hierarchical case. |
Extent | 2569767 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-10-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051405 |
URI | http://hdl.handle.net/2429/13859 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2003-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2003-0047.pdf [ 2.45MB ]
- Metadata
- JSON: 831-1.0051405.json
- JSON-LD: 831-1.0051405-ld.json
- RDF/XML (Pretty): 831-1.0051405-rdf.xml
- RDF/JSON: 831-1.0051405-rdf.json
- Turtle: 831-1.0051405-turtle.txt
- N-Triples: 831-1.0051405-rdf-ntriples.txt
- Original Record: 831-1.0051405-source.json
- Full Text
- 831-1.0051405-fulltext.txt
- Citation
- 831-1.0051405.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051405/manifest