Semantic Compression and Pattern Extraction with Fascicles By Jason Chun-Sing Madar B .Sc , University of British Columbia, 1996 A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T OF T H E REQUIREMENTS FOR THE DEGREE OF M A S T E R OF SCIENCE in T H E F A C U L T Y OF G R A D U A T E STUDIES D E P A R T M E N T OF COMPUTER SCIENCE We accept this thesis as conforming to the required standard T H E UNIVERSITY OF BRITISH C O L U M B I A April, 1999 ©Jason Chun-Sing Madar, 1999 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. The University of British Columbia Vancouver, Canada DE-6 (2/88) Abstract Often many records in a database share similar values for several attributes. If one is able to identify and group these records together that share similar values for some — even if not all — attributes, not only does one have the possibility of a more parsimonious representation of the data, but one may also gain useful insight into the data from an analysis and mining perspective. In this thesis, we introduce the notion of fascicles. A fascicle F(k,t) is a subset of records that have k compact attributes. An attribute A of a collection F of records is compact if the width of the range of A-values (for numeric attributes) or the number of distinct A-values (for categorical attributes) of all the records in F does not exceed t. We introduce and study two problems related to fascicles. First, we consider how to find fascicles such that the total storage of the relation is minimized. Second, we study how best to extract fascicles whose sizes exceed a given minimum threshold and that represent patterns of maximal quality, where quality is measured by the pair (k,t). We develop algorithms to attack both of the above problems. We show that these two problems are very hard to solve optimally. But we demonstrate empirically that good solutions can be obtained using our algorithms. i i Table of content ABSTRACT II TABLE OF CONTENT Ill LIST OF TABLES V LIST OF FIGURES VI ACKNOWLEDGEMENT VII CHAPTER 1 INTRODUCTION 1 1.1 DATA COMPRESSION 1 1.2 DATA MINING 1 1.3 PROBLEM DEFINITION 1 1.4 CONTRIBUTION 4 1.5 THESIS OUTLINE 5 CHAPTER 2 RELATED WORK 5 2.1 GENERAL DATA COMPRESSION ALGORITHMS 5 2.2 DATA CLUSTERING ALGORITHMS 6 2.3 SUBSPACE CLUSTERING ALGORITHM 7 2.4 ASSOCIATION RULE DATA MINING ALGORITHMS 7 2.5 DIMENSIONALITY REDUCTION ALGORITHMS 8 CHAPTER 3 MODEL AND ASSUMPTIONS 9 CHAPTER 4 FINDING CANDIDATE K-D FASCICLES 10 4.1 NAIVE, LEVEL-WISE APRIORI ALGORITHM 10 4.2 T H E SINGLE-K ALGORITHM 11 4.2.1 The Basic Approach: a Lattice-based Conceptualization 11 4.2.2 Choosing Good Initial Tip Sets 12 4.2.3 Growing a Tip Set to a Maximal Set 13 4.3 T H E MULTI-K ALGORITHM 15 CHAPTER 5 GREEDY SELECTION FOR STORAGE MINIMIZATION 16 5.1 GREEDY SELECTION WITH BATCH WEIGHT RE-ADJUSTMENT 16 5.2 WEIGHT RE-ADJUSTMENT FOR THE MULTI-K ALGORITHM 17 CHAPTER 6 EXPERIMENTAL EVALUATION FOR THE STORAGE MINIMIZATION PROBLEM 19 6.1 DATASETS USED AND EXPERIMENTAL SETUP 19 6.2 INEFFICIENCY OF THE APRIORI ALGORITHM 20 6.3 T H E EFFECT OF P, THE NUMBER OF TIP SETS 23 6.4 T H E EFFECT OF K, THE (MINIMUM) DIMENSIONALITY OF FASCICLES 24 i i i 6.5 T H E EFFECT OF T , THE COMPACTNESS TOLERANCE 27 6.6 T H E EFFECT OF M , THE MINIMUM FASCICLE SIZE 27 6.7 SCALABILITY WITH RESPECT TO DATA SET SIZE 28 6.8 FURTHER COMPRESSION 29 C H A P T E R 7 S O L V I N G T H E P A T T E R N E X T R A C T I O N P R O B L E M 30 7.1 T H E PATTERN EXTRACTION PROBLEM WITH FASCICLES 30 7.2 T H E MULTI-EXTRACT ALGORITHM 32 7.2.1 Computing Fascicles for all (k,t) Pairs 32 7.2.2 Maximizing Fas(R) 33 7.3 EMPIRICAL RESULTS 35 C H A P T E R 8 C O N C L U S I O N S A N D F U T U R E W O R K 37 R E F E R E N C E S 38 A P P E N D I X - R E S U L T S F R O M T H E N H L D A T A S E T 40 iv List of Tables Table 1: A Fragment of the N H L Players' Statistics Table 2 Table 2: Character frequency table 6 Table 3: Itemset dimensionality path 15 Table 4: Inappropriateness of the Apriori Algorithm for Storage Minimization 20 Table 5: The Effect of t, the Compactness Tolerance , 27 Table 6: The Effect of m, the Minimum Fascicle Size 28 Table 7: Effect of addition compression scheme on fascicles 29 Table 8: Descriptions of Sample Patterns in the N H L Data Set. Each column is a fascicle with the attribute ranges given. A l l attributes are compact except the ones in parentheses 35 Table 9: Effect of t on storage and coverage for the N H L dataset 42 Table 10: Effect of m on storage for the N H L dataset 42 v List of figures Figure 1: Example of data clusters 6 Figure 2: Example of CLIQUE 7 Figure 3: A Skeleton of the Single-k Algorithm 14 Figure 4: The Effect of P... 22 Figure 5: Optimal k for Storage Minimization 25 Figure 6: Runtime is independent of k 26 Figure 7: The compactness tolerance's effect on coverage 27 Figure 8: Scalability with respect to Data Set Size 29 Figure 9: An Example: Pattern Extraction with Fascicles for the N H L data 31 Figure 10: A Skeleton of the Multi-Extract Algorithm 33 Figure 11: Effect of P on storage for the N H L dataset 40 Figure 12: Effect of P on runtime for the N H L dataset 41 Figure 13: Effect of k on storage and coverage for the N H L dataset 41 vi Acknowledgement There are just too many people for me to thank. First of all, I'd like to thank my thesis supervisor, Dr. Raymond Ng, for making this thesis a reality. I ' l l always be grateful for his guidance and patience throughout my research and thesis writing. Without Raymond, this thesis would not be possible. I would also like to thank my second reader, Dr. Gail Murphy, for her insightful comments and quick response. Gail made it possible for me to graduate on schedule and I ' l l always be grateful. I am forever in debt to my mom and dad, Eleanor and Nazir. Words cannot describe the unconditional love that they have shown me. They are always supportive, caring, patience, and loving. If I were to express my gratitude towards them on paper, I would find myself writing something in excess of the length of this thesis. I would, therefore, refrain from writing such a document until a later time. For now, I just want to tell my parents that I love them both very much. To my little brother Alvin, the times that we spent jamming were simply awesome! Thanks for being there for me through the good and the bad times and helping me see things more clearly. To all my friends (there are just too many to name, I am sure you know who you are). I have always thought that I have the most wonderful friends in the world. You guys and girls are the best. Thanks for sticking by me all these years. I will always treasure the friendship. Special thanks to Dr. George Tsiknis, for the wonderful conversations and pointers; and to Michael McAllister, for being the most wonderful office mate and showing me "the ropes" around the department. Most of all, I would like to thank my personal Lord and Savior, Jesus Christ. Thank you for walking with me through all my troubles and keeping me safe. Without Him, I would never have the strength nor the ability to finish this thesis. vii Chapter 1 Introduction 1.1 Data Compression Most computer users are familiar with the idea of data compression. Very often, when we try to receive files from the Internet, we would ask the other party to "zip" the file before sending to reducing the total transfer time. As the Internet becomes more popular and people are trying to do many things over the Internet, the speed in which we transfer data becomes increasingly important. As an example, people are now trying to broadcast live audio over the Internet. C D quality audio takes about 10Mb per minute of storage space (or around 170k per second). A typical ISDN connection gives us only around 18k to 20k per second and is therefore incapable of receiving live audio. A recent solution to this live audio over the Internet problem is a compression format call MP3 (MPEG Layer 3). The MP3 format is able to decrease the size of an audio stream by a factor of 10 with minimal loss of sound quality. This format makes it suitable for quality audio over the slow connections. MP3 is a form of "lossy" compression. In many lossy compression algorithms, the user can choose to sacrifice quality for better storage. 1.2 Data Mining Data mining, also known as "knowledge discovery in databases" is a relatively new area of database research. The goal of data mining is to extract useful patterns from existing data. The supermarket customer transaction database is popular example of pattern extraction. Every time one goes to the check out counter in a supermarket, the supermarket inserts a record into a database recording the items that are bought. Such database is very useful in keeping track of inventory. However, as this transaction database increases in size, we may be able to extract patterns from this database such as "90% of the transactions that involves chips also involves salsa". Patterns like this may be very useful to the supermarket. At first glance, data compression and data mining may not appear to be related. We introduce the notion of fascicles1 in this thesis. Using fascicles, we can perform both lossy compression and pattern extraction from large datasets. 1.3 Problem definition Table 1 shows part of a table of National Hockey League (NHL) players' statistics in 1996. For each player, his record describes the position he played, the number of points2 he scored, the 1 According to Webster, a fascicle is 1. a small bundle; 2. one of the installments of a book published in parts. 2 If a player scored or assisted in a scoring play, the player was awarded one point 1 number of minutes he was on the ice and was in the penalty box. While there are players of almost every possible combination, there are numerous subsets of records having very similar values for many attributes. In other words, records in one of these subsets vary considerably in only the other attributes. For example, (i) Cullimore and Wotton belong to a (large) subset of defensemen who played, scored and penalized sparingly; (ii) Borque, Gretzky and Tkachuk belong to another group who played and scored a lot; and (iii) Peca and Tkachuk are the kind of centres who were often penalized. It is easy to see that identification of such subsets can be the basis for good compression schemes, and also valuable in its own right for data mining purposes. Name Position Points Played Minutes Penalty Minutes Blake defense 43 395 34 Borque defense 77 430 22 Cullimore defense 3 30 18 Gretzky centre 130 458 26 Karia centre 125 620 22 Konstantinov defense 10 560 120 May winger 35 290 180 McCarty winger 41 340 220 Odjick winger 9 115 245 Peca centre 30 - 582 176 Tikkanen winger 19 • 220 36 Tkachuk centre 82 530 160 Wotton defense 5 38 6 Tab le 1: A Fragment of the N H L P laye r s ' Statistics Tab le For the N H L data set, taking no action with regards to these subsets is probably fine, because the data set is small (i.e., about 1,000 rows with 20 attributes). But for much larger data sets that exhibit this kind of subsets, it would be desirable to exploit the subsets to minimize storage. Yet there is not much that can be done by existing compression techniques within the bounds of having a single relation. The only option is to develop a new scheme with multiple relations. The difficulties are that the specific choices of new schemes may not be known at schema definition time, and that the number of combinations can be enormous. In order to resolve these difficulties in defining multiple relations, the crucial first step is the identification of what we call fascicles. A fc-dimensional, or ^-attributed, fascicle F(k,t) of a relation is a subset of records that have compact attributes. An attribute A of a collection F of records is compact if the width of the range of A-values (for numeric attributes) or the number of distinct A-values (for categorical attributes) of all the records in F does not exceed t. We refer to t as the compactness tolerance. For instance, in the N H L players statistics example, suppose the compactness tolerance imposed on the attributes are: p^osition = 1 (i.e., exact match is required), ?piayedMinutes = 60, Joints = 10 and fpenaityMinutes = 20. Then Cullimore and Wotton are in a 4-D fascicle with Position, Points, Played Minutes and Penalty Minutes as the compact attributes. And Peca and Tkachuk are in a 3-D fascicle with Position, Played Minutes and Penalty Minutes as the compact attributes. 2 If the set of compact attributes were the same for every fascicle, then we would have a situation very similar to a GROUPBY operation. However, in most instances, the compact attributes may be different for different fascicles. If all attributes were compact, a fascicle would reduce to a traditional cluster. In general, the non-compact attributes of a fascicle can have a very large range, so most fascicles are not traditional clusters. Given a fascicle, we can minimize its storage requirement by projecting out all the compact attributes and storing their representative values only once, separately. Even though this compression can be lossy, the loss of information is limited to an acceptable amount specified by the user, through the selection of a suitable compactness tolerance t. To be more precise, suppose we have /V records in a relation with n attributes, each requiring b bytes of storage. The total storage required is Nnb bytes. Suppose we are able to find c fascicles, each with k compact attributes projected out, such that between them these fascicles cover all but N() of the records. Then the storage required for a fascicle with cardinality s is kb+(n-k)bs. Adding up over all c fascicles, we get kcb+(n-k)(N-No)b. Thus, in total, we save (N-No-c)kb bytes of storage. If we are able to get small values for and for c, at least compared to N, then the storage savings would be almost Nkb bytes. Thus, we consider the following problem: [Storage Minimization with Fascicles] Find fascicles such that the total storage is minimized. As presented above, storage minimization with fascicles is achieved at the potential cost of some information loss. A natural question to ask then is: Given the existence of many data compression techniques that offer lossless compression, why should we care about lossy compression with fascicles? A key claim of this thesis is that storage minimization with fascicles is valuable for the following reasons. Existing compression techniques are "syntactic" in the sense that they operate at the data representation level, e.g., byte-level or string-level. Compression with fascicles are "semantic" in nature in that the tolerance t takes into account the meanings and the dynamic ranges of the attributes. The compression is independent of the physical ordering of the tuples. Regarding storage minimization, a key point here is that both styles of compression are orthogonal and can work in tandem. As can be seen in chapter 6.8 . We have presented storage minimization as a pure data reduction problem, and indeed that is the formulation in which it is easiest to evaluate the efficacy of anything that we do. However, data reduction and data mining can sometimes be intricately linked to each other. In the case of fascicles, the linkage is strong and will be made precise below. For instance, corresponding to the appropriate 3-D fascicle, there are about 25% of N H L players who had little impact on the game, i.e., their points, played minutes and penalty minutes all very low. Those players are different only in the position they played. In general, provided that the size of a k-D fascicle is not small, the fascicle represents a significant fraction of the population that more or less agrees on the values of the k attributes, but differs considerably only in the values of the other attributes. This is a pattern that could be of great data mining value. By definition, a record R can be contained in many fascicles. Previously, in storage minimization, we select from among these, one fascicle to contain R for data reduction. Here from a pattern extraction standpoint, the more fascicles generated (and hence more patterns 3 extracted), the better, even if R is contained in multiple fascicles - provided that all these fascicles are of the best "quality". The quality of a fascicle is measured by two components: t, the compactness threshold, and k, the number of compact dimensions. Given the same compactness threshold, a fascicle with more compact dimensions is higher in quality than another with fewer compact dimensions. Similarly, given the same number of compact dimensions, a fascicle with a tighter compactness threshold is higher quality than another with a more liberal threshold. Two fascicles are incomparable in quality if one has more compact dimensions but the other has a tighter threshold. This partial ordering of fascicles, to be formalized in Chapter 7.1 , is denoted as >f. Now given fascicles Fi,...,Fu all containing R, we can define: Fas(R) = {Fi\V\<j<UAJ*i: Ft >f F.} Equation 1 That is, we are only interested in the fascicle with the maximal qualities among F],...,FU. Lower quality fascicles can be derived from higher quality ones. Finally, to generate as many fascicles with the maximal qualities as possible, we maximize the set Fas(R) for each record R. Since Fas(R) is a set, maximization is conducted with respect to the well-known Smyth ordering of power sets[20], denoted as >s, which will be detailed in chapter 7.1 . Thus, we have the following problem: [Pattern Extraction with Fascicles] Given a minimum size m, find fascicles of sizes > m, such that for all record R, Fas(R) is maximized with respect to the ordering >s. 1.4 Contribution This thesis provides algorithms to attack both of the problems presented in the previous chapter. We will first explore similarities with the well-known problems of subspace clustering and association rule mining. We will show that solutions to these well-known problems cannot be applied to our specific problem. Furthermore, we will show that the problem at hand is very hard to solve optimally. We will then proceed to develop algorithms that are non-deterministic, fast, and very scalable when compared to other related algorithms. We will demonstrate empirically that good solutions can be obtained using this non-deterministic approach. We will also investigate the effect of various parameter choices, such as the compactness tolerance and the minimum fascicle size. The major contributions of this thesis are: • The single-k algorithm: We have developed this algorithm to solve the storage minimization problem. A l l fascicles produced by this algorithm will have the same number of compact attributes specified by the user. We will show that the single-k algorithm is ideal when the number of compact attributes are large (e.g. 10 out of 13). • The multi-k algorithm: This algorithm is an alternative to the single-k algorithm. The fascicles that are produced by this algorithm can have a variable number of compact 4 dimensions up to a certain user-specified minimum. We will show that this algorithm can give us better storage when the minimum compact dimensions are small (e.g. 4 out of 13). • Lossless compression: We will show that if we use the fascicles obtained by either the single-k or the multi-k algorithm to re-order the tuples in a relation, we can make syntactic compression scheme more effective when it is applied to the re-ordered relation. • Lossy compression: We can first perform lossy compression with fascicles by removing compact attributes, and then apply normal syntactic compression on the remaining attributes. Experimental results will show that the extra lossy compression step can reduce the final compressed size a few times more. • The multi-k extract algorithm: This algorithm is modified from the multi-k algorithm. We will show that the multi-k extract algorithm is an effective way to solve the pattern extraction problem. 1.5 Thesis Outline The outline of the thesis is as follows. Chapter 2 discusses related work in the area. In Chapter 3 we lay out our assumptions and describe the basic model of fascicles. In Chapter 4 , we develop several algorithms to find fascicles. In Chapter 5 , we present a greedy approach to select fascicles (from among those generated using the techniques of Chapter 4 ) to minimize storage. In Chapter 6 , we present experimental results evaluating the various algorithms for solving the full storage minimization problem. Having completed a study of the storage minimization problem, we turn to the pattern extraction problem in Chapter 0. We describe how the algorithms previously developed need to be modified, and present empirical results. Finally, we conclude with Chapter 8 . Chapter 2 Related Work This chapter covers other works related to our research. We look at the different techniques used in data compression, data clustering, data mining, and dimensionality reduction. Although techniques in these areas can help achieve our goal, we show that none of these techniques alone can solve our basic problem and that a new approach is needed. 2.1 General Data Compression Algorithms There are many general data compression techniques out there such as Huffman Coding and Lempel-Ziv Coding. These algorithms basically try to find the optimal coding for all arbitrary strings in the file being compressed. As an example, we can look at a simplified version of Huffman's coding technique. Suppose we have a text file with 100,000 characters but only 3 unique characters appeared. We can construct the frequency table: 5 A B C Frequency (in thousands) 60 30 10 Table 2: Character frequency table Uncompressed, this file would take 100,000 * 8 = 800,000 bits of storage. However, since we only have three characters to look at, we can assign each character with a 2 bit code. i.e. A = 00, B = 01, C=10. Each character is now only 2 bits, our file size is reduced to 100,000 * 2 = 200,000 bits. We have just reduced our file to 25% of its original size. The actual Huffman code is a little more complicated, involving assignment of variable length code to each character depending on the character's frequency (i.e. characters that appear more frequently will have a shorter code). In chapter 6.8 , we'll show that our techniques are complementary to general data compression algorithms, and string compression can be used to reduce further the storage requirements for our fascicles. Moreover, string compression techniques cannot be used to address the pattern extraction problem. 2.2 Data Clustering Algorithms There are numerous studies on partitioning a data space spanned by a relation or a collection of points, including the studies on clustering (e.g. [1][6][10][17][22]). Almost all clustering algorithms operate in the original data space, corresponding to row partitioning of a relation. For example, if we have a collection of points, then the clustering algorithm would try to identify points that of close proximity to each other as clusters. For example, in Figure 1, we have three clusters defined by points that are close to each other based on Euclidean distance. The data clustering problem is definitely related to fascicles. However, the resulting clusters are based on all of the attributes being close to each other. Fascicles are more relax in that we only need some of the attributes to be similar, but not necessarily all of them. Figure 1: Example of data clusters 6 2.3 Subspace Clustering Algorithm Perhaps the one area that is closest to our definition of fascicle is subspace clustering. One of these algorithm, CLIQUE, is particularly similar to our solution. CLIQUE is a density based method that finds all clusters in the original data space, as well as all the subspaces. Figure 2: Example of CLIQUE We can see a 2 dimensional example of how CLIQUE works in Figure 2. CLIQUE first divides the data space into a grid. Then it proceeds to find all subspace clusters using a "bottom up" approach. In our example, let us say that any box that has more than 2 points would constitute a cluster. CLIQUE would first find out that there are three clusters on the y-axis (row 2, 5 and 6). Then, it will find that there are no clusters along the x-axis. Finally, CLIQUE would conclude that there are no clusters on the original data space. In a post-processing phase, CLIQUE would combine the clusters of row 5 and 6 into one cluster. As one can see, clusters found by CLIQUE are largest regions of connected dense units, where pre-binning is used to create a grid of basic units. This notion of clusters is fundamentally different from fascicles where there are no pre-defined "grid". Furthermore, the study conducted in [1] does not address the storage minimization problem and the pattern extraction problem analyzed here. Despite all the above differences, the algorithm presented in chapter 4.1 can be considered a variant of CLIQUE. As such, it suffers from the same pitfalls of requiring pre-binning and operating in a bottom-up level-wise computational framework. Chapter 6.2 gives various reasons why such a framework is inappropriate for the storage minimization and pattern extraction problems. 2.4 Association Rule Data Mining Algorithms Association rule mining tries to answer the question of which items usually appears together in a transaction in a large dataset. Specifically, if we look at a dataset for items purchased in a supermarket, we may found out that 10% of all transactions involve either milk or cereal. Of all the transactions that involve milk or cereal, 90% of them have both milk and cereal appearing in the same transaction. In association rule mining, we call the 10% the support for this milk-cereal rule. We also call the 90% the confidence of such a rule. For any given relationship to become a rule, such relationship must satisfy a certain user specified support s and confidence c. 7 One popular algorithm for association rule mining is the Apriori algorithm[2][3][13]. This algorithm proceeds in a "bottom-up" way similar to CLIQUE. It first count, for each item, the number of transactions that contains the item, throwing away all items that has a count less than the minimum support s. The remaining item counts are called the large 1-itemsets. After acquiring the large 1-itemset, Apriori starts counting the combination of these large 1-itemset, forming 2-itemsets, and throwing away all 2-itemsets that have a count of less that s. The algorithm continues to iterate forming large itemsets of size 3, 4, up to k. Apriori stops when no more large itemsets could be found. The advantage of the Apriori algorithm is the completeness of this approach. However, as can be seen in chapter 6.2 , this approach suffers from being too slow for our task at hand. The Apriori algorithm considered in this thesis for finding fascicles is a close variant of the Apriori algorithm for association rule mining. There are many studies optimizing the execution of the Apriori framework, including [5] [18] [19] [21]. Given the fact that our Single-k and Multi-k algorithms are orders of magnitude faster than the Apriori algorithm for the tasks at hand, it is unlikely that even enhancing the Apriori algorithm with any of those optimizations would change any of our conclusions. 2.5 Dimensionality Reduction Algorithms One important aspect of fascicles is dimensionality reduction (which leads to compression). In fact, "feature reduction", is a fundamental problem in machine learning and pattern recognition. There are many well-known statistical techniques for dimensionality reduction, including Singular Value Decomposition (SVD)[9] and Projection Pursuit[8]. The key difference here is that the dimensionality reduction is applied to the entire data set. In contrast, our techniques handle the additional complexity of finding different subsets of the data, all of which may permit a reduction on different subsets of dimensions. The same comment extends to FastMap[7], which is an approximation to multidimensional scaling, the SVDD technique[15], which is based on SVD, and the DataSphere technique[ll], which reduces dimensionality by using 2-dimensional summaries. 8 Chapter 3 Model and Assumptions There are altogether three parameters in the fascicle framework. First, there is the compactness tolerance t3. In the storage minimization problem, there is only one value of t, representing the maximum error on an attribute that the user can tolerate in a lossy compression. In the pattern extraction problem, there is a series of compactness tolerance with tj < t2 < ... < tu. These values form a lattice structure so that qualities of fascicles can be compared. Second, there is the minimum size parameter m. For storage minimization, the smaller the value of m, the more fascicles could be found and the larger the potential for minimization. Thus, in principle, m is not a key parameter in the storage minimization problem, although in practice very small fascicles, with only a few records, do result only in small incremental savings in storage. For pattern extraction, however, m plays a more critical role. If the value of m is too small, there are likely to be too many fascicles to provide any major insights into the data. In this setting, m functions as the well-known support threshold in association rule mining. Last but not least, there is the dimensionality parameter k, the number of compact dimensions required of a fascicle. In both the storage minimization problem and the pattern extraction problem, k plays a central role. The fascicles finding algorithms in this thesis treat k as the main independent variable. 3 To be more precise, t should be represented as t because, as shown in the earlier NHL example, each attribute may have its own compactness tolerance. We simplify our presentation here by assuming, without loss of generality, that there is a single tolerance applied to every attribute. Moreover, in practice, more elaborate definitions of compactness are possible. For instance, values of a categorical attribute can be organized in a generalization hierarchy. The address attribute, for example, may be organized in a township-county-state-country hierarchy. In that case, the compactness tolerance may be defined with respect to the depth of the hierarchy. As for numeric attributes, one could specify compactness based on quantiles rather than actual values. The specific definition of compactness is not material for the central algorithms in this paper. For ease of discussion, and in our experiments, we shall use only the definition given in Chapter 1. 9 Chapter 4 Finding Candidate k-D Fascicles In this chapter, we consider the problem of finding k-D fascicles, which is a sub-problem for both the storage minimization and the pattern extraction problems. In later chapters, we will discuss how to select from amongst the candidate fascicles so generated. Below we first present a level-wise algorithm that finds all k-D fascicles by finding all j-D fascicles for all j < k. This algorithm is named after the Apriori algorithm for association rule mining. We then present a search-based algorithm, called the Single-k algorithm, that finds many, but not all, k-D fascicles. Fascicles obtained by the Single-k algorithm all have the same dimensionality. Finally, we extend the Single-k algorithm to form the Multi-k algorithm that finds fascicles with dimensionalities > k. 4.1 Naive, Level-Wise Apriori Algorithm Recently, there has been considerable study of association rule mining. In its first phase, all frequent itemsets (i.e., sets whose support exceeds the minimum threshold) are computed by the well-known Apriori algorithm [2][3][13]. The algorithm basically performs a bottom-up level-by-level computation of the underlying lattice space, which consists of all possible subsets of items. Specifically, it first computes all frequent itemsets of size 1, then those of size 2, and so on. Efficient pruning is made possible because the support of an itemset decreases monotonically as the itemset grows in size. To find k-D fascicles, a similar Apriori algorithm can be used. Here the underlying lattice space consists of all possible subsets of attributes. For a given subset of attributes {Ai,...,Au}, its "support" is the number of records that form a fascicle with Ai,...,Au as the compact attributes. This simplified view applies perfectly to categorical attributes. But for numeric attributes, a pre-processing step is necessary to divide the attribute range into bins of width equal to the specified tolerance t. We call this step pre-binning. Once pre-binning has been performed, the Apriori algorithm proceeds in the usual fashion. It first computes 1-D fascicles whose support exceed the minimum size parameter m. Then it proceeds to compute 2-D fascicles and so on. The usual pruning strategy applies because the set of records supporting the fascicle monotonically decreases in size as the set of compact attributes grows. In evaluating the effectiveness and efficiency of the Apriori algorithm for finding k-D fascicles, we make the following observations: • To compute all k-D fascicles, the Apriori algorithm computes all k-D fascicles for all l<j<k. This bottom-up level-wise strategy is costly except for very small values of k. For our two problems, particularly the pattern extraction problem, the value of k is intended to be as close to the original number of attributes as possible. It is clear that a bottom-up strategy is undesirable. • A bottom-up level-wise strategy would not be so bad if the minimum size parameter m were large relative to the size of the dataset. This is, however, not the case for the storage 10 minimization problem. The value of m can indeed be as low as 2. For low values of ra, it is likely that almost every possible 1-D fascicle has the minimum support. Worse, if the number of 1-D fascicles is u, there could be 0(u 2) 2-D fascicles, 0(w3) 3-D fascicles, so on, and almost all of these have the minimum support required, at least for the first several dimensions. Therefore, the number of candidate sets grows exponentially for the first several passes. • For a given value of k, the Apriori algorithm finds all k-D fascicles with sizes larger than m. Typically, for many circumstances, this completeness property is a bonus. However, as will become clear in Chapter Chapter 5 , the storage minimization problem is NP-hard, and we can only resort to using a greedy approach to select a small percentage of the fascicles from among all the possible candidates. Under the circumstance, computing each and every k-D fascicle does not necessarily help to improve the outcome of the minimization. It only means excessive effort. • Last but not least, there is the pre-binning bias introduced by the Apriori algorithm. The selection of the bins is made on a per attribute basis and is based on the entire relation. If the selection is made "dynamically" each time based on the specific subset of records under consideration, the choice could conceivably be more appropriate. We will have more to say about this effect below. Our problem now, even though well formulated, has a few characteristics quite different from that of association rule mining. First, the number of possible tuple sets in our problem is exponentially larger than the number of possible item-sets in an association rule problem. Second, the number of records in a candidate set can be very large itself. Most association rule algorithms make numbers of passes that are equal to the number of items in the largest item-set with sufficient support. In our case, we will end up with a very large number of passes. Third, the number of attributes matched is itself a variable and needs to be treated appropriately. For all these reasons, we cannot use association rule algorithms "out of the box", but instead have the challenge of devising our own for our considerably more challenging problem definition. 4.2 The Single-k Algorithm 4.2.1 The Basic Approach: a Lattice-based Conceptualization For all the above reasons, our next proposed algorithm is based on a totally different approach. In particular, we seek to find k-D fascicles directly, without first finding k-D fascicles for all j < k. To understand how to do so, consider a lattice comprising all possible subsets of records in the relation of interest. As usual, an edge exists between two subsets S,T if SDT and LSI = 171+7. At the top of the lattice is a single point/set, denoted as T, representing the entire relation. At the bottom of the lattice is a single point ± representing the empty set. Let n be the total number of attributes in the relation, and k(0<k<n) be the number of compact attributes desired in a fascicle. Consider any path <±,Sj,S2,...,T> linking the bottom and top points in the lattice. Trivially, ± and set Si, which consists of a single record, are n-D fascicles. 11 By the definition of a fascicle, it is clear that if 5 is a superset of T, and if 5 is aj-D fascicle, then T must be an i-D fascicle for some i>j. This monotonicity property guarantees that for any given value \<k<n, either T is itself a k-D fascicle4, or there must exist two successive sets St, St+i on the path such that: i) St is an i-D fascicle, ii) S t + i is a j-D fascicle, and iii) j < k < i. We call S, a tip set. To put it in another way, 5 is a (k-D) tip set if S is a k-D fascicle, and there is a parent T of S in the lattice such that T is a j-D fascicle with j < k. Within the class of tip sets, there is a subclass that we call maximal sets. S is a (k-D) maximal set if S is a k-D fascicle, and for all parents T of S in the lattice, T is a j-D fascicle with j < k. Now our task is to find maximal sets. Forming the border discussed in [16], the collection of maximal sets completely characterizes all k-D fascicles. However, given the huge size of the lattice space, the entire collection of maximal sets is so large that its computational cost is prohibitive. Furthermore, as we shall soon see, we cannot in any case afford to select optimally from a given set of candidate fascicles for the problems we seek to address. Therefore, it may suffice to compute some, but not all, maximal sets. Below outlines the Single-k algorithm which selects "good" initial tip sets, and grows a tip set to a maximal set. 4.2.2 Choosing Good Initial Tip Sets Conceptually, to find a tip set, we begin by picking a random record as the first member of a tip set. We then add in a second record at random, and then a third, and so on — until the collection of records is no longer a k-D fascicle. In essence, we are constructing a path <_L,5;,5,2,...,T>. We can repeat this process as many times as we wish, each time exploring a different path up the lattice5. Given a relation that does not fit in main memory, each record sampled would require one disk access. To minimize the amount of I/O activity performed, a commonly used technique is "block sampling", where an entire disk page is read into memory and all records on the page are used. We propose to adopt this approach as well, but there is an important difference. In traditional sampling, possible correlation between co-located records can be compensated for by increasing the sample size. In our case, the specific ordering of records in the sample is of critical importance. Increasing the sample size would be of no help at all. For example, suppose that the hockey players relation is stored on disk sorted by number of minutes played. When we consider successive records, they will very likely have minutes played as a compact attribute, to the potential detriment of other possible compact attributes. 4 If this rather unlikely corner case applies, our problem is trivially solved. To keep our exposition clean, we assume that this corner case does not apply in the sequel. That is, T is a j-D fascicle for some j < k. For all practical purposes, T , consisting of the entire relation, is almost always a 0-D fascicle. Theoretically, it is possible that the same lattice path is generated multiple times if we sample records with replacement from the relation. For large relation sizes, this possibility is remote. Moreover, even if a few candidate fascicles generated are duplicates, no great harm is done, as long as we have enough candidates that are different. However, there is a clear benefit to obtaining tip sets that do not overlap at all. 12 To address these concerns, we read into memory as many randomly sampled blocks of the relation as our buffer space permits. Now we can work purely with the sample of the relation in memory. From amongst the records in memory, we can choose records at random, without paying any attention to block boundaries. Typically, the number of tip sets generated is quite large, and so each record has an opportunity to be considered for multiple tip sets. It is quite easy to make this consideration a systematic sampling without replacement by (conceptually) imposing a random sort order on the records in memory and then considering them sequentially. When one tip set is complete, the first record that would render it non-compact is used to start a second tip set, and so on. Once all records in the sample relation have been considered once, a new random permutation of the records is established, and each record is considered exactly once. Because of the oversampling we perform, it is possible not to work with just a single sample of the relation. Instead, we can produce some tip sets working with the sampled relation described above; when this is done, we can obtain a fresh sample of b completely different pages from the relation on disk, and repeat. Proceeding thus, we can consider the entire relation as q disjoint pieces, where each piece is a random collection of pages. This procedure requires each page to be read only once from disk. 4.2.3 Growing a Tip Set to a Maximal Set Tip sets obtained thus far are confined to records in one piece of the relation. To remove the bias introduced by dividing a relation into pieces according to the buffer space, it is desirable to make tip sets include records from other pieces of the relation. Equally importantly, the above procedure of enlarging a tip set is conducted in a one-record-at-a-time fashion. We seek to speed up this enlargement process in a judicious way. This turns out to be easy to do. A tip set obtained thus far has k compact attributes, corresponding to k attribute value ranges (for numeric attributes) or sets of values (for categorical attributes). These in effect specify a k-D selection predicate on the relation. And we can simply evaluate a query that returns all records matching the query, such as by scanning the entire relation using one more pass. Because the new records added to the tip set do not change the k attribute ranges or values, the newly grown tip set is still a k-D fascicle. We refer to this process as the tip set growing phase. Before the growing phase, suppose that the compact attribute ranges and values of a tip set are already at the maximum "width" allowed by the compactness tolerance. Then it is easy to see that after the growing phase, the newly grown tip set has indeed become a maximal set. That is, it is impossible to add any new record to it to keep it a k-D fascicle. For some tip sets, however, there may be at least one of its compact attribute ranges that still has some room to reach the maximum allowable width. Even after the growing phase, those tip sets may still not be maximal. One solution is to allow those ranges to expand during the scan in the growing phase — so long as they remain within the specified width. For example, suppose the 13 current tip set's salary range is [30K,35K] but the width is 8K. Then when a record Rj whose salary value is 37 K is encountered.the record can be added to the tip set with the salary range updated to [30K,37K] (provided that the other attributes of Ri do not force any of the compact attributes of the tip set to become non-compact). This simple range expansion strategy, while easy to implement, does not guarantee the best possible result. For example, suppose after the range has been updated to [30K,37K], two records R2,R3 with salary values 27K and 28K respectively are encountered. Because the above range expansion strategy is basically a greedy approach, these two records would not be added to the tip set. In order to do the latter to achieve the best possible results, all of the records (and many others) would need to be remembered and their various combinations would need to be considered. Experimental results have shown that the fascicles we got from the simple strategy were not all that different from the latter approach. Moreover, after the greedy selection stage described in chapter Chapter 5 , both strategies ended up producing similar results. For what it is worth, the extra computational effort appears not to be justifiable. Similarly, a few tip sets generated may have more than k compact attributes because the addition of the next record would have resulted in two or more of these being rendered non-compact. When we grow such a tip set, we have a choice of which dimensions to keep compact. Once more, we simply let the scan order dictate. In summary, Figure 3 shows a skeleton of the Single-k algorithm. It is a randomized algorithm that computes many — but not all — k-D fascicles, all of which are maximal sets. No matter what the available amount of buffer space is, the algorithm reads each data page from disk at most twice (i.e., once for Step 2 and once for Step 3). Unlike the Apriori algorithm, the algorithm does not compute j-D fascicles for any j different from the desired k. Also, unlike the Apriori algorithm, it does not require pre-binning; compact ranges are defined entirely based on the set of records under consideration. In fact, the algorithm computes tip sets whose compact ranges may overlap. A l g o r i t h m Single-k Input: a main memory buffer of size b pages, a dimensionality k, an integer P, a relation R comprising no more than b • q pages, but more than b(q-l) pages. Output: k-D fascicles 1. Divide into disjoint pieces, each comprising up to b randomly chosen pages from R. 2. For each piece: /* choosing initial tip sets */ 2.1 Read the piece into main memory. 2.2 Read the records in main memory to produce a series of tip sets as discussed in Chapter 4.2.2 . 2.3 Repeat 2.2 with an appropriate number of times, each with a different permutation of the records, until P/q tip sets are obtained. 3. /* growing the tip sets */ Grow all P tip sets with possible range expansion, as discussed in Chapter 4.2.3 , with one more pass over the relation. Output the grown tip sets. Figure 3: A Skeleton of the Single-k A l g o r i t h m 14 4.3 The Multi-k Algorithm Recall that for both the storage minimization problem and the pattern extraction problem, we may allow fascicles to vary in their dimensionalities. The Single-k algorithm, however, only produces fascicles with one specified value k. The problem here is that the algorithm may miss out on the opportunities offered by the fascicles with higher values of k. For instance, there could be a fascicle of dimensionality 10 that is a subset of a fascicle (of dimensionality 6) found by the algorithm. The higher dimensionality fascicle, if known and exploited, can clearly lead to additional reduction in storage cost. One simple way to overcome the above problem is to run the Single-k algorithm repeatedly, each with a different value of k. But each execution of the algorithm share many repetitive operations, including the relation scans. To minimize the shared operations, we extend the Single-k algorithm to the Multi-k algorithm which produces fascicles all having dimensionalities > k. Recall from the Single-k algorithm how a k-D tip set corresponds to a path <L,Si,S2,...,St>. Observe that as we move from the beginning of the path to the end of the path, the dimensionality of the corresponding fascicle decreases monotonically. Thus, while the Single-k algorithm constructs a path <L,Si,S2,...,St> and obtains St as a k-D tip set, the Multi-k algorithm uses exactly the same path to obtain for each value i>k, the largest set on the path with dimensionality i, if it exists. Let us illustrate this by an example. Suppose there are a total of n=\2 attributes in the relation. Suppose k=6, and the path corresponding to a particular 6-D tip set computed by the Single-k algorithm is <L,Si,S2,---,Sw>- The following table shows the dimensionality of each set on the path: Set 1 s, S2 s 3 5 4 Ss S6 Si Sz S9 Sio k 12 12 12 11 11 11 9 9 8 6 6 Tab le 3: Itemset dimensional i ty pa th The Multi-k algorithm uses the same path to obtain the tip sets S2, S5,S7,Ss and Sio with dimensionalities 12, 11, 9, 8 and 6 respectively. To complete the description of the Multi-k algorithm, these tip sets with varying dimensionalities can all go through the same growing phase together to become maximal sets, sharing one relation scan. Thus, Figure 3 presents an accurate skeleton of the Multi-k algorithm, provided that Step 2.2 is modified to implement the procedure discussed in the previous paragraph. 15 Chapter 5 Greedy Selection for Storage Minimization In the previous chapter, we have presented the Apriori, the Single-k and the Multi-k algorithms for finding fascicles. These algorithms represent the first step towards solving both the storage minimization problem and the pattern extraction problem. This chapter discusses how these algorithms can be completed to solve the full storage minimization problem. 5.1 Greedy Selection with Batch Weight Re-adjustment Given a collection of candidate fascicles produced by the algorithms described above, because the fascicles may overlap, the remaining task is to choose some of these fascicles so that the total storage is minimized. To simplify our discussion, let us consider for the moment the "un-weighted" version of the above task: find the minimum number of candidate fascicles that cover the whole data set. This turns out to correspond to the well known minimum cover problem [12]. That is, given a collection C of subsets of a finite set S and a positive integer K, is there a subset C ' c C with \C\<K such that every element of 5 belongs to at least one member of C ? The minimum cover problem is NP-complete. To the best of our knowledge, greedy selection is among the best heuristics that exist for solving the minimum cover problem. Now to return to the task of picking fascicles to minimize total storage, we adopt a greedy approach with each fascicle F weighted by: wt(F)=k*\F\, where k is the dimensionality of r . The weight of F corresponds to the storage savings induced by the fascicle, i.e., k fewer attributes to store for each record contained in F. In a straightforward implementation of the greedy selection, we select the candidate fascicle with the highest weight, subtract from all the remaining fascicles records that are in the selected fascicle, and adjust the weights of the remaining fascicles accordingly. Specifically, if A is the selected fascicle, then the adjusted weight of each remaining fascicle F is given by: wt(FIA) = k*\F-A\ Equation 2 Then from among the remaining fascicles, we pick the one with the highest adjusted weight, and repeat. This implementation, while simple, suffers from the problem that for each selection made, we need to perform for every remaining fascicle, set subtraction, weight re-adjustment and re-placement of the fascicle in the sorted list of candidates. This overhead is considerable. To reduce the overhead, we batch the above operations as follows. As before, we begin with all the candidates sorted in descending order of weight, and we pick the fascicle F\ with the highest 6 k is the same for all fascicles produced by the Single-k algorithm. As such the multiplication by k can be dropped. However, for continuity with the chapters that follow, and for a better conceptual understanding, we show the multiplication explicitly. 16 weight. Then we walk down the sorted list and pick the first fascicle F2 that does not overlap with F\ . Next we pick the first fascicle F3 in the sorted list that does not overlap with F1UF2, and so on. This process is stopped when there are no more candidates that do not overlap with all the picked fascicles, or when a sufficient number of fascicles (e.g. 10) have been picked in this round. It is only at this point that we conduct for each remaining fascicle a set subtraction and weight re-adjustment, and re-sort the list of candidates. In other words, these operations are only done once per batch/round, instead of once for every fascicle selected. 5.2 Weight Re-adjustment for the Multi-k Algorithm The weight adjustment formula is more complicated for the Multi-k algorithm, because the candidate fascicles may have varying dimensionalities. More importantly, as described in Chapter 4.3 , the Multi-k algorithm can generate many pairs of candidates F i , F 2 such that: (i) Fi<=F2 , but (ii) k\>fc2, where k\ and fc2 are the dimensionalities of the fascicles. Empirically, we observe that more often than not, the weight of the lower dimensional F2 is higher than that of the higher dimensional F\, i.e. the extra size of F2 more than offsets its lower dimensionality. In this case, the greedy selection picks F2 over F\. Then if one were to use the formula given in Equation (2), the new weight of F\ would be reset to zero. Consequently, even though the Multi-k algorithm produces candidates of varying dimensionalities, which is desirable as motivated in Chapter 4.3 , the above greedy selection procedure tends to negate the effort by picking only the fascicles with the minimum dimensionality. It is easy to see that even after F2 is selected, F i can still be chosen to provide further storage savings, albeit only to the records contained in F\. Consider the following example, where IFil=50, &i=8, 17*21=100 and £2=6. F2 is selected because its storage savings is 600, as compared with 400 from F\. However, for the 50 records contained in both F\ and F2 , they can indeed be compressed further to 8 compact dimensions using F\. This means an additional savings of 50*(8-6) is possible. There are at least two ways to augment the batch greedy selection procedure to deliver the kind of additional savings illustrated in the above example. The first way is to maintain explicit inclusion relationships that exist among candidates. Recall from Chapter 4.3 that one search path <_L,Sy,52,...,T> gives rise to multiple candidate tip sets Si, S2 with decreasing dimensionalities. What we could do at that time is to explicitly record and maintain the inclusion relationship between Si and S 2. Consequently, when S2 is chosen by the greedy selection procedure, we could use the inclusion relationship to also select Si (and all its higher dimensional fascicles along the same initial search path) to check if additional storage savings is possible. While this solution deals with situations involving fascicles completely contained in other fascicles, such as Si c S2 in the above example, it cannot handle a more general situation whereby Si and S 2 only partially overlap. To illustrate, let us modify the above example. Instead 7 When two sets do not overlap, checking whether they overlap takes as much time as finding their difference. However, when they do overlap, the former operation takes much less time than the latter. In the context of picking candidate fascicles, this saving is considerable. 17 of Si being a strict subset of 52, let 5i overlaps with 52 on 50 records. Then by exactly the same argument as in the previous example, an additional savings of 50 *(8-6) is again possible. This motivates the second solution. To reflect this additional savings possibility, we re-adjust the weight of a fascicle in a way more general than in Equation 2: k*\F-A\+(k-kA)*\FnA\ ifk>kA wt(F/A) = { k*\F-A\ ifk<kA Equation 3 If the selected fascicle A is of the same or higher dimensionality, then the weight of F is re-adjusted as usual. However, if A is of a lower dimensionality, then the weight of F includes the component (k-kA)*\FnA\, corresponding to the additional storage reduction of the records in both A and F. To compare the Single-k algorithm with Equation 2 and the Multi-k algorithm with Equation 3, we make the following observations: • The kind of batch subtraction that is effective for equation is no longer as good. This is because for Equation 2, if the fascicles selected in this round are F\,...,FU, it is sufficient to apply the equation once simply with uUj=i Fj. However, for Equation 3, because the dimensionalities of F\,...,FU are explicitly used, and these dimensionalities are not necessarily identical in value, the computation is more involved. • Recall that in the case of the Single-k algorithm, because of batch subtraction, the fascicles selected by the greedy procedure are all non-overlapping. The situation for the Multi-k algorithm is more complicated. After all the fascicles have been picked with Equation 3, the selected fascicles may overlap, which is the strength of the Multi-k algorithm. The price to pay is, however, to find out for each record, the right fascicle to use to contain and compress the record. This can be done by sorting the selected fascicles in descending order of dimensionality. The right fascicle for a given record is the first fascicle that contains the record. This final step for the Multi-k algorithm is not particularly expensive; nonetheless, it represents extra overhead. 18 Chapter 6 Experimental Evaluation for the Storage Minimization Problem This chapter evaluates the Apriori, Single-k and Multi-k algorithms for the storage minimization problem. We first show how the three algorithms compare with each other regarding their efficiency and effectiveness. We then show how the Single-k and Multi-k algorithms behave with respect to the parameters: the minimum dimensionality k, the compactness tolerance t, and the minimum fascicle size m. We also show how the algorithms scale up with respect to data set size. 6.1 Datasets used and Experimental Setup Since we seek to exploit hidden patterns in the data, it is hard to perform any meaningful experiments with synthetic data. We used two very different real data sets for all our experiments: • A T & T Data Set: This data set has 13 attributes describing the behavior of A T & T customers, one per record. Some attributes are categorical (e.g., calling plan subscribed), and others are numeric (e.g. minutes used per month). This is a large data set, of which 500,000 records were extracted randomly and used for the bulk of the experiments. We will make explicit reference wherever a different data set size was used. • N H L Data Set: We have already seen a sample of the N H L data set earlier in the paper. For each of about 850 players, there is one record with 12 attributes describing the position the player played, the minutes played, the points scored, and so forth. Among numerous other sites, N H L players statistics can be obtained from http://nhlstatistics.hypermart.net. A l l experiments described in this chapter were run with both data sets, and produced very similar results. For clarity, we only show in this chapter the results from the larger A T & T data set. The results for the N H L dataset can be found in the Appendix. A l l experiments for the A T & T data set were run in a time-sharing environment provided by a 225MHz UltraSparc workstation. As shown below, the behavior of the algorithms may depend on the choices of values of several parameters: the dimensionality k, the compactness tolerance t, the number P of tip sets, the minimum size m of a fascicle, and the data set size. Unless otherwise stated, the default values are: data set containing 500,000 records, P=500, m=88, and r=l/32, which means that for a numeric attribute A, t is set to be 1/32 the width of the range of A-values in the data set, and that for a categorical attribute A, t is set to be fw/32l, where w is the total number of distinct A-values in the data set. Strictly speaking, for storage minimization, the minimum size of a fascicle can be as low as 2. However, to allow the Apriori algorithm some pruning power, but not too much to skew the storage minimization problem, we arbitrarily set to a small integer. 19 For the results reported below, we give figures for runtimes of the algorithms, relative storage and coverage. Runtimes are given in seconds of total time (CPU + I/O). Relative storage is defined to be the ratio of the size of the compressed data set (i.e., storing compact attribute values only once for the entire fascicle) to the original size. The smaller the relative storage figure, the more effective the compression is. Coverage is defined to be the percentage of records that are contained in some fascicle after the greedy selection. 6.2 Inefficiency of the Apriori Algorithm Table 4 compares the Apriori algorithm with the Single-k algorithm for computing candidate k-D fascicles, for various values of k. To ensure that Apriori terminates in a reasonable amount of time, we used a data set of 1,000 records extracted at random from the larger A T & T data set. The numbers for the Multi-k algorithm are very similar to those for the Single-k algorithm for the small data set used in this experiment, and is therefore not presented. An in-depth comparison of these two algorithms is presented shortly. The difference in runtime performance between Single-k and Apriori is truly astonishing. Even for 1,000 records, the time required for Apriori for k>5 is so large that we did not even find it possible to continue running the process on our computer. For smaller values of k, Single-k dominates Apriori by orders of magnitude. Single-k Apriori Runtime Relative Coverage Runtime Relative Coverage Storage Storage 2 3.5 0.846 0.999 13.5 0.848 0.992 3 3.5 0.772 0.988 299.8 0.780 0.967 4 3.4 0.700 0.974 1591.0 0.750 0.846 5 3.4 0.640 0.941 >2500 n/a n/a Tab le 4: Inappropriateness of the A p r i o r i A l g o r i t h m for Storage M i n i m i z a t i o n The only remaining question as far as the usefulness of the Apriori algorithm is concerned, is whether it gives substantially better storage and coverage results, possibly justifying the additional computational cost. This could be the case because Apriori generates every candidate fascicle, whereas Single-k only generates some fraction of these sets. Surprisingly, Apriori actually did worse. Table 4 shows that Single-k gives slightly better relative storage and coverage figures than Apriori. In fact, the difference in coverage widens as k increases. The reasons for this phenomenon may lie in the "pre-binning" step required by Apriori. The first possibility is that the current pre-binning is of low quality, and that there may exist a better pre-binning scheme. In particular, in the current pre-binning, because bins do not overlap, different values of the attribute, once classified into different bins, cannot participate in the same fascicle, even if the values differ by less than t and are close to the bin boundary. The second possibility may be that pre-binning, regardless of how 20 good we are doing it, is inappropriate. This points to the fact that binning on a fascicle-independent and per-attribute basis is too inflexible. To examine the situation more closely, we modified Apriori to eliminate the occurrence of the first possibility. That is, in Apriori-plus, we created enough (overlapping) bins such that for each attribute value v, there is one bin with v as either the left or right boundary of the bin. Amongst those numerous bins, the most appropriate ones are then selected out as the algorithm progresses. Needless to say, with this modification, the runtime required by Apriori-plus becomes even more hopeless than Apriori, on account of the vast increase in the number of bins considered (e.g., 140 seconds for k=2, and >2500 seconds for k=3). What is really of interest about Apriori-plus is its relative storage and coverage figures. For k=2, the relative storage is 0.848 and coverage is 0.994 — in the middle between Apriori and Single-k. That Apriori-plus gives higher coverage than Apriori is expected. That Apriori-plus gives lower coverage than Single-k indicates that any binning done on a fascicle-independent, and per-attribute basis reduces effectiveness. As can be seen from Table 4, the coverage of the Single-k algorithm drops slightly as k increases, but its relative storage improves. In other words, the increasing k factor far dominates the decreasing coverage factor. Since the ultimate measure of performance for the storage minimization problem is storage requirements, we shall focus primarily on storage figures. 21 (a) Storage Requirement 0 500 1000 1500 2000 2500 3000 3500 P, number of t ip sets -»-S ing le-k(1 /32,8) -®-Mul t i -k(1/32,8) Multi-k(1/16,8000) - * - Multi-k(1/16,8) (b) Runtime 0) E P, number of tip sets — • — Single-k:total — • — Multi-k:total > Single-k:greedy — K — Multi-k:greedy Figure 4: The Effect of P 22 6.3 The Effect of P, the Number of Tip Sets We now study the effect of the various parameters on the Single-k and Multi-k algorithms. We begin with the internal parameter P, the number of tip sets generated, shows how P affects the relative storage and runtime of the two algorithms. We ran with: k=A, P from 250 to 4000, m from 8 to 8000, and t varied between 1/16, 1/32 and 1/48. In terms of storage requirements, we make the following observations: • Even for as few as P=250 tip sets, for a data set of 500,000 records, both algorithms offer a storage savings of over 30%. This indicates that both algorithms are effective in generating (at least some) good quality candidate fascicles. This also confirms that Apriori's strategy of generating every candidate fascicle is wasteful. • Single-k algorithm is relatively insensitive to P and gives a relative storage around 0.7 for all values of P with r=l/32 and m=8. With other combinations of t and m, the relative storage changes in a way to be described in chapter 6.5 and chapter 6.6 . In all cases, we still get a very flat curve. • In contrast, the Multi-k algorithm can be more sensitive to P. For example, with r=l/32 or 1/16 and m=8, Multi-k becomes more effective in storage reduction with increasing P values. (Here we focus on the shape of the curve; the absolute position of the curve regarding changes in t and m will be discussed in chapters 6.5 and 6.6 ). This is the case because for the Multi-k algorithm, P corresponds to the total number of j-D tip sets for all j>k. For example, among a total of 4,000 tip sets found by Multi-k, there may only be around 500 4-D tip sets (there are 13 attributes). Thus, more tip sets are required for the Multi-k algorithm to stabilize in relative storage. However, with m=8000, a situation where there are not many tip sets to begin with, even Multi-k becomes insensitive. In terms of runtime, Figure 4(b) shows (i) the total time and (ii) the portion of the total time spent on greedy selection for both algorithms with r=l/32 and w=8. (The runtime trends for other combinations are the same). The total time increases linearly with P, and there is little difference between the two algorithms. But Multi-k takes a lot longer in greedy selection than Single-k does. This is the overhead in conducting the more sophisticated weight re-adjustment shown in chapter 5.2 . For the remaining experimentation, because the runtimes of both algorithms increase linearly with P, and the improvement of storage from using a larger P value tails off quickly, we consider P=500 as a reasonable choice for our data sets. A l l remaining comparisons of the two algorithms are based on the Multi-k algorithm obtaining exactly the same number P of tip sets as the Single-k algorithm. In previous paragraphs, we have explained how Multi-k would compare if a larger value of P were used. 23 6.4 The Effect ofk, the (Minimum) Dimensionality of Fascicles Next we turn our attention to the parameter k. Figure 5(a) shows the relative storage provided by both algorithm as k varies from 2 to 12. For the Single-k algorithm, we observe that the storage savings is maximized for an intermediate value of k. When k is small, only a small amount of savings in storage is achieved through compaction — precisely because only a small number of attributes are being compressed. As k increases, there are two opposing forces: (i) k itself favors further storage reduction, but (ii) coverage decreases, thereby decreasing the number of records that can be compressed. Figure 5(b) shows the coverage figures as k varies. Initially the k factor dominates the coverage factor, giving a minimum relative storage below 0.5. But as k increases past the optimal value, in this case &opt=8, the drop in coverage dominates. 24 (a) Storage Requirement 0 2 4 6 8 10 12 k, (minimum) dimensionality of fascicles Storage-single-k — a — Storage-multi-k (b) Coverage 0 2 4 6 8 10 12 k, (minimum dimensionality of fascicles Coverage-single-k —m— Coverage-multi-k Figure 5: O p t i m a l k for Storage M i n i m i z a t i o n For the Multi-k algorithm, there is also an optimal value of k. This is the case because as explained in chapter 5.2 , the greedy selection procedure tends to select first fascicles of the 25 minimum dimensionality. So in this sense, it behaves like the Single-k algorithm. However, the multi-k algorithm differs from the Single-k algorithm in the following ways: • It is less sensitive to k. This is because even if the input k value is smaller than the actual optimal value, Multi-k algorithm is able to produce candidate fascicles of dimensionality > k. • When k is small, observe from Figure 5(a) and (b) that despite having a lower coverage, Multi-k provides better compression than Single-k. This is again due to the multi-k's ability to produce candidate fascicles of dimensionality > k. • For larger values of k, however, this ability of Multi-k becomes less important because there are fewer fascicles of dimensionality > k. At that point, coverage becomes more correlated with relative storage, and Multi-k lags behind Single-k in effectiveness. This is largely a consequence of our forcing both algorithms to generate the same number P of tip sets. The effectiveness of Multi-k can be improved by using a larger value of P — at the expense of a larger runtime. As expected, the runtime of the Single-k algorithm is independent of the value of k. Same comment applies to Multi-k, which produces the same number of tip sets as Single-k (see Figure 6).. Runtime 2500 2000 ? 1000 u , , _ , , , .. .. 3 4 5 6 k, (minimum) dimensionality of fascicles • Single-k Runtime —o— Multi-k runtime Figure 6: Run t ime is independent of k 26 6.5 The Effect of t, the Compactness Tolerance This is rather straightforward. As expected, as t increases, more records qualify to participate in a fascicle, and so there are more opportunities for storage reduction (Figure 7). Coverage 100% o o 70% I 1 1 1/16 1/32 1/48 t Figure 7: The compactness tolerance's effect on coverage A more interesting effect is that the kopt value that minimizes storage changes as t varies. Table 5 shows the kopt and the corresponding relative storage numbers for various t values. For both algorithms, as t increases, larger values of k are desirable. In other words, as we become more forgiving in our lossy compression, it is fruitful and possible to make more dimensions compact. Single-k Multi-k kopt Relative Storage ... kopt Relative Storage 1/48 8 0.486 8 0.598 1/32 8 0.472 8 0.590 1/16 10 0.423 10 0.500 Tab le 5: The Effect o f t, the Compactness Tolerance 6.6 The Effect of m, the Minimum Fascicle Size As the minimum size m increases, fewer fascicles can assist in storage minimization. Table 6 shows this phenomenon. As for the optimal kopt value, it decreases as m increases. Unlike in the previous case of increasing t, this time as we become more restrictive in admitting candidate fascicles regarding their sizes, the higher dimensional fascicles reduce more drastically in numbers, and thus in their effectiveness for storage minimization. 27 m Single-k Multi-k kopt Relative Storage kopt Relative Storage 8 8 0.472 8 0.590 800 8 0.487 8 0.598 8000 7 0.562 7 0.630 Table 6: The Effect of m, the Minimum Fascicle Size 6.7 Scalability with respect to Data Set Size Figure 8 shows how the two algorithms scale up with respect to increasing data set size. Both algorithms scale up linearly. While the results shown in the figure are based on k=4, the results generalize to other values of k. The reason why Multi-k runs faster than Single-k is strictly a consequence of our forcing both algorithms to produce the same number of tip sets (P=500 as usual). In other words, it takes the Multi-k algorithm less time to produce 500 tip sets since multi-k produces more tip sets per disk block (see Chapter 4.3 ). In sum, we have shown empirically how the two algorithms behave when the various parameters change. Even with a small number of tip sets (e.g. P=500 for 500,000 records), both algorithms are capable of compressing the data significantly (e.g., to be below 50% of the original size). Both algorithms scale up linearly with data set size. As for choosing between the two algorithms, Multi-k is our choice for small values of k, but Single-k can deliver more storage reduction for large values of k in a fixed amount of runtime. 28 Size of dataset VS runtime 6.8 Further Compression In chapter 2.1, we stated that our algorithm could be used in conjunction with other compression scheme to offer further data reduction. Table 7 shows the result on using two common compression algorithms on our dataset. The algorithms are gzip and compress on the Unix platform. We used ASCII data for gzip and binary data for compress. We used a dataset size of 500k, k = 8, and various values of t. We also organize our fascicles in two ways. The first way is simply reorganizing the dataset so that all tuples that are in the same fascicle would be put next to each other, without any loss of data. The second way is to store only the non-compact attributes for all fascicles, throwing away the compact attributes. We call the first way "Lossless", and the second way "Lossy". The results shown are the result dataset size when compared to the original dataset. Gzip Compress Lossless Lossy Lossless Lossy Original dataset 41.9% 67.6% t = 1/32 34.2% 14.0% 55.4% 20.0% t = 1/16 31.5% 14.0% 51.4% 20.0% t = 1/10 34.2% 14.2% 54.0% 20.4% Table 7: Effect of addition compression scheme on fascicles 29 As expected, by throwing away the compact attributes, we can achieve a very good compression ratio when a compression routine is used on top of our fascicles. Another interesting result is that even simply reorganizing the dataset according to fascicles can help general compression algorithms in compressing the data. Perhaps this is not too surprising since general data compression works best when there are many repeated values and fascicles can help group similar values together. Chapter 7 Solving the Pattern Extraction Problem Pattern extraction, unlike storage minimization, is not a well-defined problem — there are many different types of patterns one may wish to extract from a data set, and many different possible metrics to assess the quality of the extracted patterns. So, we begin this chapter by defining our pattern extraction task in precise terms. Then we develop an algorithm to carry out the task, which is an extension to the Multi-k algorithm. Finally, we show some empirical results. 7.1 The Pattern Extraction Problem with Fascicles By definition, a k-D fascicle contains records that more or less agree on the values of the k compact attributes, and differ considerably only on the values of the other attributes. Provided that the size of the fascicle is not small, this may represent a pattern or a piece of knowledge of great data mining value. Given a minimum size m, like the minimum support in association rules, we say that a fascicle is frequent if it contains at least m records. Frequency alone, however, is not enough to guarantee the generation of fascicles (and patterns) of the highest quality. To do so, we need a quality ordering on fascicles and sets of fascicles. Given two fascicles F\, F2 with dimensionality k\, and compactness tolerance t\, h respectively, we say that: Fx{kvtx)>f F2{k2,t2) iff kx>k2 and tx<t2 Equation 4 The intuition is that for any fascicle F\(k\, t\), F\ is automatically a fascicle of the quality pair (&2, t2) where <, k\ and h > t\ . For any record R contained in both F\, Fi with F i >f F2, we naturally prefer F i . For the pattern extraction problem, there is a pre-defined range of dimensionalities [fcmin, &max] and a series of compactness tolerance t\ < ... < tu. Given these, the partial ordering in (4) defines a lattice. Figure 9 gives an instance of the quality lattice for pattern extraction from the N H L data set. In this instance, the dimensionality k varies from some minimum (not shown in the figure) to 12, and the compactness tolerance can be 1/8, 1/16 or 1/32 (cfChapter 6.1 ). In the figure, a directed edge indicates the partial ordering relationship, e.g. (12,1/32) (12,1/16). 30 The ordering >f defined on fascicles can be extended to give an ordering >s on sets of fascicles. To do so, we rely on the following ordering: 5, >f S2 iff VF2 e 52,3F, e 5, :F, >r F 2 E q u a t i o n 5 This is the well-known Smyth ordering for power sets [20]. For example, with the lattice shown in Figure 9, the sets 5i = [(12,1/8), (10, 1/16)} and 5 2 = [(11,1/32)} are mutually incomparable. However, the set 53 = {(12,1/32)} dominates both, i.e., 53 >s Si, Sj >s 52. Increasing k value (12, 1/32) 4 (11,1/32) 4 (10, 1/32) 4 Decreasing • compactness tolerance Figure 9: A n E x a m p l e : Pa t te rn Ex t r ac t ion w i t h Fascicles for the N H L data The ordering >s in Figure 9 allows us to choose between sets of fascicles. Specifically, it defines equivalence classes of fascicles. Thus, it is possible to have two sets Si >s 52 and 52 ^ s 53 without having S\ = 5 2. An example are the sets {(12, 1/32)} and {(12, 1/32), (11, 1/32)}. In other words, the equivalence classes defined by >s are not necessarily minimal in the set inclusion sense. This is particularly relevant to our task at hand, because a record R may be contained in many fascicles. To ensure that the final generated fascicles and patterns are of the highest quality, we define Fas(R) given in Equation 1 to ensure that among the set of fascicles all containing R, only the maximal ones with respect to >f can keep R. Hence, putting all the pieces together, the pattern extraction problem we are tackling here is one that given a pre-defined range of dimensionalities [k„an, fcmax] and a series of compactness tolerance t\ < ... < tu, and a 12 11 10 (i2, i/sr I (11, 1/8)" (10, 1/8)" 4 "(12, 1/16) I "(ll, 1/16) 4 "(10, 1/16) 4 31 rninimum size m, tries to find fascicles of sizes > m, such that for all record R, Fas(R) is maximized with respect to the ordering >s. There are two important details to note in the above definition of the task. First, a record R that is contained in both F\ and F2 with Fi > F2, does not contribute to the size/frequency count of F2. In other words, for F2 to be considered frequent, there must be at least m records for which F2 is one of their maximal fascicles. Second, one may wonder why we go to the complication of finding fascicles to maximize Fas(R) for each record R, but not simply find the fascicles with the maximal qualities. Under this second approach, if we have fascicles of qualities (11,1/16), (10,1/32) and (9,1/32), those of qualities either (11,1/16) or (10,1/32) would be output, but those of quality (9,1/32) would be discarded. Our definition of the problem is strictly more informative than this second approach in that while fascicles of qualities either (11,1/16) or (10,1/32) are output, fascicles F of quality (9,1/32) are also output — provided that there are at least m records for which F is maximal. The argument is similar to the Multi-k fascicle selection problem discussed in chapter 5.2 . 7.2 The Multi-Extract Algorithm 7.2.1 Computing Fascicles for all (k,t) Pairs Just like the storage minimization problem, there are two steps in solving the pattern extraction problem. The first step is again to find the fascicles. To that end, we can consider the three algorithms studied previously. First, let us consider the Apriori algorithm. For the task at hand, the advantage of the Apriori algorithm is its completeness property. As far as data mining is concerned, it is nice to be able to find all fascicles for a given (k,t) pair (but subject to the bias introduced by pre-binning). However, the disadvantage is that, as shown in the example in Figure 9, the best patterns are those corresponding to fascicles with high dimensionality, i.e., the closer to the total number of attributes, the better. A bottom-up, level-wise computation simply requires immense effort to get there. The experimental results shown in chapter 6.2 confirm this concern. Furthermore, recall that for the Apriori algorithm to work, pre-binning is required with respect to a particular tolerance. In other words, each compactness tolerance would require one execution of the Apriori algorithm. The total amount of computational effort is prohibitive. Next consider the Single-k and Multi-k algorithms. To use the Single-k algorithm for multiple values of k and t, we have to run the algorithm once per (k,t) pair. Even though the Single-k algorithm is fast, the expense of running it multiple times can quickly get to be prohibitive. The situation for the Multi-k algorithm is better. Because the Multi-k algorithm generates tip sets for all values of k under consideration, it is sufficient to run the algorithm once for each value of t. As it turns out, it is easy to adapt the Multi-k algorithm to do even better. Recall that there is an initial tip set generation phase, followed by a phase to grow tip sets into maximal sets. Iterating the Multi-k algorithm for multiple values of t amounts to running both the generation and the growing phase multiple times. We can be even more efficient by running the generation phase multiple times, but only once for the growing phase. More specifically, we iterate the generation phase (over each chunk read into memory) to produce tip sets for different values of t. This does 32 not require any additional disk I/O. Then we grow all tip sets of varying dimensionalities and compactness tolerance simultaneously in one final scan of the relation. This is summarized in the first two steps of the Multi-Extract algorithm shown in Figure 10. Algorithm Multi-Extract Input: relation R, minimum dimensionality k, minimum size m, a series of tolerance ?i<f2<...<fu, integer P. Output: fascicles F having dimensionality > k and containing at least m records for which F is maximal. 1. Divide R into q disjoint pieces (as in the Single-k algorithm), each comprising a set of randomly selected pages of R. For each piece: /* choosing initial tip sets */ 1.1 Read the piece into main memory. 1.2 For each value of t{. I* iterating over different tolerance values */ 1.2.1 Read the records in main memory to produce a series of tip sets like in the Multi-k algorithm. 1.2.2 Repeat 1.2.1 an appropriate number of times, each with a different permutation of the records, until Plq tip sets are obtained. 2. /* growing all tip sets together */ Grow all the tip sets with possible range expansion, as discussed in chapter 4.2.3 , with one more pass over the relation. 3. /* maximize Fas(R) */ For all the tip sets F obtained above, process in ascending rank: 3.1 -F r e m = F - ( U O F G ) , where the G's were frequent tip sets obtained in previous iterations of the loop. 3.2 If IF r e m l > m, output F r e m . Figure 10: A Skeleton of the Multi-Extract Algorithm 7.2.2 Maximizing Fas(R) Once candidate fascicles have been generated, the second step for pattern extraction is quite different from that for storage minimization. In the case of the latter, we sought to find for each record a single fascicle to minimize the storage of the record. For pattern extraction, a record can be put in multiple (frequent) maximal fascicles. To achieve this, we define the rank of a fascicle, with respect to the partial ordering given in Figure 10. Given the range of dimensionality [kmin, k^] and a series of compactness tolerance t\<...<tu, we have the following inductive definition of rank: 33 • (base case) rank(F(fcmax, t)); and • (inductive case) rank(FOU))=1 +min {rank(F(fc+1 ,fO),rank(F(IU-i))} For the example given in Figure 9, the series of compactness tolerances is 1/32, 1/16 and 1/8, and the largest dimensionality is 12. Accordingly, all fascicles F(12,l/32) are assigned the top rank, i.e., rank=l. By the inductive case of the definition, all fascicles F ( l 1,1/32) and F(12,l/16) have rank=2. Similarly, all fascicles F(12,l/8), F(12,l/16) and F(10,l/32) have rank=3. From Figure 9, it is easy to see that the rank of F(k,t) corresponds to the length of the shortest path to (k,f) from the root, which is the top pair, e.g. (12,1/32). Also, for any pair of fascicles F \ , F2 assigned the same rank, it is the case that F\ and F2 cannot be compared using >f. Having defined the notion of rank, we can now solve the pattern extraction problem by processing fascicles in ascending order of rank. In our example, we first process (12,1/32) fascicles. Then we process all (11,1/32) fascicles and (12,1/16) fascicles, and so on. This order of processing fascicles is captured in Step 3 of the Multi-Extract algorithm shown in Figure 10. The only remaining issue is to ensure that in determining whether a fascicle F\ is frequent, we only count those records that are not contained in any frequent fascicle F 2 with F2 >f F \ . The lemma below guarantees that with our strategy to process fascicles in ascending order of rank, when it is time to process F \ , we have already processed all the applicable F2's with F 2 >f F \ . We can then simply conduct the set subtraction FI-(KJF>F\ F), and check whether the remaining set meets the minimum size threshold m. This is summarized in the loop of Step 3 in Figure 10. L e m m a 1 Let F be a fascicle of rank j. Then: • For all fascicles G with rank 1" < j, either G >f F, or F and G are incomparable. • For all fascicles G >f F, the rank of G is strictly less than j. Proof Sketch: By a simple induction on rank. One additional optimization is worth mentioning. In computing the subtraction F\-{KJF>F\ F), it is possible to accumulate the component KJF>FI F in ascending order of the rank of F. This allows ^>F>FI F to be computed incrementally, and frees us from keeping around all frequent fascicles obtained in previous iterations of the loop in Step 3. L e m m a 2 [Correctness of M u l t i - E x t r a c t ] Let S be the set of candidate fascicles generated in Steps 1 and 2 of Algorithm Multi-Extract. For any record R, let the set of maximal fascicles containing R output by Step 3 be SR={FI,...,FU}. Then: • There does not exist a pair Fj,Fj such that Fi>fFj, l<u, j<u. • There does not exist a set S ' e S and S ' - Q S ^ such that 5'> s5«. 34 Proof Sketch: The first part is guaranteed by our notion of rank and Lemma 1. Here we only show a sketch for the second part. Suppose there exists such an S'. Then there must exist a pair of (frequent) fascicles F,F\ such that FeS, F&SR, and F>fF\. By the definition of rank, F is assigned a smaller rank than F\. Thus, in Step 3 of Algorithm Multi-Extract, F is processed before F, and is output by Step 3. By set subtraction, R is not contained in F\. Thus, F, is not in SR, which is a contradiction. The lemma above shows that Step 3 of Multi-Extract is correct in producing Fas(R) and maximizing Fas(R) for each record R — based on all the candidate fascicles computed in Steps 1 and 2. But the algorithm as a whole is not exact because in Steps 1 and 2 we do not seek to compute all fascicles. The experimental results shown in chapter 6.2 convincingly demonstrate that computing all fascicles is computationally prohibitive. 7.3 Empirical Results Below we present some results by running the Multi-Extract algorithm on the N H L data set. The results from the larger AT data set were similar, and in fact quite a bit stronger in terms of the (proportional) sizes and quality of the fascicles discovered. However, we are unable to present these results without divulging proprietary information, and so restrict ourselves only to the N H L data set for this chapter. Fascicles (12,1/6) (10,1/12) (8,1/8) Population .10% +12.5% +12.5% Games played [1,82] [1,5] ([1,79]) ([1,82]) Goals Scored [0,50] [0,0] [0,3] [0,3] Assists [0,80] [0,2] [0,5] [0,5] Points [0,130] [0,2] [0,7] [0,7] PlusMinus [-20,40] [-2,1] [-2,1] ([-13,8]) Penalty Min [0,400] [0,18] [0,22] ([0,233]) Power Play Goals [0,20] [0,0] [0,1] [0,1] Short Handed Goals [0,9] [0,0] [0,0] [0,1] Game winning Goals [0,12] [0,0] [0,1] [0,1] Game tying Goals [0,8] [0,0] [0,0] [0,1] Shots [0,400] [0,8] [0,23] [0,25] Percentage of Goals Scored [0,20] [0,0] ([0,5]) ([0,5]) Table 8: Descript ions of Sample Patterns i n the N H L Da ta S e t E a c h co lumn is a fascicle w i t h the attr ibute ranges given. A l l attributes are compact except the ones i n parentheses. Table 8 shows three fascicles obtained in the N H L data set. Each row describes the range of a particular attribute. The range following the name of the attribute in the first column gives the range of the attribute of the entire data set. The first fascicle has all 12 attributes of the data set as compact attributes and a compactness tolerance of 1/16 of the attribute range. In spite of the stringent requirements, 10% of the players are in this fascicle. These are players who have very limited impact on every aspect of the game. The second fascicle accounts for an additional 35 12.5% of the population. Compared with the first fascicle, the additional players here could play in a lot more games and have some variations in scoring percentage. Even though these players played in many more games, they still had little impact on the game. The first two fascicles together indicate that almost 1 out of 4 players did next to nothing. Finally, on top of the first two fascicles, there is an additional 12.5% of players whose main contribution appear to be their penalty minutes, as indicated by the range ([0,233]) of the non-compact attribute. The performance of the Multi-Extract algorithm is very similar to that of the Multi-k algorithm and hence not presented here. In particular, the variation with P, the number of tip sets generated, is exactly as with Multi-k. Since we wish to use a broad range of different values of k and t to obtain fascicles of differing qualities, these no longer remain tuning parameters. Similarly, the size parameter, m, is no longer just an internal parameter as it was for storage minimization. Instead, it establishes a support threshold, so that, by and large, we have far larger values of m for the pattern extraction problem than if storage minimization were our objective. 36 Chapter 8 Conclusions and Future Work Many data sets have (approximately) repeated values for many attributes. We have taken a first step towards identifying and exploiting such repetition in this paper. We have introduced the notion of a fascicle, which is a subset of records in a relation that have (approximately) matching values for some (but not necessarily all) attributes. We have presented a family of algorithms to find fascicles in a given relation. We have evaluated how these algorithms behave under different choices of parameters, and shown how fascicles can be used effectively for the tasks of storage reduction and pattern extraction. Given that optimality is very hard to obtain for both tasks, our heuristic algorithms produce good results. We believe that the compression aspect of fascicles can provide good support of approximate query answering9 in the style championed in [4]. The dimensionality reduction aspect of fascicles can find applications in reducing indexing complexity, reverse engineering of schema decomposition, vocabulary-based document classification, and other operations that are exponential in the number of dimensions (e.g., high dimensional outlier detection [14] or even approximate data cube computation). Finally, as shown in Chapter 7 , the pattern extraction aspect of fascicles can find applications in any data set that has (approximately) repeated values for many attributes. As for future work, (1) we believe that fascicles have opened the door to conducting data reduction (and analysis) not on the coarse granularity of the entire data set, but on the fine granularity of automatically identified subsets. As such, it would be interesting to see how the notion of compactness studied here change to other more sophisticated criteria, such as subsets with strong correlation, subsets amenable to SVD, etc. (2) In this paper, we have assumed a simple lossy compression of the compact attributes of a fascicle. However, it is possible to apply lossless compression on those attributes. For instance, Huffman encoding is particularly effective for similar values, which is exactly the case for compact attributes. If encoding is applied to the compact attributes, it would be interesting to see how to incorporate this aspect into the storage formula and minimization process. (3) We have not considered incremental updates here; this should be an interesting topic to explore. (4) We have assumed that once the fascicles are found, one will be able to engineer the query processing engine to consult multiple relations and to deal with compact attribute values projected out. The complexity of doing all this is not much different from the complexity of using views to answer queries. However, the details need to be worked out to create such a system. 9 For point queries, the maximum error is til in each of the k compact attributes. For range queries, we may modify the end-points of the range by up to til for ranges specified on compact attributes. No error is introduced on ranges specified on non-compact attributes. For aggregate queries, the worst case errors are as above. But the average errors are likely to be much smaller since positive and negative errors tend to cancel each other. 37 References [I] AGGR98 R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. Proc. 1998 SIGMOD , pp 94--105. [2] AIS93 R.Agrawal, T.Imielinski and A.Swami. Mining association rules between sets of items in large databases. Proc. 1993 SIGMOD , pp 207-216. [3] AS94 R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Proc. 1994 V L D B , pp 478-499. [4] D. Barbar a , W. DuMouchel, C. Faloutsos, P. J. Haas, J. M . Hellerstein, Y . Ioannidis, H. V . Jagadish, T. Johnson, R. Ng, V . Poosala, K. A. Ross, K. C. Sevcik. The New Jersey Data Reduction Report. IEEE Data Engineering Bulletin, 20, 4, Dec. 1997. [5] S. Brin, R. Motwani, J. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. 1997 SIGMOD , pp 255-264. [6] M . Ester, H.P. Kriegel, J. Sander and X . Xu. A density-based algorithm for discovering clusters in large spatial databases with noises. Proc. 1996 K D D , pp 226—231. [7] C. Faloutsos and K. Lin. FastMap: a Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. Proc. 1995 ACM-SIGMOD , pp. 163— 174. [8] J.H. Friedman and J.W. Tukey. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers , 23, 9, pp 881-889, 1974. [9] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990. [10] S. Guha, R. Rastogi, K. Shim. CURE: an Efficient Clustering Algorithm for Large Databases. Proc. 1998 ACM-SIGMOD, pp. 73-84. [II] T. Johnson and T. Dasu. Comparing massive high-dimensional data sets. Proc. 1998 K D D , pp 229-233. [12] R. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations , Plenum Press, 1972, pp 85—103. [13] M.Klemettinen, H.Mannila, P.Ronkainen, H.Toivonen, and A.L Verkamo. Finding interesting rules from large sets of discovered association rules. C I K M 94, pp 401—408. [14] E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. Proc. 1998 V L D B , pp 392-403. 38 [15] F. Korn, H.V. Jagadish, C. Faloutsos. Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. Proc. 1997 ACM-SIGMOD , pp. 289--300. [16] H. Mannila and H. Toivonen. Level-Wise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery , 1, 3, pp 241—258. [17] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. Proc. 1994 V L D B , pp 144-155. [18] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. Proc. 1995 SIGMOD , pp 175-186. [19] A.Savasere, E.Omiecinski, and S.Navathe. An efficient algorithm for mining association rules in large databases. Proc. 1995 V L D B , pp 432—444. [20] M . Smyth. Power Domains. Journal of Computer and System Sciences , 16, 1, pp. 23— 36, 1978. [21] H.Toivonen. Sampling large databases for association rules. Proc. 1996 V L D B , pp 134— 145. [22] T. Zhang, R. Ramakrishnan and M . Livny! BIRCH: an efficient data clustering method for very large databases. Proc. 1996 SIGMOD , pp 103-114. 39 Appendix - Results from the NHL dataset In this appendix, we will cover the results obtained by running our algorithm with the N H L dataset. The N H L dataset is very small in comparison to the A T & T dataset (855 tuples versus 800,000+ tuples). Therefore, we will see some variations when comparing these results to the ones obtained from the A T & T dataset. These variations, however, are very minor and the overall results are very similar. First, we will look at the effect of P (the number of tip sets generated) on the N H L dataset. Figure 11 shows how P effect the relative storage for t=l/8, m=8. The curves shown here is very similar in shape to Figure 4(a). We see that we get a very flat curve for the single-k algorithm and a more P sensitive curve for the multi-k algorithm. Effect of P on storage Q) cn c u _ . o 0 .8 0 .7 0 .6 0 .5 0 .4 0 .3 0 .2 0.1 0 . _ _ _ — _ _ _ . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 5 0 0 1 0 0 0 P 1 5 0 0 S i n g l e - k (1 /8 ) Mu l t i - k (1 /8 ) 2 0 0 0 Figure 11: Effect of P on storage for the NHL dataset Next, we look at the effect of P on runtime in Figure 12. We observed that while the runtime for the greedy stage is as expected (i.e. multi-k greedy takes longer to run as P increases due to the weight re-adjustment), the single-k total runtime is a great deal higher. The reason for the large gap in runtime can be a result of the dataset being so small. The single-k algorithm generates one tip set per disk block read while the multWc algorithm produces several tip sets for the same block. As a result, the single-k algorithm performs more disk I/O than the multi-k algorithm. As the block size is small, the entire block may fit into the cache and result in much a faster multi-k execution. 40 Effect of P on runtime Figure 12: Effect o f P on runt ime for the N H L dataset Let us look at the effect of k on the N H L dataset. Effect of k on stroage and coverage 0 2 4 6 8 10 12 k Single-k storage — • — Single-k coverage Multi-k storage —x— Multi-k coverage Figure 13: Effect of k on storage a n d coverage for the N H L dataset 4 1 Figure 13 shows the effect of k on storage and coverage for the N H L dataset. Again, the results are very similar to those obtained from the A T & T dataset. Coverage for both algorithms decreases as k increases. The multi-k algorithm is less sensitive to k as expected. As for storage, the multi-k algorithm again wins at low value of k but loses to the single-k algorithm when k is optimal. Finally, we look at the effect of t and m. Single-k Multi-k kopt storage coverage kopt storage Coverage 1/8 8 0.569 0.657 8 0.644 0.582 1/16 8 0.673 0.461 8 0.776 0.366 1/32 6 0.736 0.518 6 0.811 0.318 Tab le 9: Effect of t on storage a n d coverage for the N H L dataset m Single-k Multi-k kopt Storage kopt Storage 8 9 0.59 8 0.64 80 8 0.65 8 0.65 200 7 0.71 7 0.71 Tab le 10: Effect of m on storage for the N H L dataset Once again, the results obtained from the N H L dataset for t and m are very similar to the results obtained from the AT&T dataset. We see that as t decreases, storage, coverage, and kopt decrease. As for m, it is directly proportional to storage while inversely proportional to k o p t . 42
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Semantic compression and pattern extraction with fascicles
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Semantic compression and pattern extraction with fascicles Madar, Jason Chun-Sing 1999
pdf
Page Metadata
Item Metadata
Title | Semantic compression and pattern extraction with fascicles |
Creator |
Madar, Jason Chun-Sing |
Date Issued | 1999 |
Description | Often many records in a database share similar values for several attributes. If one is able to identify and group these records together that share similar values for some — even if not all — attributes, not only does one have the possibility of a more parsimonious representation of the data, but one may also gain useful insight into the data from an analysis and mining perspective. In this thesis, we introduce the notion of fascicles. A fascicle F(k,t) is a subset of records that have k compact attributes. An attribute A of a collection F of records is compact if the width of the range of A-values (for numeric attributes) or the number of distinct A-values (for categorical attributes) of all the records in F does not exceed t. We introduce and study two problems related to fascicles. First, we consider how to find fascicles such that the total storage of the relation is minimized. Second, we study how best to extract fascicles whose sizes exceed a given minimum threshold and that represent patterns of maximal quality, where quality is measured by the pair (k,t). We develop algorithms to attack both of the above problems. We show that these two problems are very hard to solve optimally. But we demonstrate empirically that good solutions can be obtained using our algorithms. |
Extent | 4388717 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-06-15 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051612 |
URI | http://hdl.handle.net/2429/9180 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1999-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1999-0277.pdf [ 4.19MB ]
- Metadata
- JSON: 831-1.0051612.json
- JSON-LD: 831-1.0051612-ld.json
- RDF/XML (Pretty): 831-1.0051612-rdf.xml
- RDF/JSON: 831-1.0051612-rdf.json
- Turtle: 831-1.0051612-turtle.txt
- N-Triples: 831-1.0051612-rdf-ntriples.txt
- Original Record: 831-1.0051612-source.json
- Full Text
- 831-1.0051612-fulltext.txt
- Citation
- 831-1.0051612.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051612/manifest