Efficient Data Mining of Constrained Association Rules by Chiu Yan (Alex) Pang P h . D . (Physics), North Carolina State University, 1995 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F THE REQUIREMENTS FOR THE DEGREE OF M a s t e r of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The University of British Columbia August 1998 © Chiu Yan (Alex) Pang, 1998 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Computer Science The University of British Columbia Vancouver, Canada V 6 T 1Z4 Date: Abstract W i t h the recent advances in information technology, companies are now collecting more and more data related to their business. Companies are very interested in decision support systems that can discover knowledge from data and help them gain insight into their data. Data mining with the goal of discovering non-trivial information or patterns hidden in large databases has, therefore, recently become one of the most active research areas in database technology. Association rules relate items which tend to occur together in a given event or record. Mining association rules represents one of the most important problems in data mining. However, the current framework suffers seriously from the lack of user interaction and focus. In this thesis, we propose a new paradigm called Constrained Association Rules where (i) the mining of the rules is divided into two phases with various breakpoints for user feedback, and (ii) users can associate constraints with their queries. We analyze many SQL-style constraints and introduce the notions of succinctness and anti-monotonicity for their classification. We design a new algorithm called C A P for mining association rules that satisfy a set of given constraints. The idea is to check for satisfaction of the constraints as early as possible by exploiting the properties of anti-monotonicity and succinctness of the constraints. Several optimization techniques are developed. Our experimental evaluation indicates that C A P runs much faster and can sometimes outrun several basic algorithms by as much as 80 times. ii Contents Abstract ii Contents iii L i s t o f Tables vi L i s t of F i g u r e s viii Acknowledgements ix Dedication 1 2 x Introduction 1 1.1 Data Mining 1 1.2 Association Rules 4 1.3 Motivation of the Thesis 6 1.4 Contributions of our work 7 1.5 Outline of the Thesis 9 Background: Association Rules 10 2.1 Formulation 10 2.2 Apriori Algorithm 12 2.3 Related Works 14 iii 3 Constrained Association Queries 21 3.1 Definitions 21 3.2 Architecture 22 3.3 Syntax and Examples 24 3.4 Classification of Constraints 27 3.4.1 Anti-monotonicity 27 3.4.2 Succinctness 31 4 Basic Algorithms 36 4.1 Apriori 36 4.2 Full Materization 37 4.3 Hybrid(m) 38 + 5 Algorithm C A P 41 5.1 Succinct and Anti-monotone Constraints 42 5.2 Succinct but Non-anti-monotone Constraints 42 5.3 Non-Succinct but Anti-monotone Constraints 44 5.4 Non-Succinct and Non-anti-monotone Constraints 45 5.5 Multiple Constraints 46 5.6 Summary 47 6 Experimental Evaluation 49 6.1 Implementation and Experimental Environment 49 6.2 Succinct and Anti-monotone Constraints 51 6.3 Succinct but Non-anti-monotone Constraints 56 6.4 Non-Succinct but Anti-monotone Constraints 61 6.5 Non-Succinct and Non-anti-monotone Constraints 65 6.6 Summary 69 iv 7 Conclusions 70 7.1 Conclusions 70 7.2 Future Work 72 Bibliography 76 v List of Tables 2.1 Equi-depth vs. distance-based partitioning 16 3.1 Classification of 1-var Constraints 28 6.1 Number of frequent sets that satisfy max(S.'Price < v) /number of frequent sets for different selectivities (v/10) with support = 0.5% 6.2 Number of frequent sets that satisfy max(S.Price) . < v) /number of frequent sets for different support values at 30% selectivity 6.3 55 Number of frequent sets that satisfy {soda} C S'.Type /number of frequent sets for different selectivities with support = 0.5% 6.4 57 Number of frequent sets that satisfy {soda} C S'.Type /number of frequent sets for different support values at 13% selectivity 6.5 Number of frequent sets that satisfy sum(S.Price) /number of frequent sets for different MaxSum 6.6 < 60 MaxSum) with support = 0.5% Number of frequent sets that satisfy sum(S.Price) < 64 MaxSum) I number of frequent sets for different support values with MaxSum = 500 6.7 54 64 Number of frequent sets that satisfy Avg(S.Vvice) number of sets that satisfy mm(S.Price) < AvgPrice frequent sets for different values of AvgPrice vi < AvgPrice / j number of with support = 0.5% . 68 6.8 Number of frequent sets that satisfy Avg(S.Price) < AvgPrice number of sets that satisfy m m ( 5 . P r i c e ) < AvgPrice frequent sets for different support values with AvgPrice 7.1 A n example to illustrate the idea of segment support vii j / number of = 200 . . . 68 73 List of Figures 2.1 The lattice space formed from five items { A , B , C , D , E } 12 2.2 A n example illustrating Apriori Algorithm with five items 14 2.3 A n example of a taxonomy (concept hierarchy) 15 3.1 A n Architecture for Exploratory Association Mining 22 6.1 Speedup vs Selectivity for a Succinct and Anti-monotone Constraint 52 6.2 Speedup vs Support for a Succinct and Anti-monotone Constraint 6.3 Speedup vs Selectivity for a Succinct and Non-anti-monotone Constraint . 53 58 6.4 Speedup vs Support for a Succinct and Non-anti-monotone Constraint 59 6.5 Speedup vs Selectivity for a Non-succinct and Anti-monotone Constraint 6.6 62 Speedup vs Selectivity for a Non-succinct and Anti-monotone Constraint 6.7 63 Speedup vs Selectivity for a Non-succinct and Non-anti-monotone Constraint 6.8 66 Speedup vs Support for a Non-succinct and Non-anti-monotone Constraint 67 vin Acknowledgements I would like to thank my supervisor Dr. Raymond Ng for his invaluable guidance and financial support. He is nice and brilliant. Without him, this work could not have been completed. Special acknowledgments also go to Dr. Laks Lakshmanan and Dr. Jiawei Han who are the coauthors of the paper [NLHP98] on which this thesis is based. I would also like to thank my second reader Dr. George Tsiknis for reading this thesis and for his invaluable suggestions. Finally, I would like to thank all of my friends and other people in the Computer Science department at U B C for having created a supportive and friendly environment which has made my study here very memorable. CHIU Y A N (ALEX) PANG The University of British Columbia August 1998 ix To my family x Chapter 1 Introduction 1.1 Data Mining What is data mining ? Over the past 15 to 20 years, computers have been used to capture detailed transaction information in a variety of corporate enterprises. Some of the examples of transaction-intensive industries are retail sales, banking, telecommunications, and credit card operations. W i t h the availability of powerful and affordable computer systems, many corporations have created huge repositories of data related to their business. These collected data represent an invaluable source of information because the data implicitly contain knowledge about the behavior of the customers. The knowledge in turn can play a crucial role in the corporation's survival in today's competitive market. Consequently, new techniques and tools which can intelligently and automatically transform the processed data into useful information and knowledge has become highly relevant. Data mining, which is also referred to as knowledge discovery from databases, can be denned as an automatic process of discovering non-trivial, previously un1 known, and potentially useful information from very large datasets. In the past few years, it has not only become one of the most active research areas in databases, but has also attracted increasing industrial attention. A leading industrial analyst [Meta Group] has projected the data mining market to grow from $3.3 billion in 1996 to $8.4 billion by the year 2000. There are many applications of data mining. For example, discovered knowledge from data mining has been found to be relevant in direct marketing, decision making, fraud identification, and process control. Data mining models Discovered knowledge means high-level information such as regularities, rules, patterns, trends, etc. Many interesting models for discovered knowledge have been identified by researchers. These include associations [AS94], sequential patterns [AS95], classification [MAR96], clustering [NH94], outliers [KN97], and temporal patterns [ALSS95]. Association models focus on items that occur together in a given event or record. A typical rule discovered in an association based data mining takes the form: If item X is part of an event, then c% of the time (the so-called confidence factor), item Y is also part of the event. A famous example is the rule "98% of the people who buy diapers also buy beer ". Another example is "85% of customers that purchase tires and auto accessories also get automotive services ". Association rule analysis has gained increasing interest with the widespread use of checkout scanners, which let retailers gather transaction details. This is why market-basket analysis has been the most well-known application of association rule analysis. This whole thesis is, in fact, on the problem of finding association relationships. A more precise definition of the model will be given in the following section. Sequential patterns are similar to association models, except that the relationships among items are spread over time [AS95]. In fact, association mining 2 techniques can be applied to mining sequential patterns by treating sequences as associations in which the events are linked by time. The motivation behind sequential patterns is the elapsed time between transactions or the duration of an event in an association can be crucial. In classification [MAR96], it is assumed that the value of a categorical variable can be assigned to any cases. The categorical variable is used as a classification label. A number of cases for which the classification label is known are employed as training examples. A classification algorithm then attempts to find a predictive pattern that can classify other cases. Clustering models segment a database into different groups whose members are very similar [NH94]. However, unlike classification, one does not know a priori what the clusters will be, or on what attributes the data will be clustered. Consequently, the resulting clusters could reveal previously unknown facts about the data. Interesting applications of both classification and clustering are in direct marketing. For example, consider a company doing promotion with a mailing list. If the company can classify the customers on the list into categories such as "likely response" or "unlikely response" based on historical patterns, the company can then tailor its marketing plan and target only the customers that are more likely to respond. This can significantly reduce the cost and increase the rate of return of the promotion. Another pattern that is gaining increasing popularity is outlier detection [KN97]. While most of the models mentioned above deal with the trends of the majority, outlier detection attempts to identify the exceptional cases (i.e. the minority). For example, in the credit card business, an unusual spending pattern within a short period of time often indicates stolen card usage. Other applications include fraud detection in production processes. Finally, time is often an important attribute of a dataset. A significant example is in financial time series [ALSS95]. It is certainly a man's dream come 3 true if he can predict the movement of a stock price. Given a time series database, the problem of finding all the time series which have similar behavior is therefore of great interest. Similarity and clustering are some of the typical questions asked in data mining of temporal patterns. 1.2 Association Rules Association relationships or association rules represent one of the most popular patterns pursued in data mining processes. The basic idea of an association rule is to capture the sets of items that tend to appear together. A n example is the rule "80% of the people who buy butter also buy milk". There can be many applications of the discovered rules. To use the rule in the above example, the manager in a supermarket will make sure that butter and milk will not be on sale at the same time. The idea of association rules is introduced by Agrawal, Imielinski and Swami in the early 90's [AIS93] as part of the IBM's data mining research program [QUEST]. The motivation is to understand customer behavior by examining transactional databases. Each transaction in the database contains items that are purchased together. B y counting the frequency of the items appearing in the database, one can identify the sets of items that tend to be purchased together. To be more precise, an association rule is defined as an implication of the form X =>• Y, where X and Y are sets of items and the set X U Y appears frequently enough in the transactional database in the sense that at least s% of the transactions contain X UY, and at least c% of the transactions that contain X also contain Y. Here, 5 and c are called the support threshold and the confidence respectively. In an association query, a user simply supplies specific values for the support threshold and the confidence as input. A mining engine then finds all the itemsets 4 X, Y such that X =>• Y satisfies the above definition of an association rule. The output of the query is a list of all such association rules found. Without prior knowledge of the data, a user usually has no idea what the appropriate values for the support threshold or the confidence should be. Runs with supplied values might discover no rules or thousands of rules. This is one of the problems with the above framework of association rules. Since its introduction [AIS93], the problem of mining association rules has been the subject of numerous studies. Issues discussed include extending the association rules framework [HF95, SA95, SA96, F M M T 9 6 , MY97, T U A C M N 9 7 , BMS97, SVA97], applying the association framework to solve other problems [AS95, AMS97], improving the efficiency of the mining algorithms [PCY95, B M U T 9 7 , CHNW96], and parallel implementation of the Apriori Algorithm [AS94, HKK97]. In particular, Agrawal and Srikant in 1994 [AS94] proposed a very efficient algorithm called Apriori for mining association rules. We will explain in detail the Apriori Algorithm and review some of the more important related works in the following chapter. While all of the above mentioned work has enriched the field of association mining, none of them address adequately the question of the interestingness of the discovered rules. There may be thousands of rules found, but only a small portion of them are interesting to the user. Only Srikant et al [SVA97] discuss the issue of putting constraints on the frequent sets, which is one of the suggestions of this thesis. However, Srikant et al [SVA97] only consider membership constraints in the context of an item taxonomy, which correspond to only a small subclass of the categories of constraints considered in this thesis. 5 1.3 Motivation of the Thesis While the notion of association rules with the corresponding Apriori Algorithm discussed in the previous section represents a significant development in data mining, it suffers seriously from the following problems. P r o b l e m 1 — L a c k o f U s e r E x p l o r a t i o n a n d C o n t r o l : In a classical association query, a user supplies the support threshold and the confidence. Then, Apriori or a similar algorithm returns with all the association rules it finds. The whole mining process behaves like a black-box where the user has very limited control of the process except for supplying the two threshold values. But the supplied threshold values might not be appropriate. It is more desirable if the mining process can support user exploration on the data. However, even with the development of many efficient algorithms, state-of-the-art association mining nowaday still requires hours to complete. Therefore, without much control over the mining process and the relatively long turnaround time, the classical model of mining association rules cannot support efficient user exploration. Furthermore, the results returned by the black-box may not contain what the user is looking for. P r o b l e m 2 - L a c k of Focus: The second problem is related to the fact that the ultimate goal of a data mining user is to support his or her business decision making. The user probably has some specific questions he or she would like to answer and may therefore be much more interested in only certain types of association patterns. For example, a user may want to find associations between sets of items whose origins are domestic, or associations from itemsets of male products to itemsets of female products. However, the classical model does not support any of these expressions of focus or preference. P r o b l e m 3 - R i g i d N o t i o n o f R e l a t i o n s h i p : The third problem is that the classical notion of association relationship is too rigid. Two sets of items are "as- 6 sociated" only if they appear together frequently enough and the rule confidence exceeds the confidence threshold. While such a notion of associations is useful, there can be other types of association that are relevant as well. For example, correlation is often used in statistics. Brin, Motwani, and Silverstein [BMS97] argue that in many circumstances correlation can be more useful than confidence. For example, the rule "past active duty in military no service in Vietnam " has a very high confidence of 0.9 in census data. Yet it is quite misleading because having past military service only increases the chances of having served in Vietnam. Furthermore, sometimes it may make more sense to have different support threshold for the antecedent sets and consequent sets, especially when they are from different domains. The rule pepsi => snacks is an example of associations from sets of items to sets of types. In such situation, the appropriate support threshold values for pepsi and snacks (i.e. any item of type snacks) can be very different. These shortcomings in the current framework of the classic Association Rules form the motivation of our work. To overcome these problems, we suggest several principles which will be given in the following section. 1.4 Contributions of our work We summarize our contributions in this section. First of all, we suggest several principles of association mining which can be used to address the problems given in the last section. The principles suggested are: 1. The mining process should not behave like a black-box with one-time user supplied input parameters and final results at the end. Instead, there should be breakpoints in the process for accepting user feedback. 2. W i t h the user feedback mechanism, a user should not only be able to guide and control the mining process, but also have the chance to approve any task 7 that involves a substantial cost. 3. The mining system should have a mechanism to allow a user to express his or her focus or preferences. 4. W i t h the user preferences specified, the system should have the ability to adjust its mining algorithms so that it performs only the necessary computation and nothing more. 5. The system should have a mechanism to allow a user to choose different significance metrics and criteria to be used in defining an association relationship. To realize these principles, we have designed a two phase architecture (Figure 3.1 ) for exploratory association mining. We will discuss the architecture in more detail in section 3.2. Here, we emphasize that the architecture is new, but downward compatible in the sense that if a user wants only classical associations and the classical mode of interaction, he or she can do so by setting all the parameters at the beginning and turning off all the breakpoints. Moreover, it is consistent with the above five principles, and is powerful in providing human-centered exploration. The second major part of our contributions is the introduction of the notion of constrained association queries (CAQ) to be discussed in Chapter 3. Along with the proposed architecture, C A Q provides a rich interface for the user to express focus and to control the mining process to best suit the particular interests of the user. In addition, we design and develop an efficient algorithm which we call C A P (Constrained Apriori) to make use of the additional pruning power provided by the constraints. To answer a constrained association query, one can imagine a few algorithms, for example, running Apriori followed by constraints checking at the end. We have compared C A P with a few of these simpler algorithms. Our experimental results indicate that C A P can sometimes outperform other algorithms by as much as 80 times ! 8 In our investigation, we discover that one can classify all the constraints according to two properties, namely, succinctness and anti-monotonicity. These properties turn out to be extremely important in optimizing the mining algorithm. While the anti-monotonicity property may be known in literature (in fact Apriori is based on the anti-monotonicity of the frequency constraint), the classification of constraints based on succinctness and anti-monotonicity is fundamentally new and is one of our major contributions. 1.5 Outline of the Thesis The thesis is organized as follows. In the next chapter, we describe the formulation of the problem of mining association rules and present the Apriori Algorithm. We also review in detail some of the important related works. In Chapter 3, we present constrained association queries as a new paradigm in mining association rules. We then discuss some of the basic algorithms that one can think of for these new type of queries in Chapter 4. Chapter 5 describes an optimized algorithm which we refer to as C A P . Chapters 4 and 5 are the main focus of this thesis. Performance of various algorithms are then compared in Chapter 6. Chapter 7 presents yet another optimization technique that can be used not only in C A P , but also in Apriori. Finally, we draw our conclusions in Chapter 8 along with a discussion on some possible directions for future work. 9 Chapter 2 Background: Association Rules 2.1 Formulation In this section, we present a mathematical formulation of the association rule problem. We also establish our notation used for the rest of the thesis. The starting point is to assume that there is a finite (but very large) set of items which we denote by Item . The strict power set of Item is denoted by 'P(Item) or 2 I t e m . A transaction T is simply any subset of 2 I t e m . A transaction database, denoted by VB, is a set of transactions. Definition 2.1 Given any set S C 'P(Item) and a transaction database, the support or the frequency of S is the number of transactions containing S. We often refer to S as an itemset. Definition 2.2 A n itemset S is said to be large or frequent if the support of S is larger than a given threshold value. Whether an itemset is large or not depend on the threshold value. We refer to this 10 threshold value as the support threshold. Moreover, we often refer to itemsets of size k as k-itemsets. We also denote the set of all large fc-itemsets by L)~. We are now ready to define an association rule. Definition 2.3 Given a support threshold s and another input parameter c referred to as confidence, an association rule is a rule of the form X Y ii X UY is a. large itemset, and among all the transactions that contain X, at least c % of them also contain Y. X and Y are usually referred to as the antecedent and the consequent set respectively. The problem of association rules is the problem of finding all the possible association rules given a support threshold and a confidence. The motivation of association mining and some of the examples are given in Section 1.2. Here, we note that finding association rules can be divided into two phases. The first one is to find all the large itemsets, i.e. potential candidates for X UY. The second phase then constructs all possible association rules of the form X =>• Y from each large item set X U Y by checking whether the support ratio of XUY over X exceeds the confidence threshold. For example, if the set {A, B, C} is found to be large, then, in the second phase, we check whether each one of the rules {A,B} {C}, {A,C} => {B}, {B,C} => {A}, {A} and {C} =>• {A, B} satisfies the confidence requirement. {B,C}, {B} => {A, C), Many previous studies have shown [AIS93, AS94] that the first phase is the bottleneck of the computation. Therefore, most of the literature on mining association rules focuses on the first phase of finding all frequent itemsets. In the following discussion, our main concern is also on the problem of finding large itemsets. Association mining, in principle, requires consideration of all possible combination of the items. Figure 2.1 shows all the possible itemsets that can be formed from five items arranged in a lattice. The lines in Figure 2.1 connect a set to all its 11 {A,B,C,D,E} (A,B,C,D) {A,B,C} {A,B,D} {A,B} {A,C} {A,B,C,E} {A,B,D,E} {A,C,D,E} {B,C,D,E} {A,B,E} {A,C,D} {A,C,E} {A,D,E} {B,C,D} {B,C,E} {A,D} {A,E} {B,C} (B,D) (B,E} {CD} (A) {B} {C} {D} {B,D,E} {C,E} {C,D,E} {D,E} {E} Figure 2.1: The lattice space formed from five items { A , B , C , D , E } supersets and subsets. The number of sets in this lattice space increases exponentially with the number of items. For practical applications, the number of items is at least of the order of several hundreds. Therefore, association mining is a challenging problem and any naive or brute-force approach will certainly fail. To facilitate our discussion, we hereafter refer to the association rules discussed in this section as the classical association rules to distinguish from other extensions. 2.2 Apriori Algorithm Agrawal and Srikant [AS94] proposed an efficient algorithm called Apriori listed below for finding all the large itemsets. Algorithm 2.1 (Apriori) 1 C\ consists of all itemsets of size 1; k — 1; Ans = 0; 2 while (Ck not empty) { 2.1 conduct database scan to form 12 from Ck', 2.2 3 form Ck+i from Lf~ based on Cf ; k + +;} req for each set S in some L^: add S to Ans. Algorithm Apriori follows a level-wise generate-and-test framework. In Step 1, it first generates a list of size one candidate sets, denoted by C\. Then it counts the support of each candidate in the list to find all the size one large itemsets, L\. From L\, the algorithm constructs a list of size two candidate sets, C2. The process is then repeated until there is no more candidates, i.e. until Ck is empty (see Step 2). Finally, each large itemset is added to the Ans, the output of the algorithm. The performance of such generate-and-test algorithms is usually very poor because the number of candidates increases exponentially with the number of items. The contribution of Agrawal and Srikant [AS94] is that when they generate C^+i from Lk in Step 2, they make use of an important property of the support of an itemset: the support of an itemset cannot be larger than the support of any of its subset. To prove such a property, one just needs to realize that if a transaction supports (i.e. contains) an itemset S, it supports every subset of S. Thus, a size k + l itemset cannot be a large itemset (i.e. cannot be in Ck+i) unless all its size k subsets are large (i.e. in L^)- Therefore, Agrawal and Srikant can a apriori drop a size k + l candidate set if one of its size k subsets are not in L^. This greatly reduces the number of size k + l candidate sets, which in turn further reduces the number of size k + 2 candidate sets, etc. Thus their Apriori Algorithm can be very efficient even for large numbers of items and for large databases. Example 2.1 We illustrate the Apriori Algorithm in Figure 2.2 with a simple example. We assume that there is only five items A,B,C,D,E and the support threshold is 10. Each set in Ck+i(k = 1,2) is constructed by taking the union of two sets in that differ by only one item. The algorithm then checks for all the 13 c l Support (A) count support (B} (C) c Support (D) (E) Step 2.1 Step 2.2 2 {A) 60 (B) 58 |C) 8 (D) 70 (D) (B,E) (E) 50 (E) (D,E| —^ (A) {Bl —— Support (A,B) (A,D) (A,E| (B,D) count support C 2 Support C Step 2.1 (A,B) (A,D) 30 (A,E) 34 (A,D) (B,D) 9 (A,E) (B,E) (D,E) 29 28 7 {A,B( 3 Support Step 2.2 (A,B,E) 12 |B,E) Figure 2.2: A n example illustrating Apriori Algorithm with five items size k subsets. If any one of the subsets is not in Lk, the set is pruned away from Ck+\- In our example, the set {A, B,D} is pruned away from C3 because the set {B,D} is not in L2. O n the other hand, {A, B,E} is not because all the size two subsets, {A,B},{A,E}, 2.3 and {B,E}, are in Li- Related Works As we mentioned i n Chapter 1, there has been numerous studies on mining association rules. In this section, we first review some of the works that are relatively more important in the development of association mining. These works roughly fall into three categories. The first one aims at generalization of the notion of classical association rules. In other words, the focus is on improving the effectiveness of the association rules. The goal of the second category is more on improving the effi- 14 Food Product Dairy Milk l%Milk 2% Milk Large Small Orange Juice Apple Juice Coke Pepsi Bud Figure 2.3: A n example of a taxonomy (concept hierarchy) ciency of the mining algorithms. The third category simply covers all the works that do not fall into the first two categories. A n example of the third category would be solving other problems with the association concept. Generalization of Association Framework Among the first attempt to generalize association rules is the work by Han and Fu [HF95]. Their idea is that while an association rule such as "80% of customers that purchase milk may also purchase beer " is interesting, it could be more informative to also show that "75% of people buy Budweiser if they buy 2% milk ". The association relationship in the latter statement is expressed at a lower concept level where more specific and concrete information is provided. Conversely, sometimes it may be more desirable to have associations at a higher concept level. A n example is that the rule "90% of people who live on 123 East 2nd Street fly at least once a year" is obviously less interesting to a travel agent than the rule "70% of people living in Vancouver fly at least once a year". Han and Fu suggested that association mining should be done at multiple concept levels. They call the different concept levels a concept hierarchy. Srikant and Agrawal [SA95] referred to concept hierarchies 15 Miller Salary 18K 30K 31K 80K 81K 82K Equi-depth Interval Distance-based Interval 18K, 18K 18K,30K 30K,31K 31K,80K 80K.82K 81K, 82K Table 2.1: Equi-depth vs. distance-based partitioning as taxonomies and had studied essentially the same problem as Han and Fu. A taxonomy (or concept hierarchy) is a is-a hierarchy on a set of items with a more general description of the items at the higher levels of the hierarchy. A n example of a taxonomy {i.e. a concept hierarchy) is shown in Figure 2.3. This taxonomy says that 1% milk is-a milk, 2% milk is-a milk, milk is-a Dairy, Dairy is-a Food Product, etc. The problem of multiple-level association rules is to find association rules that span different levels of the taxonomy. Both Han & Fu and Srikant & Agrawal have designed efficient algorithms that can generate association rules at multiple levels. Another group of studies is on extending association rules to more general type of attributes. tributes. So far, association mining had been focused on categorical at- Srikant and Agrawal in 1996 [SA96] introduced the problem of mining association rules with quantitative attributes. A n example of an association rule containing both quantitative and categorical attributes might be "10% of married people between age 50 and 60 have at least 2 cars". They deal with quantitative attributes by partitioning the values of the attribute and then combining adjacent partitions as necessary. In other words, they converted quantitative attributes to categorical attributes by creating finite number of partitions for the values of the numeric attributes. A n important question is what should be the number of intervals and what the criteria in constructing a "good" interval should be. Srikant and Agrawal [SA96] used an equi-depth method where the intervals are determined by 16 their relative ordering and their support. Here the depth of a partition means the support of the partition. For a depth d, the first d values (in order) are placed in one interval, the next d in a second interval, etc. However, Miller and Yang [MY97] pointed out that equi-depth partitioning may not work well for skewed data, and may not be semantically suitable under certain circumstances. Instead, they proposed a distance-based partitioning where the distance between items are taken into account. As an example, Table 2.3 shows the different partitions obtained by the above two methods on a Salary attribute. Equi-depth partitioning results in three partitions, 18K,30K, 31K,80K, and 81K,82K, while distance-based method gives 18K,18K, 30K,31K, and 80K,82K. Miller and Yang [MY97] argued that distancebased partitions are more consistent with our intuitive understanding of the data and intervals that include close data values (such as 81K,82K) are more meaningful than intervals involving distant values (such as 31K, 80K). In our example, a rule involving the interval 31K, 80K will be of less interest than rules involving the interval 30K,31K. Finally, Fukuda et al [FMMT96] considered various performance issues in mining association rules with numeric attributes and designed fast algorithms that borrow techniques from computational geometry. Other studies that extend the classical association relationship include the work by Brin, Motwani, and Silverstein [BMS97] where the authors suggested correlation as a significant metric of an association, instead of the confidence. Their motivation is to allow associations to generalize beyond market baskets data. This is because under more general settings (i.e. , non basket data such as census data), the classical association rules is only one of the many types of recurring patterns that could or should be identified as "associations". Correlation is among the most useful information regarding two variables. So far, all of the mentioned works assume that the database is fixed and the focus is on mining different types of association rules. However, in practical application of database mining, another problem we need to address is how to update, 17 maintain and manage the rules discovered. Whenever the database is updated, new association rules may be introduced while some existing ones may be invalidated. Therefore, efficient maintenance of discovered association rules is a non-trivial problem. Cheung et al [CHNW96] study this problem and have developed an incremental updating technique. Improving Efficiency of Mining Algorithms Apart from extending the association rule framework, many researchers aim to improve the efficiency of the mining algorithm. Park, Chen, and Y u [PCY95] proposed another approach totally different from the Apriori. Their algorithm is referred to as D H P (Direct Hashing and Pruning). Their observation is that the initial candidate set generation, especially for the large size two itemsets, dominates the total execution cost. This can be explained by the reason that unless the support threshold is very selective, L\ is usually very large, which in turn results in a huge number of candidate sets in C2 (as j C j = ). The step of determining L2 from C2 by 2 V 2 ) scanning the whole database and testing each transaction against C2 is hence very expensive. The idea of the D H P Algorithm is to generate a much smaller C2 by using a hashing technique to filter out unnecessary itemsets. When the support of candidate fc-itemsets is counted by scanning the database, D H P accumulates information about candidate (A; + l)-itemsets in advance in such a way that all possible (k + l)-itemsets of each transaction are hashed to a hash table. This also allows D H P to trim progressively the size of the transaction database which can reduce the processing time in later iterations. Their experimental evaluation shows that D H P is about four times faster than Apriori. A major bottleneck in computing frequent sets is in counting the support of candidate sets. At each level k, a database scan is required to find the support of the sets in C . Because of the large database size, the I / O cost for each scan can fc 18 be huge. Therefore, it will be beneficial if one can reduce the number of database scan. Brin et al [BMUT97] proposed a dynamic itemsets counting algorithm (DIC) to reduce the number of database scan. The idea is to start counting (k + l)-itemsets before finishing counting fc-itemsets. In one of their examples, Brin et al show that the number of database scan required by D I C is 1.5 passes while that of Apriori is 3. However, the implementation of the D I C Algorithm requires keeping track of many itemsets (in memory). As the number of all possible itemsets increases exponentially with the number of items, this requirement may pose serious limitations on the algorithm. To speed up the mining process, another major approach is to use parallel algorithms. Both Agrawal and Shafer [AS96] and Han, Karypis, and Kumar [HKK97] have designed parallel algorithms for mining association rules and have studied various performance issues. Their idea is that the Apriori Algorithm can be divided into several sub-problems such as counting support, candidate generation, and rule generation. While these sub-problems depend on each other, the algorithms that solve these sub-problems may be executed in parallel. For example, rule generation from Lfc can be executed independently from counting support of for k' > k. Interesting applications of the association mining framework include the work by Agrawal and Srikant [AS95], where association mining is applied for mining sequential patterns, and the investigation by A l i , Manganaris, and Srikant [AMS97] where it is used for partial classification. Summary While all of the above works represent important extensions, improvements and/or applications of the classical association rules [AS94], none of them has solved the problems presented in Section 1.3. In particular, no current framework addresses adequately the question of "interestingness" of the discovered rules. Different users 19 have different interests and different needs. Support threshold and confidence are two very rough and generic quality (or "interestingness") measures. They may not capture what a user wants. Moreover, they may lead to thousands of rules or no rules at all. As far as we are aware of, no one has talked about putting constraints on the frequent sets to capture the "interestingness" of rules. The only exception is the recent work by Srikant, Vu, and Agrawal [SVA97] which considers only membership constraints on an item taxonomy. Nevertheless, their constraints represent only a small subclass of the types of constraints we study in this thesis, which include domain and SQL-style aggregation constraints. Furthermore, our focus is on the analysis of the pruning properties of the constraints and our Algorithm C A P can handle a much broader class of constraints with significant pruning power than the algorithms given in [SVA97]. Meo, Psaila, and Ceri [MPC96] consider a language for mining association rules with conditions. However, they do not consider pruning optimizations provided by the conditions. Tsur et al [TUACMN97] attempt to generalize association queries to parameterized queries with filters (conditions applied to the result of a query), which they call "query flocks". However, the filter is confined to lower bound constraints on the number of tuples returned by a query. Moreover, they do not have the general notion of constraints nor have they classified the constraints or studied the role of the constraints in optimization. 20 Chapter 3 Constrained Association Queries In this chapter, we introduce the notion of Constrained Association Queries ( C A Q ) to address the several problems of the classical association queries discussed in the Chapter 1. Our goal is to let a user specify constraints expressible in SQL-like languages and to enable the user to efficiently carry out exploratory and ad hoc association data mining activities. Our design is based on the five principles presented in section 1.4 3.1 Definitions D e f i n i t i o n 3.1 A constrained association query (CAQ) is a query of the form {(•Sii S2)\C}, where S\,S2 are set variables and C is a conjunction of a set of constraints on S i , S^. The above definition is fairly general. In particular, we do not have the notion of antecedent and consequent set in the definition, although typically S\(S2) represents the antecedent (consequent respectively) set in an association relationship. The reason is that for this generalization such identification is not required in 21 refinement of metric, metric threshold, type of relationships, candidate sets Initial constrained association query Phase 1: Finding Constrained Frequent Sets selection of Phase 2: Computing Relationships and their Significance metric, threshold, type of relationships, candidate sets refinement of constraints, support threshold PHASE I PHASE II Figure 3.1: A n Architecture for Exploratory Association Mining finding all large itemsets. The basic requirement of the support of Si being larger than some given threshold can be thought of as a constraint. We embed this frequency constraint, written as freq(Si), in the constraint C. Before we elaborate on the above definition, we first present an architecture for processing C A Q s . 3.2 Architecture Our proposed architecture, which is shown in Figure 3.1, is divided into two phases. In phase I, the system processes constrained association queries and outputs all the large itemsets satisfying the set of constraints C specified in the queries. As mentioned in the last section, each constraint in C may be applicable to the antecedent, or the consequent, or both and C also includes the frequency constraint. Upon seeing the output of the query, which is in the form of a list of pairs of candidates (S , S ), a c for the antecedent and consequent satisfying C, the user can either modify his or her query by adding, deleting, or refining the constraints etc, or change the support 22 thresholds. Then, the system accepts and processes the refined query. The process is repeated as many times as the user desires. When the user is satisfied with the list of large itemsets found, the user can instruct the system to proceed to Phase II. The main function of the system in Phase II is to let the user specify the significance metric, the corresponding metric threshold, and any other further conditions that may be imposed on the antecedent and consequent. For example, as in classical association queries, a user could choose the confidence as the significance metric, specify the confidence threshold and require that (S , S ) be frequent. Similar to Phase I, Phase II is also iterative. Upon seeing a c the final association relationship, the user can modify his or her specification of the metric, and/or thresholds values etc. The system will then process the refined query starting from the beginning of Phase II or even Phase I if necessary. One important consequence of allowing user feedback at various breakpoints in the architecture is that the user can have the control and final approval in authorizing costly operations. For instance, one possible significance metric is correlation between the antecedent and consequent [BMS97]. Correlation is an expensive computation. It will be a waste of C P U time if the user can find out only at the end that the final answers of the query involve sets of items that are not of interest to the user. Another important feature of the proposed architecture is that it is downward compatible. In other words, the system can answer classical association queries and support the classical mode of interaction. To do this, a user would simply set all the appropriate parameters at the beginning and turn off all the breakpoints so that the system will not prompt for feedback. Since a classical association query is a special case of a constrained association query, our formalism (i.e. the notion of C A Q and the proposed architecture) is a true generalization of the classical association query. But, of course, the real power of the generalized framework lies in the fact that 23 it addresses the problems of the classical association rules listed in section 1.3 by supporting efficient human-centered exploration for association mining. 3.3 Syntax and Examples In Section 3.1, we briefly describe what is a C A Q . Here, we discuss it in more details and formally introduce the syntax used in C A Q . We also present some examples of CAQ. In definition 3.1, C represents a set of constraints, where the constraints can be divided into two classes. A single variable constraint (1-var) is a constraint containing only one set variable. It is used in conditioning the antecedent and/or consequent separately. O n the other hand, a two variable constraint (2-var) is a constraint with two set variables and is useful in expressing joint conditions on both the antedecent and the consequent. To be more precise, C is a conjunction of constraints on S\,S2 drawn from the following classes of constraints. 1. Single Variable Constraints: A single variable (1 — var) constraint is of one of the following forms. (a) Class Constraint: It is of the form S C A, where S is a set variable and A is an attribute. It says S is a set of values from the domain of attribute A. (b) Domain Constraint: It is of one of the following forms. i. S0v, where S is a set variable, v is a constant from the domain that S comes from, and 6 is one of the boolean operators =,7^, < , < , > , > • It says that every element of S stands in relationship 9 with the constant value v. 24 ii. v9S, where S, v are as above, and 8 is one of the boolean operators €, This simply says the element v belongs to (or not) the set S. iii. V9S, or S9V, where S is a set variable, V is a set of constants from the domain S ranges over, and 9 is one o f C , g , c , c / , = , ^ . (c) Aggregate Constraint: It is of the form agg(S)8v, where agg is one of the aggregate functions min, max, sum, count, avg, and 9 is one of the boolean operators =, 7^, <, <, >, >. It says that the aggregate of the set of numeric values in S stands in relationship 9 to v. 2. Two Variable Constraints: A two variable constraint (2 — var) is of one of the following forms. (a) S\9S2, where Si is a set variable and 9 is one of C , c , <£. (b) (Si o S2)9V, where Si, S2 are set variables, V is a set of constants or 0, o is one of U,f1, and 9 is one of =, 7^, C , (c) aggi(Si)9agg2{S2), C , c/. where aggi,agg2 are aggregate functions, and 9 is one of the boolean operators =, 7^, < , < , > , > . We next illustrate the constraint syntax with examples. First of all, we need to specify the view on which we do the mining. We will refer to this as the minable view. It can be simply thought of as a set of relations. For example, t r a n s (TID, I t e m s e t ) can be the relation to represent the transaction. Furthermore, as an example here, we assume that we are interested in the type, price and origin of the items. Therefore, a minable view could be the two relations, t r a n s (TID, I t e m s e t ) , i t e m l n f o ( I t e m , Type , P r i c e , O r i g i n ). Here I t e m represents the set of all possible items and S C I t e m would mean S is a set variable in the I t e m domain. We now consider some 1-var constraint examples. S . P r i c e > 50 says all items in S are of price greater than or equal to $50; S.Type 3 {sodas} insists S include some items whose type is sodas; S . O r i g i n = Canada means all the items are made in Canada. 25 More complicated example can be like S.Typefl {dairy} = 0 A sum(S.Price) < 100, which says S is the set of items that do not include any dairy product and the total price is less than or equal to $100. 2-var constraints can be constructed in a similar manner. Si .Type fl S2-Type = 0 says Si and S For example, have no same type items in common; 2 avg(S\ . P r i c e ) < avg(S2 . P r i c e ) insists that the average price of Si has to be less than the average price of S2. Below we give examples of complete C A Q s . The C A Q {(Si,S )|Si C I t e m & S C I t e m & count{Si) = 1 & count(S ) = 1 & freq(Si) 2 2 2 & asks for all pairs of single items satisfying frequency constraints. Since the body of any C A Q , C, invariably contains the frequency constraints and the domain constraints Si C I t e m & S2 C Item, we hereafter suppress them and simply note that they are implicitly contained in the body of any C A Q . The C A Q {(Si, S )|auc/(Si.Price) < 100 & avg{S -Price) 2 2 < 200} asks for pairs of itemsets, where the average price of Si has to be less than or equal to $100 while that of S has to be less than or equal to $200. This is an example of 2 using 1-var constraints. On the other hand, the C A Q {(Si, S2)\max(Si.Price) < min(S2-Price)} represents an example with 2-var constraints, where the query asks for pairs of sets of cheaper items and sets of more expensive items. We can construct more complicated example by adding conjuncting constraints. The C A Q {(Si, S2)|Si.Type = magazines & S i . O r i g i n = Canada & S .Type = toys 2 & S . O r i g i n ^ Canada & max{S\.Price) 2 < min(S -Price)} 2 finds pairs of sets of cheaper Canadian magazine items and sets of more expensive imported toys. Apart from the I t e m domain, we can also ask for items from 26 freq(S )} 2 another domain. For example, the C A Q {(T T )\T U 2 X C Type & T C Type} 2 finds pairs of sets of types (corresponding to items bought together). Similarly, we can also ask for items from different domains: {(51,52)151 C Item & S i . O r i g i n = Canada & 5 C O r i g i n & Canada £ 5 }, 2 2 which asks for sets of domestic items and non-Canadian origins. 3.4 3.4.1 Classification of Constraints Anti-monotonicity In this section, we identify two properties of constraints that will be shown to be important when we consider performance optimization. The first property is antimonotonicity. The motivation behind it is from the observation that the success of the Apriori Algorithm for classical association mining is based on the fact that if a set violates the frequency constraint, then some of its supersets will also violate the frequency constraint. We can generalize this property to other constraints and therefore have the following definition. Definition 3.2 (Anti-monotonicity) A 1-var constraint C is anti-monotone if and only if for any set S, S does not satisfy C V S ' 2 5, 5' does not satisfy C. The power of the property lies in the fact that if a set is found to violate the constraints, all its supersets can be pruned away from further consideration. Thus, the number of candidate sets can be largely reduced. As we said above, this is the 27 1-var Constraint Anti-Monotone Succinct yes no no yes partly no yes partly yes no partly yes no partly yes no partly no (yes) yes yes yes yes yes yes yes yes yes yes yes weakly weakly weakly no no no no (no) S0v,0e {=,<,>} v GS scv s = v min(S) < v min(S) > v min(S) = v max(S) < v max(S) > v max(S) = v count(S) < v count(S) > v count(S) = v sum(S) < v sum(S) > v sum(S) = v avg{S)0v,9 e {=,<,>} (frequency constraint) Table 3.1: Classification of 1-var Constraints reason why Apriori is efficient. If we can identify which classes of constraints satisfy the anti-monotone property, then we can incorporate the constraints along with the frequency constraint to achieve the same pruning power of the Apriori Algorithm. We have analyzed different classes of 1-var constraints and have classified them according to whether they are anti-monotone. We summarize our findings in Table 3.1. The second column identifies which constraints are anti-monotone. The third column does the same for succinctness, which will be discussed in the next section. The first group of constraints is the domain constraints. S6v, 6 € {=, <, >}, where v is a constant, is anti-monotone. In particular, it is anti-monotone for 0 being <. The reason is that if S j£ v, then one of the elements in S is greater than v, and therefore any supersets of S will violate the same constraint because that one 28 element in 5 that is greater than v is also contained in the superset. On the other hand, v 6 S is not anti-monotone because if S does not contain v, it does not necessarily imply all of its superset will not contain v. A counter example is simply S U {v}. Thus, v £ S does not satisfy the condition for anti-monotonicity. For the min-max type constraints, rnin(S) > v is anti-monotone because if the minimum of S is less than v, then the minimum of all its supersets will also be less than v as the minimum can only be made smaller by adding in extra elements. Similarly, since the maximum of a set can only be increased by adding extra elements, the constraint max(S) < v therefore also satisfies the anti-monotone property. The same idea applies to the constraint count(S) < v. The count(S) (i.e. the number of elements in S) can only be increased if the size of the sets is increased. Therefore, if count(S) > v, count(S') will also be greater than v for all supersets 5". Thus, count(S) < v is anti-monotone as well. We can summarize our discussion by the following proposition. Proposition 3.1 For each constraint C listed in Table 3.1, C is anti-monotone if and only if the table indicates so. Optimization Using Anti-Monotonicity The motivation behind the notion of anti-monotonicity and succinctness is to provide performance optimization based on these properties. One naive algorithm for miningconstrained associations is to run Apriori first followed by constraint checking at the end. However, unless the constraints C is very non-restrictive, only a fraction of the answers found by Apriori would satisfy the constraints. A large portion of the computation may turn out to be unnecessary. However, the time spent on Apriori is typically a lot more than the time spent on constraint checking. The algorithm with constraint filtering at the end is therefore very inefficient. The basic idea of improving performance is then to "push" the constraint checking as early as 29 possible so that the constraints can help us prune away candidate sets before support counting. However, one has to be careful in pushing the constraints into lower levels (lower level here means smaller k—value where k is the size of the itemsets). This is because dropping candidate sets at lower levels may lead to missing frequent itemsets at higher levels. We say an algorithm is complete if all frequent sets satisfying the given constraints can be found. In other words, all solutions are included in the answers returned by the algorithm. On the other hand, we also want to make sure that the algorithm is sound, i.e. all the answers found are valid solutions that are frequent and satisfy the given constraints. Anti-monotonicity allows us to prune away candidate sets at each level in a way similar to the case of of frequency constraint. In fact, the standard optimization used for the frequency constraint in the Apriori Algorithm is based on the following property (PI): S, where \S\ = k, is frequent =>• VS' C S where \S'\ = k — 1, S'is frequent, so that whenever any one of the size k — 1 subsets of a size k candidate set is not frequent, the candidate set can be pruned away. In the case of a constrained association query { 5 i , S2IC}, if C am consists of all the anti-monotone constraints in C, including the frequency constraint, then a similar optimization technique can be used based on the following property (P2) generalized from the above property (Pi): ==> V S ' C S where | S ' | = k — 1, S' satisfies Cam- In other words, if Lfc-i consists of all the sets of size k — 1 that satisfy C , then S, where \S\ = k, satisfies C am am the set Ck of candidate sets of size k can be generated in exactly the same way as in the Apriori Algorithm. Then, Ck can be further pruned to become Cj? checking whether each element in Ck satisfies every constraint in C am m by other than the frequency constraint. Any element that violates any constraint in C arn can be dropped away from support counting. As will be seen in Chapter 5, this optimization is incorporated in Algorithm C A P . 30 3.4.2 Succinctness While anti-monotonicity provides strong pruning power, it involves an iterative generate-and-test process: at each level, a list of candidates is generated and then tested for the satisfaction of the constraints. It would be better if we can eliminate the generate-and-test paradigm. This leads to the the following two questions: (i) Under what circumstances can we succinctly characterize the set of all itemsets that satisfy a given constraint ? (ii) Given a constraint that has a succinct description, how can we generate all the itemsets that satisfy the constraint without admitting spurious itemsets ? The answer for the first question leads to the notion of succinctness discussed below while the second question lead us to the notion of a member generating function. To help understand the definition, we consider the sample example in section 3.3, i.e. the minable view consists of the relations t r a n s ( T I D , itemlnfo(Item,Type,Price,Origin). I t e m s e t ) and We denote the set of itemsets that satisfy any 1-var constraint C by SvlTc(Item), which we refer to as the pruned space of C. For example, if C\ is the constraint S . O r i g i n = Canada, then SATc (Item) x contains all those itemsets whose origin is Canada. In the following, a selection predicate refers to any predicate that is allowed to appear as a parameter of a selection operation in relational algebra and a (I) p represents the subset of I containing all the elements of / that satisfy the selection predicate p. In our example, o~ (I) with v I C I t e m is therefore {i G I\3t G t r a n s oo i t e m l n f o : L i t e m = iandt satisfies p}. Moreover, we use the notation 2 to denote the strict powerset of I , i.e. the set of 7 all subsets of I except the empty set. We can now define succinctness as follows. Definition 3.3 (Succinctness) 1. / C I t e m is a succinct set if it can be expressed as cr (Item) for some selection p predicate p. 31 2. SP C 2 I t e m is a succinct powerset if there is a fixed number of succinct sets I t e m i , I t e m * ; C I t e m such that SP can be expressed in terms of the strict powersets of I t e m i , I t e m / ; using union and set difference. 3. Finally, a 1-var constraint C is succinct provided SATQ(Item) is a succinct powerset. Intuitively, the definition states that a constraint on a set of items is succinct if its solution set can be expressed as sums of sets that can be obtained by applying some selection predicate onto the set of all items. The constraint Ci = S . O r i g i n = Canada considered above is succinct because its pruned space SATc {ltem) is simply 2 1 I t e m c , where Item^ = ao ig canada(I'tem). in= r Now, let's consider a more complicated example. If C2 = {snacks, sodas} C S'.Type, then the pruned space consists of all the sets that contain at least one item of type snacks and at least one item of type sodas. Let Item.2, Itera.3, Item4 be the sets o"Ty e=snacA;5(Item),a yp P T e=S0 das(Item),anda Typ e^ „ fc AType^sodas(Item) s ac S respectively. Then, C2 is succinct because the pruned space SATc (Item) can be expressed as: 2 oltem oItem2 altem3 nltem4 nItem2UItem4 oItem3UItem4 Similar to the anti-monotone property, we have classified the representative class of constraints according to whether they are succinct or not in Table 3.1. P r o p o s i t i o n 3.2 For each constraint C listed in Table 3.1, C is succinct if and only if the table says so. We have proved two positive cases above. max(S.A) Now, consider the constraint > c. Let I t e m i = f j 4 ( l t e m ) . Then the pruned space is the powerset J <c of all the items excluding the powerset of Itemi. In other words, SATc(Item) 2 i t e m _ itemi_ 2 = Q n the other hand, sum{S)6v,9 G {=,<,>} is not succinct. The reason is that given any finite union and differences of succinct powersets, we cannot 32 rule out the possibility in general that there exists additional items not belonging to those powersets, but whose addition to S can still satisfy the constraint. Similar arguments apply to the remaining cases, and therefore, for brevity, we now skip the proof of all the other cases except to make a note of those constraints involving count (S). Consider the constraint count(S) < v. This constraint is not succinct and does not have an M G F of the kind introduced in Definition 3.4 below. However, we can have an M G F based on cardinality constraint, i.e. {X\X C Item & | X | < v}. Therefore, we say that the constraint count(S) < v is weakly succinct. Similarly, the other constraints involving count(S) in Table 3.1 are also weakly succinct. We now turn to the next question of how to generate SATc(ltem) given C is succinct. The key concept here is the member generating function. D e f i n i t i o n 3.4 M e m b e r G e n e r a t i n g F u n c t i o n s 1. We say that SP C 2 I t e m has a member generating function (MGF) provided there is a function that can enumerate all and only elements of SP, and that can be expressed in the form { X i U • • • UX \Xi C cr (Item), 1 < i < n, &3k < n pi n : Xj ^ 0,1 < j < k}, for some n > 1 and some selection predicates 2. A 1-var constraint C is pre-counting prunable provided SATc(ltem) has an MGF. For the constraint C\ = S . O r i g i n = Canada discussed above, an M G F is simply {X\X C Itenic&X ^ 0} where Itenic = cr i i canada(Item). As for the 0r constraint C2 = {snacks,sodas} Item & Xi /0& X 2 2 g n= C S.Type, an M G F is {X\ U X U X \Xi 2 C Item &:X2 fb k. X 3 3 Item4 are as defined earlier. 33 3 C C Iten^}, where Item ,Item , and 2 3 Optimizations Using Succinct Constraints The relationship between succinctness and M G F is that every succinct constraint has an M G F . P r o p o s i t i o n 3.3 If C is succinct, then C is pre-counting prunable (i.e. SATc(Item) has an M G F ) . P r o o f Our argument is based on an induction on the number of minus operator in the succinctness expression. When there is no minus operation, SATc(ltem) 2 I t e m i with Itemi = a (ltem) and p being the predicate in C. Then the M G F p is simply = C Itemi & X ^ 0}. We assume that the proposition is true whenever the expression of SATQ(Item) in terms of succinct powersets involves k minus operations. Now, consider any constraint such that SATc(ltem) minus operations, i.e. SAT (Item) c = 2 I t e m - 2 2 I t e m i - 2 I t e m f c Item has k + 1 '=+ , where 1 Itemj are succinct sets, Iteirij = a (Item) for some predicate qi, i = 1,..., k+1. B y gj the induction assumption, SAT (ltem) = {XiU-• -UX \Xi n c which this is equal to {Xi U • • • U X \Xi C a -, (ltem)} n PiA gk+1 C o- .(ltem)}-2 Itemfc p + , 1 of the form stated in Definition 3.4. (One of the Xj has to be non-empty, otherwise SATc(Item) is too trivial). Thus, the proposition is also true for k + 1. The importance of this proposition is that, for succinct constraints, we can now operate in a generate-only fashion, instead of in a generate-and-test manner. Furthermore, the satisfaction of the constraint alone is not affected in any way by the result of the iterative support counting. So far, we have considered only a single constraint. To obtain M G F s for a more general constraint consisting of multiple constraints, we can simply combine the M G F s of two constraints at a time. In other words, suppose C\ and C2 are constraints with M G F s {S\ U • • • U S \Si m a qj C <7 .(ltem)} and {T U • • • U T \Tj C Pi (Item)} respectively, then the M G F for C\ & C2 is simply {R x u 34 n U• • •HRmn\Rij Q o- A (ltem)}. pi Qj We incorporate the optimization idea presented in this section into our proposed C A P Algorithm. Before we present a detailed description of the C A P Algorithm, we turn to some of the basic algorithms in the following chapter. 35 Chapter 4 Basic Algorithms In this chapter, we develop a few algorithms for computing frequent sets that satisfy a set of given constraints. In the next chapter, we will present the Algorithm C A P which takes advantage of the optimization ideas given in Section 3.4. We will compare the performance of C A P and the other algorithms in Chapter 6. 4.1 Apriori + The first algorithm is a straightforward extension of the classical Apriori Algorithm [AS94]. The idea is to run Apriori first, followed by constraints checking at the end. It is called Apriori" " and is listed below. 1 Algorithm 4.1 (Apriori" ") 1 1 C\ consists of all sets of size 1; k = 1; Ans = 0; 2 while (Ck not empty) { 2.1 conduct database scan to form Lk from Ck\ 2.2 form Ck+i from based on Cf ; k + +;} req 36 3 for each set S in some L^: add S to Ans if it satifies (C — Cf ). req Here C is used to denote the set of given constraints, one of which is the frequency constraint, Cf . A l l the other symbols (Ck,Lk, and Ans) have the usual req meaning discussed before in Chapter 2. A p r i o r i is almost the same as the Apriori + (Algorithm 2.1) except in Step 3 where each of the frequent sets computed from Step 1 and Step 2 is verified against the remaining constraints (i.e. C — 4.2 Cf ). req Full Materization Algorithm Apriori" " is not bad if the frequency constraint Cj 1 req the remaining ones (C — Cj ). Teq is more selective than However, the converse may be true. In other words, it is possible that most of the frequent sets found do not satisfy the remaining constraints. In that case, checking (C — Cf ) first, followed by verifying req Cf , req might be more efficient. This lead to the following algorithm which we call Full Materization ( F M ) . A l g o r i t h m 4.2 (FM) 1 C\ consists of all sets of size 1; Ans = 0; 2 C* 3 For every possible S such that for all S' C S A \S'\ = 1 => S' = G Ci; d: add S' to C* if 5" satisfies (C - 4 C ); freq conduct database scan for sets i n C*; add sets that are frequent to Ans. 37 While Algorithm F M avoids counting support for sets that do not satisfy the given constraints, it is not without its own problem. Its problem is in Step 3 where all the sets are generated and tested against the constraints in (C — Cf ). req The number of sets involved grows exponentially with large problem size (i.e. the number of items). In that case, Algorithm F M becomes very inefficient. 4.3 Hybrid(m) If the frequency constraint Cf req straints i n (C — Cf q), re constraint (C — Cf ) reg is much more selective than the remaining con- Algorithm A p r i o r i + is more efficient. Conversely, if the are much more selective, then Algorithm F M performs better. The two algorithms, in fact, represent two extreme cases where either the frequency constraint Cf req or the remaining constraints (C — Cf ) req is clearly more selective than the other. However, the selectivity of the two sets of constraints (Cf req C — Cf ) req and are often comparable. It may be a good idea to combine the above two algorithms in a divide-and-conquer approach where we run one algorithm up to a certain level and switch to the other afterwards. We call the result Hybrid(m), where m is the level when the switching occurs. W i t h m treated as a parameter, Hybrid(m), in fact, represents not only a single algorithm but a class of algorithms. Algorithm 4.3 (Hybrid(m)) 1 C i consists of sets of size 1; k = 1; Ans = 0; 2 while (Ck not empty and k < m) { 3 2.1 conduct database scan to form Lk from Ck; 2.2 form Ck+\ from Lk based on Cf ; k + +;} req if m > 0 then for each set S in some Lk; add S to Ans if S satisfies (C — 4 if Ck not empty then { 38 Cf ); req 4.1 C * — C _|_i; 4.2 for every possible S such that for all S' C S A \S'\ = m + 1 m S" £ C m + i : add S" to C * if 5" satisfies (C 4.3 C ); freq conduct database scan for sets in C*; add sets that are frequent to Ans.} If m = 0, Steps 2 and 3 will not be executed and the algorithm (Hybrid(O)) becomes Algorithm F M . Otherwise, Algorithm Hybrid(m) runs A p r i o r i up to level + m (m > 1) in Step 1 to Step 3. It checks Cf req standard C +i m for m iterations to produce the set which consists of all candidate sets of size m + 1. This takes advantage of the pruning affected by the frequency constraint. Then, instead of continuing with A p r i o r i , Hybrid(m) switches to F M in Step 4. The motivation for + the switch is to reduce the I / O costs involved with the database scanning in each iteration of Apriori"*". A tradeoff is a higher C P U cost in Step 4.2 where all the possible sets are generated and tested again the constraints (C — Cf ). However, req compared with Algorithm F M , the C P U cost for the same step is much less and may become acceptable. In fact, the C P U cost will be much less in Hybrid(n) than in Hybrid(m) whenever n <m. The reason is that the generation in Step 4.2 needs to consider a set S only when all the subsets of size m + 1 of S are in C i. m+ increases, the number of sets in C i m+ As m decreases rapidly, thus reducing the number of sets required to consider in Step 4.2. To summarize, the decision of when the switching should occur (i.e. the value of m) represents a tradeoff between C P U and I / O costs. W i t h small m value, the amount of database scanning required is less and the I / O costs is reduced, at the expense of increasing C P U cost in Step 4.2. On the other hand, if m is larger, the number of candidate sets in Step 4 is smaller and the C P U time is reduced, at the expense of increasing the I / O cost of Step 2. Moreover, the two extreme 39 cases of m = 0 and m = oo correspond to Algorithm F M and Algorithm Apriori respectively. 40 Chapter 5 Algorithm C A P While all the algorithms considered in the previous chapter have their own merits, they all fail to exploit the properties of the constraints. In particular, the anti- monotone and succinct properties introduced in Section 3.4 are not being used. Algorithm C A P discussed in this chapter attempts to make maximum use of these properties by pushing the constraints as deep "inside" the computation as possible. To this end, we classify all the constraints into four categories: I. Constraints that are both anti-monotone and succinct (e.g., min(S) > v); II. Constraints that are succinct but not anti-monotone (e.g., min(S) < v); III. Constraints that are anti-monotone but not succinct (e.g., sum(S) < v); and IV. Constraints that are neither (e.g.,avg(S) < v). A different strategy is tailored for each of the above cases. 41 5.1 Succinct and Anti-monotone Constraints When the constraint C is both succinct and anti-monotone, we propose that the general form of the M G F for SATc{Item) in Definition 3.4 reduces to the form {S|S C crp(Item) & S ^ 0}. Then the set C\ of candidate sets of size one in the Apriori Algorithm can simply be replaced by C f = {e|e G C\he £ cr (Item)}. p Moreover, since C is anti-monotone, sets containing any element not i n C f need not be considered. Therefore, we have the following strategy proposed for succinct anti-monotone constraints. Strategy I: • Replace C\ in the Apriori Algorithm by C f defined above. 5.2 Succinct but Non-anti-monotone Constraints In the case of succinct non-anti-monotone constraints, one cannot guarantee that all sets satisfying the constraints must be subsets of Cf. For example, consider the constraint S.Type D {sodas} which says S includes at least one soda item. It is succinct because SATc{ltem) has a M G F {Si U S2IS1 C axj -< d (Item) & S i ^ /pe 0 & S 2 C Item}. i0 as If a set S does not contain a soda item, S's supersets may. Therefore, the constraint is not anti-monotone. For this constraint, Cf would contain all the size 1 itemsets where the items in the itemsets are all sodas. However, sets such as {Coke, Milk} satisfy the constraint but are not subset of Cf. Thus, Strategy 1 is not complete. Fortunately, in this case, we can make use of the structure given by the M G F of the constraint. As in our example, we consider M G F ' s that are of the form {Si U S2IS1 C tT (Item) & S2 C cr (ltem)}. pi p2 Strategy II: 42 • Define Cf = cr (ltem) and C f pi c = <7 P2 (Item). Define corresponding sets of frequent sets of size 1: L\ = {e|e G C{ Sz freq(e)}, and L^ c = {e\e G Cr & /reg(e)}. c • Define C2 = {{e,/}|e 6 G (Lf U I ^ ) } , and L2, as usual, the set of 0 frequent sets in C2, i.e. , L = {SIS G C2&/reg(S)}. 2 • In general, Cfc+i can be obtained from Lk in exactly the same way as in the Apriori Algorithm with only one modification. In the classical case, a set S is in Cfc+i, if all its subsets of size k are in Lk, i-e. , VS' : S' C S and | S ' | = k =*• S' G X f c In the case of succinct, non-anti-monotone constraints, we only need to check subsets S' that intersect with L\. In other words, subsets S" that are disjoint with L\ are excluded in the above verification. These are sets that only contain elements from L^ , and that are not counted for support. Because we do not c know whether they are frequent or not, we give them the benefit of the doubt. Hence, we have the following modification in candidate generation. A set S is in Ck+i if for all subsets: V S ' : S' As usual, L \ k+ C S and S' D L\£ 0 => S' G L. k = { S | S G C +\ & freq(S)}. k Example 5.1 Suppose that Item is the set { 1 , . . . , 100}, Cf is { 1 , . . . , 50}, and C^ c is { 5 1 , . . . , 100}. Furthermore, suppose that L\ and L ^ turn out to be { 1 , . . . , 20} c and { 7 1 , . . . , 100}. Then, C2 is constructed by considering the Cartesian product { 1 , . . . , 20} x { 1 , . . . , 2 0 , 7 1 , . . . , 100} which gives all the size 2 sets that can be frequent and that contain at least one element from C\. We further suppose that {1, 71} and {1, 72} are frequent, and that {1, 73} is not. Now, in considering whether {1, 71, 72} should be in C 3 , we only check whether {1, 71} and {1, 72} are frequent. 43 In the classical case, we would check {71,72} as well. However, we do not check it here because its support is unknown. Given the benefit of the doubt, {1,71,72} is included in C . However, neither {1,71,73} nor {1,72,73} is in C , because {1,73} 3 3 is not in L . 2 We can generalize the above Strategy for more general M G F . For example, the constraint C = {snacks, sodas} C S.Type is succinct, but not anti-monotone. 2 Its M G F is {Xi U I U X \Xi 2 3 C Item & X x 2 ^ $&X 2 7^ 0 & X C Item & X 2 3 3 C I t e m i } , where Item2, Item3, and I t e n u are as defined earlier. Our strategy is to first form the sets L* { = {e\e € Itemj+i & freq({e})} for each Xi,i — 1,2,3. Then we generate the candidate set C = L* 1 2 x From C , we can find L 2 2 the support of the sets in C ). Next, we form C = L x ( i f 2 3 2 1 Ui f 2 (by counting ULf ). After 3 that, Lk and Ck+i are computed as usual, with the only modification in candidate generation that S is in C k+l and S ' n L f 5.3 2 7^ 0 iff: for all subsets S' C S and \S'\ = k and S ' n i f 1 7^ 0 S' G L*. Non-Succinct but Anti-monotone Constraints So far, we have assumed that the constraint is succinct. If the constraint C is not succinct, then it does not admit an M G F which can be used to avoid the generate-and- test paradigm completely. Fortunately, we can still make use of the anti-monotone property. This is done by checking whether the candidate sets, generated at each level, satisfy C, before counting is done. Because of anti-monotonicity, candidate sets that do not satisfy C can be safely dropped right away without counting for their support. Thus, we have the following strategy. Strategy III: • Define C as in the classical Apriori Algorithm. Drop a set S E C from countk k 44 ing whenever S fails satisfying the constraints, i.e. , constraint satisfaction is tested before counting is undertaken. 5.4 Non-Succinct and Non-anti-monotone Constraints If the constraint C is neither succinct nor anti-monotone, then it seems that we cannot use the constraint to perform any optimization. Fortunately, from our experience, many such constraints induce weaker constraints that might be anti-monotone and/or succinct. If such constraints can be found, then they can be exploited using one of the strategies outlined above. In other words, the weaker constraints are used to find the frequent sets. The only modification is that at the end of the algorithm, one final round of testing the frequent sets for satisfaction of the original constraint C is necessary. It is because the frequent sets, while guaranteed to satisfy the weaker constraints, may not necessarily satisfy C. If the weaker constraints do not admit too many spurious solutions (i.e. frequent sets that satisfy the weaker constraints, but not C ) , then the above strategy should be fairly efficient because the last of constraints checking is relatively trivial. E x a m p l e 5.2 Consider the constraint C = avg(S.A) > v. It is neither succinct nor anti-monotone. However, it induces the weaker constraint C = max(S.A) > v. In other words, every set S that satisfies C is guaranteed to satisfy C. Since C is a succinct, but non-anti-monotone constraint, Strategy II can be applied. After all frequent sets are found, each one of them has to be tested for satisfaction of C to generate the final answers. S t r a t e g y IV: • Induce any weaker constraint C from C. Depending on whether C is antimonotone and/or succinct, use one of the strategies I-III above for the gener45 ation of frequent sets. • Once all frequent sets are generated, test them for satisfaction of C. 5.5 Multiple Constraints So far, we has assumed that we are dealing with a single constraint. However, the set of constraints C in a C A Q may more often contain multiple constraints. A n important question is then how the strategies we proposed for individual constraints can be combined to handle multiple constraints. To this end, the set of constraints C is divided into four sets of constraints according to whether the constraints are succinct and/or anti-monotone. We denote the set of constraints that are both C . succinct and anti-monotone by sam succinct but not anti-monotone; C am but not succinct; Finally, C none Similarly, C suc denotes constraints that are denotes constraints that are anti-monotone denotes constraints that are neither anti-monotone nor succinct. Strategy for Handling Multiple Constraints: • Combine the M G F s of all the constraints in C m- Apply Strategy I with the sa combined M G F . • Combine the M G F s of all the constraints in C c- Apply the Strategy II with SU the combined M G F . • For all constraints in C a T O , follow Strategy III. • For each constraint in C e, induce weaker constraints and apply Strategy non IV. Example 5.3 Let C be the constraints min(S.Price) > 100 & S.Type 2 {sodas}. Then, C sam = mm(S'.Price) > 100 and C suc 46 = S.Type 2 {sodas}. To find fre- quent sets that satisfy C, we apply Strategy I with the M G F of the constraint m m ( S . P r i c e ) > 100 and Strategy II with the M G F of the constraint S.Type D {sodas}. 5.6 Summary The basic algorithms discussed in the previous chapter do not take into account the property of the constraint. In designing our optimization strategies, we make use of the two identified properties, namely, anti-monotonicity and succinctness, and try to apply constraints checking as early as possible. We design four different strategies ( S t r a t e g y I - I V )depending on whether the constraint is anti-monotone and/or succinct. We also discuss how to handle multiple constraints i n Section 5.5. We incorporate all the optimization strategies discussed in this chapter into the following algorithm called C A P (Constrained Apriori) for finding frequent sets that satisfy a set of given constraints C. A l g o r i t h m 5.1 ( C A P ) 1 if C sam UC suc UC none is non-empty, prepare C\ as indicated in Strategies I, II and IV; k = 1; 2 3 if C suc is non-empty { 2.1 conduct database scan to form L\ as indicated in Strategy II; 2.2 form C2 as indicated in Strategy II; k = 2; } while (Cfc not empty) { 3.1 conduct database scan to form L from C \ 3.2 form C +i from L based on Strategy II if C k k k k suc and Strategy III for constraints in C ;} am 47 is non-empty, 4 if C none is empty, Ans = \JL . Otherwise, for each set S in k some Lk, add 5* to Ans iff S satisfies C . none Algorithm C A P is one of the most important contribution in this thesis. As we will see in the following chapter, the performance of Algorithm C A P is much better than the other basic algorithms discussed in Chapter 5. To conclude this chapter, we state the soundness and completeness of Algorithm C A P as a proposition. Proposition 5.1 A constraint C can be in one of the categories: I Succinct and Anti-monotone II Succinct but Non-anti-monotone III Non-succinct but Anti-monotone IV Non-succinct and Non-anti-monotone If C is in category X (X=I-IV), then Strategy X proposed in Sections 5.1-5.4 is sound and complete with respect to computing the frequent sets that satisfy C. 48 Chapter 6 Experimental Evaluation In this chapter, we evaluate the performance of the various mining algorithms given in Chapter 4 and 5 based on our experimental results. 6.1 Implementation and Experimental Environment We implemented the algorithms Apriori" ", Hybrid(m), and C A P in C. Instead of 1 writing three different programs, we wrapped all the algorithms under one single program which we call caq. The rationale is to try to share as much data structures and library modules as possible so that a fair comparison can be achieved. In running caq, choosing which mining algorithm to use is simply a matter of supplying different flags. The command caq - a p r i o r i (-hybrid(m) , -cap) finds all the frequent itemsets with the Apriori" " (Hybrid(m), C A P respectively) Algorithm. The different 1 constraints were hard-coded using compiler directives. One major decision in the implementation was what data structures should be used in representing itemsets. Our initial attempt was to use bit vectors. Set operations such as determining subset relationship were achieved by bitwise oper- 49 ations which were very efficient. However, we later realized that using bit vectors leads to a memory problem. W i t h 1000 items, each itemset would require 125 bytes or 32 integers (assuming 4 bytes for an integer). The number of itemsets in Ci could be of the order of 500K. This implies that at least 60MB is required just to hold C2! Therefore, we gave up the bit vectors representation. Instead, we used lists. A size k itemset is represented by a list of k integers where each integer corresponds to the item's identification number. Since the maximum size of a large itemset is typically less than 10, we save at least 2/3 of the memory when compared with bit vector representation. We used the program gen developed at I B M Almaden Research Center [AS94] to generate transactional databases. Gen is a program that can generate data synthetically for both associations and sequential patterns. It was downloaded from the I B M Q U E S T web site (http://www.almaden.ibm.com/cs/quest). It requires several input parameters. The most important ones are the number of transactions in the databases and the number of items. While we experimented with various databases, the results cited below are based on a database of 100,000 transactions and a domain of 1000 items. Moreover, the average number of items in a transaction is 25 and the maximal size of a large itemset is set to 10. The items are represented by integers from 1 to 1000. The page size used was 4 Kbytes. Our experiments were conducted in a time-sharing SPRAC-10 environment. The memory size was 128MB and the typical number of users was about five. We scheduled our experiments mainly at night so that the workload for other processes was relatively low and uniform. We also made sure that our program was the only major running job. 50 6.2 Succinct and Anti-monotone Constraints Our first set of experiments compares the various algorithms using the succinct and anti-monotone constraint max(S.Price) < v. Different values of v correspond to different selectivities of the constraint. If there are x-% of the items whose price is less than or equal to v, then we say the selectivity is x-%. The translation between selectivity and v is as follows. We number the items from 1 to 1,000. Without loss of generality, the P r i c e value of each item is taken to be exactly its number. Then, the selectivity for a particular v value is simply We then run our experiments with different selectivities and different support values. We show our results by plotting speedup against selectivity and support threshold. The speedup of various algorithms is denned to be the execution time of Apriori " divided by that of the algorithm being considered. In other words, we 4 measure the speedup relative to Algorithm Apriori" ". Figure 6.1 plots the speedup as 1 a function of constraint selectivity, with support threshold set at 0.5%, and Figure 6.2 shows the relationship between the speedup and support threshold, with the selectivity fixed at 30%. We first discuss our results for the Hybrid(m) algorithms. We find that Hybrid(O), Hybrid(l), and Hybrid(2) all take much too long when compared with Apriori ". This shows that the C P U cost of finding all the sets that satisfy the 4 constraints in a generate-and-test mode is prohibitive. Hybrid(3), Hybrid(4), etc. take more or less the same time as Apriori" ", almost always within 5% of each other. 1 Algorithm Hybrid(m) is the same as Algorithm Apriori" " at the first m iterations. It 1 then tries to reduce remaining database scanning time at the expense of increasing C P U time. However, the I / O cost is dominated by the first few iterations. Therefore, for m — 3,4,.., savings on the I / O cost corresponding to later iterations do not lead to any observable reduction in execution time when compared with Apriori" ". It is 1 quite possible that with a larger database size, Hybrid(m) algorithms will become 51 Speedup vs Selectivity 0 10 20 30 40 item selectivity (%) 50 60 Figure 6.1: Speedup vs Selectivity for a Succinct and Anti-monotone Constraint 52 70 Speedup vs Support 0.2 0.4 support threshold (%) Figure 6.2: Speedup vs Support for a Succinct and Anti-monotone Constraint 53 0.6 selectivity 5 10 30 50 70 Li L L U L L L L 19/362 0/66 0/91 0/105 0/77 0/35 0/9 0/1 40/362 0/66 0/91 0/105 0/77 0/35 0/9 0/1 112/362 14/66 10/91 5/105 1/77 0/35 0/9 0/1 66/362 26/66 21/91 15/105 6/77 1/35 0/9 0/1 254/362 41/66 45/91 40/105 22/77 7/35 1/9 0/1 2 3 5 6 7 8 Table 6.1: Number of frequent sets that satisfy max(S.Price < v) /number of frequent sets for different selectivities (v/10) with support = 0.5% more efficient than Apriori" ". However, it is quite unlikely that their efficiency can be 1 comparable with Algorithm C A P . Therefore, in the following discussion, our focus will be on Algorithm C A P . We include Table 6.1 and 6.2 to help illustrate our results. In both tables, each column corresponds to a different setting. The rows correspond to the sizes of the frequent sets. Each entry is of the form a/b, where a is the number of frequent sets satisfying the constraint, and b is simply the total number of frequent sets of that size. Table 6.1 contains results for the various selectivities used in Figure 6.1, whereas Table 6.2 shows the results for different support thresholds corresponding to the settings used in Figure 6.2. We now consider the relationship between speedup and item selectivity. As one can see from Figure 6.1, Algorithm C A P clearly outruns A p r i o r i . + At 10% selectivity, the speedup for C A P is about 80 times. Even at a selectivity of 30%, C A P still runs 10 times faster than Apriori" ". 1 One can understand such a huge speedup (> 80) of C A P for selectivities less than 10% by examining the first two column in Table 6.1. For a 5% selectivity (i.e. the first column), A p r i o r i has to find all the 362 frequent sets of size 1, 66 + 54 support 0.6% 0.5 % 0.4% Li L U U L Le L 98/313 1/12 0/1 0/0 0/0 0/0 0/0 0/0 112/362 14/66 10/91 5/105 1/77 0/35 0/9 0/1 131/421 22/160 11/186 5/210 1/70 0/18 0/8 0/2 2 5 7 L 8 0.3% 0.2% 159/503 40/333 17/390 6/427 1/309 0/140 0/36 0/4 174/582 79/969 29/1140 8/1250 1/934 0/451 0/132 0/20 Table 6.2: Number of frequent sets that satisfy max(S.Price) frequent sets for different support values at 30% selectivity < v) /number of frequent sets of size 2, 91 frequent sets of size 3, • • • and so on , before it stops. O n the contrary, C A P stops after the first iteration. The number of frequent sets it has to find is only 19. Furthermore, the number of database scan it has to perform is only 1 compared with 8 for Apriori" ". Similarly, C A P can stop after the first 1 iteration for 10% selectivity. A t 30% selectivity, C A P stops after 5 iterations and the speedup is smaller than 80. But still, the speedup is as large as 10 times. This is because the number of frequent sets (i.e. 112, 14, 10, 5, and 1 for frequent sets of sizes 1, 2, 3, 4, and 5 respectively) C A P has to process is much smaller than that of A p r i o r i ' s (i.e. 362, 66, 91, 105, and 77 for frequent sets of sizes 1, 2, 3, 4, and 5 + respectively). The trend shown in Figure 6.1 is that as the constraint becomes less restrictive (i.e. as selectivity increases), the speedup of C A P decreases. This can also be understood by examining the entries in Table 6.1. As the constraint becomes less restrictive, the pruning power provided by the constraint decreases. The number of frequent sets which need to be found by C A P increases (as shown in Table 6.1). Savings obtained by pushing the constraint in the C A P Algorithm become less obvious, thus the speedup decreases. 55 We now turn to the relationship between speedup and support threshold. The graph in Figure 6.2 shows that the speedup of C A P is about 7 to 11 times for support threshold from 0.2% to 0.6%. We choose the support thresholds such that A p r i o r i + requires between 5 and 8 iterations. The speedup shown in Figure 6.2 decreases slowly with increasing support threshold. When the support threshold is low, the number of frequent sets is large. Most of the frequent sets may turn out not to satisfy the constraint. For example, for a support of 0.2% and sets of size 4, A p r i o r i finds 1250 frequent sets, but only + 8 of which need to be found by C A P ! Moreover, Apriori" " has to find frequent sets 1 of size 6 and up, whereas C A P can stop after size 5. Thus, A p r i o r i + is relatively much less efficient than C A P leading to a speedup of more than 11 times. When the support threshold is high, the total number of frequent sets is smaller, and Apriori" " behaves relatively better. 1 At 0.6% support, Apriori" " only 1 iterates one more time than C A P . However, it still processes too many frequent sets potentially violating the constraint, and the speedup for C A P is still as large as 8 times. To conclude, if the support thresholds are chosen to be small, A p r i o r i would + require more iterations and the overall speedup of C A P would be larger. If the support thresholds are chosen to be higher so that number of iterations required by A p r i o r i decreases, the speedup achieved by C A P would be lower, but it would still + be substantial. The reason is that the major savings contribution achieved by C A P come from the very first few iterations. 6.3 Succinct but Non-anti-monotone Constraints In the next series of experiments, we use the succinct but non-anti-monotone constraint {soda} C S'.Type. Similar to the previous case, we illustrate our results by 56 selectivity (%) 10 13 15 20 50 70 90 Li L 40/362 0/66 0/91 0/105 0/77 0/35 0/9 0/1 54/362 12/66 15/91 20/105 15/77 6/35 1/9 0/1 60/362 19/66 36/91 55/105 50/77 27/35 8/9 82/362 35/66 67/91 89/105 71/77 34/35 9/9 184/362 57/66 87/91 104/105 77/77 35/35 9/9 254/362 65/66 91/91 105/105 77/77 35/35 9/9 324/362 66/66 91/91 105/105 77/77 35/35 9/9 1/1 1/1 1/1 1/1 1/1 2 L3 Li L Le L L 5 7 8 Table 6.3: Number of frequent sets that satisfy {soda} C S.Type /number of frequent sets for different selectivities with support = 0.5% plotting speedup against item selectivity (see Figure 6.3) and speedup versus support threshold (see Figure 6.4). A n x-% selectivity in Figure 6.3 means that there are x - % of the items whose type is soda. Once again, C A P clearly outruns other algorithms. For example, for selectivities of 5%, 10%, and 20%, C A P runs 9, 4, and 2.5 times faster than Apriori " 4 respectively (see Figure 6.3). At a selectivity of 13%, C A P can achieve a speedup of about 3.5 times for support thresholds between 0.3% and 0.6% (Figure 6.4). To capture the pruning achieved by C A P , we show the number of frequent sets that satisfy the constraint (a) and the number of frequent sets (b) in the form a/b in the Tables 6.3 and 6.4. Table 6.3 corresponds to the settings used in Figure 6.3, whereas Table 6.4 uses the same settings as that in Figure 6.4. When compared with the previous case (i.e. the case of succinct and antimontone constraints), the gain here shown by C A P comes entirely from the succinctness of the constraint. For example, at 5% selectivity, the speedup is about 8 times. This shows the relevance of the succinctness property in performance optimization. When compared with the previous case, the speedup is smaller. A t 5% selectivity, it was 80 times in the previous case compared with 8 times here. This 57 Speedup vs Selectivity 20 30 40 50 60 item selectivity (%) Figure 6.3: Speedup vs Selectivity for a Succinct and Non-anti-monotone Constraint 58 70 Speedup vs Support 6 5 - 4 - CAP ~~~~ Q. 1 3 - CD CL CO 2 - 1 Apriori+ / Hybrid(3) 0 I 0.3 1 1 0.4 0.5 support threshold (%) Figure 6.4: Speedup vs Support for a Succinct and Non-anti-monotone Constraint 0.6 support 0.6% 0.5 % 0.4% 0.3% Li L 49/313 3/12 0/1 0/0 0/0 0/0 0/0 0/0 54/362 12/66 15/91 20/105 15/77 6/35 1/9 0/1 60/421 39/160 64/186 85/210 70/154 34/70 9/18 1/2 72/503 96/333 157/390 215/427 180/309 90/140 25/36 3/4 2 u L 5 L L L 6 7 8 Table 6.4: Number of frequent sets that satisfy {soda} C S'.Type /number of frequent sets for different support values at 13% selectivity demonstrates that a constraint that is both succinct and anti-monotone effects much more pruning than a constraint that is only succinct. Succinctness combined with anti-monotonicity can be exploited to produce a powerful compound effect on performance optimization. One way to understand why the speedup in this case is smaller than the previous case is to compare the number of frequent sets that satisfy the constraints in Table 6.1 with that in Table 6.3. A t the same selectivity, the number of frequent sets that satisfy the succinct but non-anti-monotone constraint {soda} C S.Type is larger than that for the succinct and anti-monotone constraint maa;(S.Price) < v. For example, at 50% selectivity, the number of frequent sets that satisfy the constraint {soda} C S.Type is 184, 57, 87, and 104 for sizes 1, 2, 3, and 4, whereas those numbers are 66, 26, 21, and 15 for the constraint maa;(S.Price) < v. This is reasonable as the latter constraint is also anti-monotone and therefore intuitively it should be more restrictive. The speedup for a more restrictive constraint should be higher which is consistent with what we observe in our experiments. One cannot compare directly Table 6.2 with Table 6.4 because the selectivity for the two tables are set at different values. Nevertheless, we can still see that the 60 number of frequent sets that satisfy the constraint m a x ( S . P r i c e ) < v decreases much more rapidly with the size of the frequent sets than that of the constraint {soda} C S.Type. This is again consistent with the observation that the speedup in this case is less than the previous case. 6.4 Non-Succinct but Anti-monotone Constraints Next, we consider the case of non-succinct but anti-monotone constraints. constraint used is sum(S.PTice) < MaxSum. The Similar to the case of succinct and anti-monotone constraints, we assume that the P r i c e value of each item is exactly its number, i.e. from 1 to 1,000. MaxSum is set to be 500, 1000, 2000 and so on. In contrast with the previous cases, there is no simple translation from to item selectivity. Thus, we plot speedup against MaxSum MaxSum in Figure 6.5, instead of speedup against item selectivity. Contrary to the previous two cases, the dominance of C A P drops very rapidly as the value of MaxSum for MaxSum increases. While the speedup is above 7 times 500, C A P shows no gain over Apriori " for MaxSum 4 > 2000. It can be explained by Table 6.5. When MaxSum is 500, the constraint helps reduce the number of frequent sets at level 1 by a half (i.e. 184 out of 362). This reduction is compounded at subsequent levels. Such pruning provided by the constraint leads to a speedup of over 7 times. When MaxSum reaches 1000, all the size 1 frequent sets pass the constraints. Pruning provided from the constraints only starts at level 2 (i.e. 39 out of 66). This modest reduction is then compounded at higher levels, giving an overall speedup of about 2 times. However, when MaxSum is increased to 2000, there is no pruning at the first two levels. Pruning only comes at higher levels. Since the major portion of the computation is spent on lower levels, the pruning coming from 61 Speedup vs MaxSum 1000 2000 MaxSum Figure 6.5: Speedup vs Selectivity for a Non-succinct and Anti-monotone Constraint 62 3000 Speedup vs Support 0.4 0.5 support threshold Figure 6.6: Speedup vs Selectivity for a Non-succinct and Anti-monotone Constraint 63 0.6 MaxSum 500 1000 2000 3000 Li L L U L U L 184/362 16/66 4/91 0/105 0/77 0/35 0/9 0/1 362/362 39/766 28/91 15/105 1/77 0/35 0/9 0/1 362/362 66/66 86/91 77/105 40/77 13/35 1/9 0/1 362/362 66/66 91/91 105/105 72/77 30/35 8/9 0/1 2 3 5 7 LB Table 6.5: Number of frequent sets that satisfy sum(S.Pxice) < MaxSum) ber of frequent sets for different MaxSum with support = 0.5% support 0.6% 0.5% 0.4% 0.3% Li L Lz L L Le L 160/313 1/12 0/1 0/0 0/0 0/0 0/0 0/0 184/362 16/66 4/91 0/105 0/77 0/35 0/9 0/1 207/421 32/160 5/186 0/210 0/154 0/70 0/8 0/2 250/503 58/333 13/390 1/427 0/309 0/140 0/36 0/4 2 4 5 7 L 8 Table 6.6: Number of frequent sets that satisfy sttm(S'.Price) < MaxSum) ber of frequent sets for different support values with MaxSum = 500 64 /num- /num- later levels does not help. Therefore, we can conclude that in order for any pruning optimization to be significant, it has to be effective as early as possible at the first two levels. In fact, any reduction from lower levels is compounded to higher levels giving rise to a significant speedup. In Figure 6.6 and Table 6.6, we show our results for the relationship between speedup and support threshold. Similar to the other cases, if the constraint produces fairly large pruning, the speedup is as large as 5 to 8 times for various support thresholds from 0.3% to 0.6%. 6.5 Non-Succinct and Non-anti-monotone Constraints Finally, the last category to investigate is non-succinct and non-anti-monotone constraints. In particular, we consider the constraint Avg(S.Price) < v. This constraint is difficult to optimize. Fortunately, it induces the weaker constraint m m ( S . P r i c e ) < v. In order for S to satisfy the first constraint, S has to satisfy the latter constraint. Since the constraint m m ( S . P r i c e ) < v is succinct but nonanti-monotone, we can use Strategy II in Section 5.2 to compute frequent sets that satisfy it. Then, for each of the frequent sets found, we check whether it satisfies the original constraint Avg(S.Price) < v. Figure 6.7 and 6.8 shows our results. Similar to the other cases, we also supplement the figures with two tables (Table 6.7 and 6.8) to illustrate what is going on. Instead of just a/b, the entries in these tables are in the form a/b/c where a, b, and c are the number of frequent sets that satisfy the original constraint, the induced constraint and no constraint respectively. Figures 6.7 and 6.8 show that the speedup of C A P over Apriori" " is very sim1 ilar to the case for a typical succinct but non-anti-monotone constraint, even though only a small portion of the frequent sets that satisfy the induced constraint does 65 Speedup vs Selectivity 20 30 item selectivity 40 Figure 6.7: Speedup vs Selectivity for a Non-succinct and Non-anti-monotone Constraint 66 50 Speedup vs Support SCD 2 CL W 0 0.4 0.5 support threshold (%) Figure 6.8: Speedup vs Support for a Non-succinct and Non-anti-monotone Constraint 67 0.6 AvgPrice 50 130 145 200 500 Li L L 19/19/362 0/3/66 0/0/91 0/0/105 0/0/77 0/0/35 0/0/9 0/0/1 54/54/362 1/12/66 0/15/91 0/20/105 0/15/77 0/6/35 0/1/9 0/0/1 59/59/362 2/19/66 0/36/91 0/55/105 0/50/77 0/27/35 0/8/9 0/1/1 82/82/362 9/35/66 7/67/91 5/89/105 1/71/77 0/34/35 0/9/9 0/1/1 184/184/362 39/57/66 64/87/91 77/104/105 63/77/77 30/35/35 8/9/9 1/1/1 2 3 LA L Le L L 5 7 8 Table 6.7: Number of frequent sets that satisfy Avg(S.Price) < AvgPrice / number of sets that satisfy m m ( S . P r i c e ) < AvgPrice / number of frequent sets for different values of AvgPrice with support = 0.5% support 0.6% 0.5% 0.4% 0.3% Li L L L L Le L 74/74/313 0/5/12 0/0/1 0/0/0 0/0/0 0/0/0 0/0/0 0/0/0 82/82/362 9/35/66 7/67/91 5/89/105 1/71/77 0/34/35 0/9/9 0/1/1 96/96/421 18/79/160 10/131/186 5/174/210 1/141/154 0/68/70 0/18/18 0/2/2 114/503/503 38/158/333 23/268/390 12/350/427 2/282/309 0/136/140 0/36/36 0/4/4 2 3 A 5 7 L s Table 6.8: Number of frequent sets that satisfy Avg(S.Price) < AvgPrice / number of sets that satisfy m m ( S . P r i c e ) < AvgPrice / number of frequent sets for different support values with AvgPrice = 200 68 happen to satisfy the original constraint. This indicates that the final verification step takes a relatively small amount of time and the idea of using induced weaker constraints works quite well. 6.6 Summary To summarize, among all the various algorithms discussed in Chapters 4 and 5, C A P is the most efficient. Hybrid(m) m > 3 are comparable with Apriori" ". O n 1 the other hand, Hybrid(O), Hybrid(l), and Hybrid(2) are the slowest. While the exact speedup of the C A P Algorithm over the Apriori" " depends on many factors 1 and varies under different situations, we find that C A P can easily outrun Apriori" " 1 by a factor of 5 to 10 in many cases. We also find that the speedup increases with decreasing item selectivity (i.e. with more restrictive constraints) and decreases with increasing support thresholds. The main factor in determining the speedup of C A P is on how much pruning the constraints provide. Both succinctness and anti-monotonicity produce significant pruning power and their combined effect is even stronger. Thus, the speedup of C A P can be as much as 80 times in the case of succinct and anti-monotone constraints. 69 Chapter 7 Conclusions 7.1 Conclusions There has been an increasing demand for discovering useful information from very large databases. Data mining, which can be defined as an automatic process of discovering hidden useful patterns from large databases, has recently generated a high degree of interest. Among the most popular data mining patterns discovered in large transactional databases are association rules. A classical association rule takes the form of X Y, where X U Y is a set of items that appears at least minsup% of the times in the database and among those transactions that contain X, at least c% of them also contains Y. Minsup and c are two input parameters referred to as the support threshold and the confidence respectively. The motivation of mining association rules is to find out what items tend to appear together. In a classical association query, the user only need to specify the values for the support threshold and the confidence. The answer to the query is a list of all the association rules found. However, this classical association mining model suffers from (i) lack of user 70 exploration and control, (ii) lack of focus, and (iii) rigid notion of relationship. In particular, there may be no association rules found or thousand of rules found with most of them not being what the user is looking for. In this thesis, we have defined a model where a user can specify additional constraints in the query. The type of constraints can include class, domain, and aggregation constraints. Then, the system can focus the mining task on associations that satisfy the specified constraints. This allows the system to find associations that have a better chance of satisfying the user's need. We have suggested a two-phase architecture for the mining process providing numerous opportunities for user feedback, control, and approval. More importantly, we develop techniques for pushing the constraints deep inside the mining process to optimize performance so that the work performed by the system is commensurate with the focus expressed by the user via constraints. Towards this end, we have identified the anti-monotonicity and succinctness properties according to which we characterize constraints into various categories. For each category, we have developed specific strategies for finding frequent itemsets that satisfy the given constraints. The resulting algorithm, called C A P (Constrained Apriori), is based on the basic idea of exploiting the given constraints as early as possible in the mining process. We have implemented both C A P and several other basic algorithms that do not use the anti-monotonicity and succinctness properties offered by the constraints. A series of experiments were performed to compare the performance of the various algorithms. We found that C A P clearly outruns all the other algorithms. The speedup of C A P over other algorithms is about 5-10 times on average and more than 80 times in the best case. To enable the technology of data mining to reach its full potential, many researchers believe that it is important to have a framework where a user can interact with a mining system in much the same way that the user does with a Relational Database Management System today (e.g, [IM96]). We believe that the work presented in this thesis represents an important step in this direction. 71 7.2 Future Work While we believe that the work presented i n this thesis is significant, due to time and scope limitation, several important issues have been left out for future work. They are discussed below. (1) Phase II: The entire thesis has been focused on Phase I of finding frequent itemsets. The implementation of Phase II, the pros and cons of different significance metrics in the association relationship, and the corresponding algorithms are not discussed. In fact, all of these issues will be investigated in another thesis work [Phase II]. (2) 2-var Constraints: Our investigation has been focused on 1-var constraints in this thesis. A similar idea of pushing constraints deep into the computation can also be applied to 2-var constraints. We have already carried out a detailed analysis of 2-var constraints. Details will be found in a forthcoming paper. (3) Segment Support: Next, we discuss an optimization technique which is not specific to C A Q mining, but general enough that it can be applied to classical association queries and its variants. The technique involves dividing transaction databases into segments. Given a transaction database V, we partition it into M segments, denoted by T>i(i = 1,..., M ) , in' some arbitrary way. Then, instead of using just one integer for the support of a candidate set, we use M integers where each integer corresponds to the support of the set over one transaction database segment. We refer to the M integers as the segment supports: Definition 7.1 (Segment Support) Given a transaction database V and a partition {V^i = 1..M} of the database (i.e. V = lifLi^h TJ^Vj = 0 for i ^ j), the ith segment support of any set S is defined to be the support S over T>i and is denoted 72 S {A,B} {A,C} {B,C} Pi 100 200 200 P 200 200 100 P 300 400 300 {A,B,C} 100 100 200 2 Table 7.1: A n example to illustrate the idea of segment support by Supporti(S). The sum of the segment supports gives the total support which is the support in usual sense. We note that segment support satisfies the same anti-monotone property of the total support. In other words, VS, L, S C L =>• Supporti(S) > Supporti(L). The motivation of introducing segment support is best illustrated with an example. Consider the following simple example where the number of segments is two (i.e. M = 2) and the support threshold is 250. The first three rows in Table 7.1 show the segment support for the three set {A,B},{A,C}, column shows the total support of the three sets. and {B,C}. The last Since all (total) supports are larger than 100, the support threshold, all the sets, {A, B}, {A, C}, and {B, C } , are size two frequent sets. Now, consider the generation of size three candidate sets. Without the information provided by segment support, the set {A, B, C} would be a valid candidate set in C3. However, the maximum possible segment support for the set {A, B, C} in each segment has to be equal to the minimum among the segment supports over the same segment of its subsets. The reason for that is because of the anti-monotone property of the segment support discussed above. In segment one, the minimum of the segment supports among {A, B}, {A, C}, and {B,C} is 100. Therefore, the segment support of {A, B, C} can be at most 100. Similarly, 73 the maximum segment support of {A, B, C} in segment two is 100. Thus, the total support of {A, B, C} must be equal or less than 200. This implies the set {^4, B, C} cannot be frequent and can be pruned away from C 3 before counting for support. The above example illustrates that we can prune away many potential candidate sets in the generation of Ck+\ from Lk- The power of this pruning power is based on the following inequality. For any set S G 'P(ltem), M Support(S) < ^2 min {Supportj j=i ' 5 where Support(S) (Si)}, (7.1) c 5 means the total support of S and the minimum operator runs over all the S"s subsets. The right hand side of the above equation can be taken as the estimated support for the set S. To prove the above inequality, we note that the total support of S is equal to the sum of the support of S over each database segment Di, i.e. Support(S) ^2jL\ Supportj(S). However, as a support function, we have Supportj(S) VSj C S, and thus, Supportj(S) < = Supportj(Si), < mins s{Supportj (Si)} which leads to the above iC inequality. The pruning power of the segment supports increases with the number of segments. In fact, if one goes to the extreme case where the number of segment is the number of transactions, the estimated support will be the actual support. The actual choice of the number of segments represents a tradeoff among computer resources. If there is enough memory resources, one would like to use a larger number of segments because the pruning power provided by the segment supports is higher. W i t h fewer candidate sets, C P U costs (and related I / O costs) are reduced. On the other hand, if there is not enough memory, a large number of segment supports will consume up memory that could have been used for other purposes such as holding transaction data. This will increase the I / O costs. 74 Experimental investigation of the implication of the segment support is underway. Applications of the above idea to real life data should be an interesting future work. (4) Jumping Levels: Finally, another direction for future work is based on the observation that for each frequent set found at the end of the Apriori" " Algo1 rithm, every one of its subsets has been counted. For example, in order to find the frequent itemset {A, B,C, D}, Apriori" " needs to count support for every one 1 of the sets {A},{B}, {C},{D}, {A, B}, { A , C } , {A, D}, {B, C}, {A,B,C},{A,B,D}, {A,C,D}, and {B,C,D). {B,D},{C,D}, However, if somehow we count {A, B, C, D} before counting its subsets and realize that it is frequent, then we can a posteriori prune away all its subsets from counting because we know all its subsets would be frequent. This leads to the idea of jumping level. In other words, Apriori" " is a level-by-level algorithm and the pruning of candidate sets is based on 1 our knowledge on lower levels. We can prune away a candidate set if any of one of its proper subsets is not included in the list of frequent sets found in lower levels. However, if we have already counted some higher level to get all the frequent sets of size larger than the current size of the candidate sets, then we can also prune away those candidate sets that are subsets of some frequent sets at higher levels. This is because those candidate sets are guaranteed to be frequent and therefore can be dropped from counting. Using a scheme of jumping ahead to count higher levels can provide additional pruning power. However, there are still many issues one has to resolve. For example, counting higher levels can pay off only when there are a significant number of frequent sets to be found. If all the candidate sets turn out to be non-frequent, then no information is gained. Moreover, it is not clear which level to "jump". Intuitively, if there are still a significant number of frequent sets, one would like to jump to higher levels as the chance of payoff is higher. However, more analysis is required to make use of the above idea and to come up with an efficient algorithm. 75 Bibliography [AIS93] R. Agrawal, T. Imielinski and A . Swami, Mining association rules between sets of items in large databases, SIGMOD 93, p. 207-216. [ALSS95] R. Agrawal, K . Lin, H . S. Sawhney, K . Shim, Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, VLDB 95. [AMS97] K . A l i , S. Manganaris, and R. Srikant, Partial Classification using Association Rules, American Association for Artifical Intelligence 1997. [AS94] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, VLDB 94, p. 487-499. [AS95] R. Agrawal and R. Srikant, Mining Sequential patterns, Proc. 1995 International Conference on Data Engineering, March 1995. [AS96] R. Agrawal and J. C. Shafer, Parallel mining of association rules: Design, implementation, and experience, IEEE TKDE, 8, p. 962-969, 1996. [BMS97] S. Brin, R. Motwani, and C. Silverstein, Beyond market basket: Generalizing association rules to correlation, SIGMOD 97, p. 265-276. [BMUT97] S. Brin, R. Motwani, J . Ullman, and S. Tsur, Dynamic itemset counting and implication rules for market basket data, SIGMOD 76 97, p. 255-264. [CHNW96] D. W . Cheung, Jiawei Han, V . T. Ng, and C. Y . Wong, Maintenance of Discovered Association Rules in Large Databases: A n Incremental Updating Technique, Proceedings of the International Conference on Data Engineering, Feb. 1996. [FMMT96] T. Fukuda, Y . Morimoto, S. Morishita, and T. Tokuyama, Mining Optimized Association Rules for Numeric Attributes, PODS 96, p. 182-191. [HKK97] E . - H . Han, G . Karypis, and V . Kumar, Scalable Parallel Data Mining for Association Rules, SIGMOD 97, p. 277-288. [HF95] J . Han and Y . Fu, Discovery of multiple-level association rules from large databases, VLDB 95, p. 420-431. [IM96] T. Imielinski and H . Mannila, A database perspective on knowledge discovery, Communications of ACM, 1996, p. 58-64. [KN97] E . Knorr and Raymond T. Ng, A Unified Notion of Outliers: Properties and Computation, Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997, p. 219-222. [MAR96] M . Mehta, R. Agrawal and J . Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proc. of the Fifth International Conference on Extending Database Technology, March 1996. [Meta Group] Mining Your Own Business, Information Week cover story, March 16, 1998. [MPC96] R. Meo, G . Psaila, and S. Ceri, A new SQL-like operator for mining association rules, VLDB 96, p. 122-133. [MY97] R. J . Miller and Y . Yang, Association rules over interval data, 97, p. 452-461. 77 SIGMOD [NH94] R. T. Ng and Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining, VLDB 94, p. 144-155. [NLHP98] R. T. Ng, L . V . Lakshmanan, J . Han, and A . Pang, Exploratory Mining and Pruning Optimizations of Constrained Associations Rules, SIGMOD 98, p. 13-29. [PCY95] J . Park, M . Chen, and P. Y u , A n Effective Hash-Based Algorithm for Mining Association Rules, SIGMOD 95, p. 175-186. [Phase II] Teresa Mah, Master of Science Thesis, Computer Science, University of British Columbia, 1999. [QUEST] R. Agrawal, A . Arning, T. Bollinger, M . Mehta, J . Shafer, R. Srikant, The Quest Data Mining System, KDD 96. [SA95] R. Srikant and R. Agrawal, Mining generalized association rules, VLDB 95, p. 407-419. [SA96] R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, SIGMOD 96, p. 1-12. [SVA97] R. Srikant, Q. V u , and R. Agrawal, Mining association rules with item constraints, KDD 97, p. 67-73. [TUACMN97] D. Tsur, J . Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov, Query flocks: A generalization of association rule mining, SIGMOD 98, P. 1-12. 78
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Efficient data mining of constrained association rules
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Efficient data mining of constrained association rules Pang, Chiu Yan (Alex) 1998
pdf
Page Metadata
Item Metadata
Title | Efficient data mining of constrained association rules |
Creator |
Pang, Chiu Yan (Alex) |
Date Issued | 1998 |
Description | With the recent advances in information technology, companies are now collecting more and more data related to their business. Companies are very interested in decision support systems that can discover knowledge from data and help them gain insight into their data. Data mining with the goal of discovering non-trivial information or patterns hidden in large databases has, therefore, recently become one of the most active research areas in database technology. Association rules relate items which tend to occur together in a given event or record. Mining association rules represents one of the most important problems in data mining. However, the current framework suffers seriously from the lack of user interaction and focus. In this thesis, we propose a new paradigm called Constrained Association Rules where (i) the mining of the rules is divided into two phases with various breakpoints for user feedback, and (ii) users can associate constraints with their queries. We analyze many SQL-style constraints and introduce the notions of succinctness and anti-monotonicity for their classification. We design a new algorithm called CAP for mining association rules that satisfy a set of given constraints. The idea is to check for satisfaction of the constraints as early as possible by exploiting the properties of anti-monotonicity and succinctness of the constraints. Several optimization techniques are developed. Our experimental evaluation indicates that CAP runs much faster and can sometimes outrun several basic algorithms by as much as 80 times. |
Extent | 3271577 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-05-26 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051263 |
URI | http://hdl.handle.net/2429/8197 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1998-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1998-0571.pdf [ 3.12MB ]
- Metadata
- JSON: 831-1.0051263.json
- JSON-LD: 831-1.0051263-ld.json
- RDF/XML (Pretty): 831-1.0051263-rdf.xml
- RDF/JSON: 831-1.0051263-rdf.json
- Turtle: 831-1.0051263-turtle.txt
- N-Triples: 831-1.0051263-rdf-ntriples.txt
- Original Record: 831-1.0051263-source.json
- Full Text
- 831-1.0051263-fulltext.txt
- Citation
- 831-1.0051263.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051263/manifest