K N O W L E D G E D I S C O V E R Y O F T E M P O R A L D A T A B A S E S B y Xiaoding Y i B . Sc., Beijing Normal University M . Sc. (Mathematics). Simon Fraser University P h . D . (Mathematics), Simon Fraser University A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F M A S T E R OF S C I E N C E in T H E F A C U L T Y O F G R A D U A T E S T U D I E S C O M P U T E R S C I E N C E We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A Xiaoding Y i , 1999 © Xiaoding Y i , 2000 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of Br i t i sh Columbia, I agree tha.t the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Computer Science The University of Br i t i sh Columbia 2366 M a i n M a l l Vancouver, Canada V 6 T 1Z4 Abstract This thesis deals with a problem of mining sequential patterns from temporal databases. Data Mining is the non-trivial extraction of impl ic i t , previously unknown, and poten-tial ly useful information from data. A sequential pattern, is an ordered list (ordered by increasing event's time) of events. The main contribution of this thesis is the develop-ment of a general model of sequential patterns as well as algorithms for discovering such patterns. We investigate the generalized sequential pattern in which time isn't just part of the constraints but also a part of the discovered pattern. We have developed several al-gorithms to discover the generalized sequential patterns from a given temporal database. We have implemented four of these algorithms in C++- The results we obtained from applying these implementations to various datasets are matching our expectations. Table of Contents Abstract ii List of Tables vi List of Figures vii Acknowledgement viii Dedication x 1 Introduction 1 1.1 Databases and Data Min ing 1 1.2 Temporal Databases and Sequential Patterns 3 1.3 Motivations and Contributions 4 1.4 Thesis's Outline 6 2 Background and Related Works 8 2.1 Association Rules and Conceptual Hierachies 8 2.2 Sequential Patterns 10 i i i 2.3 Frequent Episodes 12 2.4 Summary 14 3 Problem Formulation and Examples 16 3.1 Problem Formulation 16 3.1.1 Basic Definitions 16 3.1.2 Problem Definition 22 3.2 A n example 22 4 Algorithms 27 4.1 Basic Steps 27 4.2 Basic Algor i thm for Containment Checking 28 4.2.1 Containment Checking Algor i thm CC-1 29 4.2.2 A n Example 31 4.2.3 Correctness of Algor i thm CC-1 32 4.2.4 Complexity Analysis 35 4.3 Generalized Algor i thm of Containment Checking 35 4.3.1 Algor i thm Bridge . 35 4.3.2 Algor i thm CC-2 37 4.4 Algorithms for Finding Strong Sequences 39 I V 4.4.1 Algor i thm: G S P T C 40 4.4.2 Diagonalization 44 4.4.3 M i n i m a l Sequences 49 4.4.4 Other Optimizations 52 5 Implementations and Results 59 5.1 Sample Da ta and Experimental Environment 59 5.2 Issues of Implementation . 60 5.2.1 Data Structures 60 5.2.2 Memorization 61 5.3 Performance 61 5.4 Summary 64 6 Conclusions and Future Works 65 6.1 Conclusions , 65 6.2 Future Works 66 Bibliography 69 v List of Tables 3.1 A database of telephone calls 23 3.2 A modified database 25 5.3 C P U time with support = 0.8 62 5.4 C P U time with same dataset (Data 1) 63 5.5 Times for particular methods with support = 0.7 63 5.6 Numbers of Candidates, M i n i m a l Sequences 63 vi List of Figures 2.1 A conceptual hierarchy for continents, countries and provinces 9 3.2 Conceptual hierarchies for starting/ending time, events and length . . . . 24 4.3 Level-Major Process 44 4.4 Length-Major Process 46 4.5 Diagonalization Process 47 4.6 M i n i m a l Sequences and Forced Join 51 4.7 M i n i m a l Sequences Process 51 vi i Acknowledgement I am indebted to Dr . George Tsiknis for his help and guidance throught the course of my study here. When 1 decided to pursue a degree at the University of Br i t i sh Columbia , his encouragement, help and advices were grateful. His comments concerning the drafts of this work and implementation of algorithms were invaluable. His time which was spent on this thesis is greatly appreciated. I would like to express my appreciation for the guidance which I have received from Dr. Raymond Ng. His comments concerning the algorithms and their implementation were invaluable. I would like to express my appreciation for the help and encouragements which I received from Drs. Jack Snoeyink and Son Vuong when 1 decided to pursue a degree at the U B C . Thanks also to M s Joyce Poon for her wonderful service. I would like to express my appreciation for the University of Br i t i sh Columbia and the Department of Computer Science for financial support. You wi l l be grateful for your investment. Before beginning my study here, I spent a number of years as a Mathematican. The transition to Computer Science, while difficult at times, has opened up a new window of opportunity for me. 1 would like to thank some of my colleagues for their encouragement and help. Special thank goes to Drs. Barry Cooper, Al is ta i r Lachlan, Manuel Lerman and Steffen Lempp. Thanks also go to many of friends to make my life here fun and interesting. v i i i Final ly, I would like to thank my wife, Xiaofan, and my daughters, Cissy and Li l ly , for their unconditional encourgement and support. Without them, I might not be here! I would also like to thank my mother and my parent-in-law for their encourgement and support. Dedication To Xiaofan, Cissy and Li l ly x Chapter 1 Introduction 1.1 Databases and Data Min ing A t its core, a database system is a computerized record-keeping system; it stores and provides access to information. Reduced to its basic components, a database system con-sists of data, hardware, and software. The last few years have witnessed an unparalleled movement toward data of increasing complexity. The simple business-data-processing information consisting of numbers and character strings, while it is s t i l l an important part of the commercial databases, it has been expanded to include large numbers of mult imedia "document" images, time-series and other complex data forms. Addi t ion-ally, low-cost, high-speed hardware components, such as multiprocessors based on fast and inexpensive microprocessors, have become widely available. Computer hard disks and memories show increased capacity and reduced cost each year. Many organizations now have extremely big databases that encapsulate enormous amounts of data related to various aspects of their business. These databases are information mines that could be exploited in order to improve certain decisions made within the organizations. Not only is the computing infrastructure changing, but also the user community is un-dergoing a similar revolution. Nearly every enterprise now includes computerized infor-mation processing as an integral component of its operation. The Internet and W W W is growing at an astronomical rate. Databases and database technology wil l play a crit ical 1 Chapter 1. Introduction 2 role in this information explosion. W i t h the rapid growth in size and number of available databases used in commerical, industrial, administrative and other applications, it is necessary and interesting to ex-amine how to extract knowledge automatically from huge amount of data. Traditionally database systems were used to support business data processing applications, and much Database Management System ( D B M S ) research was focused in this direction. However, an important new use for the technology have recently emerged-which is known as "data mining". There is today a great deal of enthusiasm for data mining, the extraction of (hidden) information from large bodies of data often accumulated for other purposes. Data mining (sometimes called "knowledge discovery in databases") is defined as the non-trivial extraction of implici t , previously unknown, and potentially useful informa-tion from data. Through the extraction of knowledge, large databases wi l l serve as a rich, reliable source for knowledge generation and verification, in information manage-ment, query processing, decision making, process control and many other applications. Therefore, data mining has been considered as one of the most important research topics in databases by many database researchers [11], [12]. Data mining usually involves the identification of patterns or relationships in data, orig-inally collected for a different purpose than data mining. Da ta mining queries pose some unusual problems: they tend to involve aggregations of huge amounts of data, and, often, cannot be precisely formulate by the user. Instend, the user actual question is more like "F ind me something interesting" [12]. There are several different applications of data mining: association rule, classification and sequential patterns. In [1], the data mining problem of association rules was presented by Agrawal, Imielinski Chapter 1. Introduction 3 and Swami. Association relationships or association rules represent one of the most popular patterns pursued in data mining processes. The basic idea of an association rule is to capture the sets of items that tend to appear together. A n example is the rule "70% of the people who buy milk also buy egg". The 70% here is the confidence of this rule. Another measurement of an association rule is support. The support of the above rule is the number of people who buy both milk and egg together. Given a set of transactions P , the process of assoication rules mining is to generate all association rules that have support and confidence greater than the user-specified min imum support and min imum confidence respectively. Sequential patterns are discussed in [4] [5] [8] [9]. The basic idea of a sequential pattern is to capture the events that tend to appear in the order of time. There are several meansurements of whether a given sequence of events is a sequential pattern. In ad-dition to confidence and support we ha.ve discussed above, one can use the maximum and min imum gaps to constraint the gaps between two adjacent events within the given sequence. 1.2 Temporal Databases and Sequential Patterns Time is an important aspect of all real-world phenomena. A n y information having a time component can be represented in a general way in a' temporal database. Conventional databases model an enterprise that changes dynamically by a snapshot at a particular-point time. Sequences of events are an important common form of data that arises in several contexts, including supermarket sales data, telecommunications and operating systems, etc, that can contain important knowledge to be discovered. The first data mining problem of discovering sequential patterns, was introduced by Chapter 1. Introduction 4 Agrawel and Srikant in [4]. The input data is a large database of customer transactions, where each transaction consists of customer-id, transaction-time and the items bought in the transcation. A sequential pattern is an ordered list (ordered by increasing transaction-time) of sets of items bought by a customer in some transcations. For instance the list "95% of customers who bought ' P C Software' came backs to buy 'Laser Printers '" is an example of a sequential pattern. This pattern can be represented as ( ' P C Software', 'Laser Printers ') . Customers who bought other items in between P C Software and Laser Printers also support this sequential pattern. Elements of a sequential pattern need not be simple items. The problem is to find all sequential patterns that have a user-specified min imum support. This problem was ini t ial ly motivated by applications in the retailing industry, including mai l marketing, add-on sales, and customer satisfaction. The results apply to many scientific and bussiness domains; for instances, medical diagnosis, D N A research, retail marketing, Internet/system usage analysis and financial forecast. 1.3 Motivations and Contributions The general Sequential Pattern Problem can be stated as following: Given a sequential database, find all relations that exist in the database. Various serial algorithms have been proposed so far to mine sequential patterns. In [4], this problem was introduced and discussed without considering the time constraints. Three algorithms, A p r i o r i A l l , AprioriSome and DynamicSome, for solving this problem were also presented. A similar problem of discovering frequent episodes in a sequence of events was presented in [8]. A n episode can be view as a collection of events. There are two relation between events: sequential (for event occurs after another) or parallel (two events occur at the Chapter 1. Introduction 5 same time). More precisely, their episodes are arbitrary directed acyclic graphs ( D A G ) . Given a fixed time-window length and an episode, they studied the number of windows occurs whthin another episode which contains the given episode. A window can be understood as an collection of events which are in an interval of time. In [9], this model was generated for using different windows for body and head. [5] presents a generalization of [4] with the consideration of time constraints and a rigid definition of transaction and taxonomies. More details of these works wil l be presented in Chapter 2. However, we would like to indicate that the sequential patterns which are discussed in the above papers have the following limitations: 1. The time isn't one of the attributes involved in the sequential patterns. 2. The t ime window is scanned throughout the entirely database. The particular t ime period were not addressed within patterns. For example, one may be interested in the sequential patterns on Sundays' information of a telecommunication database. 3. The length of time an event occurs was not considered. This is very crit ical in many cases of application, i.e., telecommunication, W W W , etc. 4. In both [4], [5], a transaction-type sequences of events were considered. In general, the events may occur in very complex manner. For example, the processes in the operating system can occur concurrently. In this thesis, we generalize the problem definition given in [4] and [5] to incorporate sequences that form arbitrary directed acyclic graphs, event times that are described by an arbitrary time-tree and an arbitrary length conceptual tree. We wi l l consider events with length, and therefore, events can occur concurrently or sequentially. A conceptual tree is a hierachy of arbitrary events, which we wi l l discuss in the next chapter. One Chapter 1. Introduction 6 should notice that one of the assumptions in [8]'is that every episode can be seen as a recursive combination of parallel and serial episodes, but this is true only if we use duplication of nodes (events). B y using starting times and lengths of events we can avoid this duplication when we define the notion of "subsequence". We can more efficiently recognize whether a given event sequence is a subsequence of another; a check that is very frequently performed in this type of problems. Furthermore, our formualism is more general and can describe phenomena which can't be described with the previous charac-terizations. For example, one may think that event sequence {(e, NovdO, 1997,1 hour)} isn't a subsequence of {(e, Dec.7,1997,2 hours)} since there is nothing in common among the intervals of the events. Bu t it is hardly distinguishable without the notion of starting times and the lengths of the events since both sequences include the same basic event e. 1.4 Thesis's Outline The remaining chapters of this thesis are orginized as follows: Chapter 2 presents a brief survey of related works. It reviews the various formulations of data mining problems on temporal databases, and the algorithms and methodologies to solve these problems. It also discusses the difference between these problems. Chapter 3 explains the basic concepts of data mining and sequential patterns. It defines the problem of Min ing Se-quential Patterns which the thesis is focusing on and shows some examples. In Chapter 4, the main algorithm is presented in great detail. The framework of how all algorithms are working together is discussed in Section 4.1. Both basic and generalized algorithms for containment checking are discussed in Section 4.2 and 4.3 respectively. The main algorithm and its optimizations for solving the problem of Min ing Sequential Patterns is presented in Section 4.4. Chapter 5 describes some of the issues related to the imple-mentions of the main algorithm and its optimizations. Final ly, chapter 6 presents the Chapter 1. Introduction conclusion and suggestions for future works. Chapter 2 Background and Related Works As we have already mentioned in the last chapter, the problem of sequential patterns was introduced by Agrawal and Srikant in [4]. A similar problem of discovering frequent episodes in a sequence of events was presented in [8]. In the current chapter, we briefly summarize their works and other related works and concepts we wi l l use in this thesis. First , we briefly describe an easier data mining problem, that deals with association rules, and conceptual hierachies. 2.1 Association Rules and Conceptual Hierachies One of the earlier data mining problems that is concerned with the discovery of as-sociation rules, was first introduced by Agrawal, Imielinski and Swami [1], and ex-tensively studied by many researchers (see [2], [3], [7]). Association rule mining is the discovery of associations among objects. A n association rule is in the form of uAi A ••• A 4 i —> B\ A • • • A Bmv which means objects Bi, - • •, Bm tend to appear frequently together with objects A\, • • • , An in the target data. For instance, the discov-ery that the rule {bread, milk} —>- {butter} is frequent means that it is very likely that customers purchased both bread and mijk, bought butter in a same transaction. Earlier works mainly focused on mining association rules at a single "conceptual level", 8 Chapter 2. Background and Related Works 9 which means that no event is a part of another event. A conceptual hierachy is a finite collection of events such that one event could be part of another one. A n example of such hierachy is the relation among Vancouver, B . C . and Canada. A conceptual hierachy is represented by a finite directed graph, and in most cases, a finite tree. A n example of conceptual tree is the tree that represents the relations among continents, countries, and provinces. Figure 2.1: A conceptual hierarchy for continents, countries and provinces There are applications which need to find associations at mult iple conceptual levels." For example, besides finding that 75% of customers that make a call to Ch ina may -also call Canada, it could be informative to also show that 70% of customers that call Beijing may also call Bri t ish Columbia. The association relationship in the latter statement is expressed at the lower conceptual level and often carries more specific and concrete Chapter 2. Background and Related Works 10 information than that in the former. The necessity for mining multiple-level association rules or for using taxonomy information to aid the mining has been observed by Agrawal and Srikant [2],[3], and Han and Fu [7]. 2.2 Sequential Patterns In [4], Agrawal and Srikant developed a framework to analize a temporal database. Their goal is to find something interesting and similar to the association rules but in database of events which are ordered by the time they occur. They presented three algorithms called AprioriAll, AprioriSome and DynamicSorne for mining sequential patterns in a large database of customer transactions. Each transaction consists of customer-id, transaction-time and the items purchased in the transcation. No customer has more than one transaction with the same transaction-time. A n itemset is a non-empty set of items, and a sequence is an ordered list of itemsets. A sequence (a\a2 • • • an) is contained in another sequence (b\b2---bm) if theer exist integers Zi < %i < • • • < in such that a\ C bi1,a2 C fe,-2, - • • , a n C fe,-n. A customer supports a sequence s i f s is contained in the customer-sequence for this customer. The support for a sequence is defined as the fraction of the total customers who support this sequence. The problem of Min ing Sequential Patterns is stated as following: Given a database "D'of customer transactions, the problem of mining sequential patterns is to find the maxima] sequences among all sequences that have a certain user-specified min imum support. Each such maximal sequence represents a sequential pattern. The algorithms to solve this problem have five phases: 1. Sort Phase. In this phase, the database V is sorted, with customer-id as the major Chapter 2. Background -and Related Works 11 key and transaction-time as the minor key. 2. Litemset Phase. In this phase, all large itemsets (the itemsets which have at least minimal support). 3. Transformation Phase. This phase will filter the database D with the large itemsets we discovered during the Phase 2. This phase is an optional part of the algorithms. 4. Sequence Phase. Use the set of large itemsets to find the desired sequences. The fundamental fact used during this phase is that subsequences of a large sequence are also large. This is the key phase of these algorithms, and the three algorithms differ at this phase. These three algorithms ma,ke multiple passes over the database. In each pass, they use a seed set to generate candidate sequences, and count support for each candidate sequence. A t the end of the pass, it is determined which of the candidate sequences are actually large. These large candidate sequences become the seed for the next pass. 5. M a x i m a l Phase. F ind the maximal sequences among the set of large sequences. In [5], Agramal and Srikant extended their framework of the problem of sequential pat-terns as follows. First , they added time constraints that specify a min imum and/or max imum time period between adjacent elements in a pattern. Second, they relaxed the restriction to accept sequential pattern items that come from the same transac-tions whose transaction-time are within a user-specified time window. T h i r d , given a user-defined conceptual tree (or taxonomy) on items, they allow sequential patterns to include items across all levels of the taxonomy. A n algorithm, G S P , is developed to solve this generalized problem. Chapter 2. Background and Related Works 12 2.3 Frequent Episodes In [8] and its revised version [10], Manni la , Toivonen, and Verkamo considered the prob-lem of discovering frequently occuring episodes in sequences. Roughly speaking, an episode is defined to be a collection of events that occur relatively close to each other in a given partial order. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. In the following we present more details of their approach. Given a set E of event types, an event is a pair of (A,~t), where A (E E is an event type and t is an integer, the (occurrence) time of the event. A n event sequence s on E is a triple (s,Ts,Te), where s = {{Al,tl),{A2,t2),--- ,{An, /,„)) is a sequence of events such that Ts < < < Te for all i < n. Ts and Te represent the starting and ending time of the period within which all episodes occur. To find all interesting frequent episodes from a class of episodes, they defined a notion of window as a slice of an event sequence. Formally, a window on an event sequence s = (s,Ts,Te) is an event sequence w = (w,T's,T'e), where Ts < T's,Te > f/'e', and w consists of those pairs (A, t) from s where T's <t< T'e. The time span te — ts is called the width of the window. A n episode is a directed acyslic graph on the event types E. Given a window width win, the frequnecy of an episode is the fraction of windows with width win in which the episode occurs. The problem of frequent episodes Manni la , Toivonen and Verkamo considered is: Given an input sequence of events, a class of episodes, a window width, and a frequency thresh-old, find all episodes of the class that occur frequently enough. Chapter 2. Background and Related Works 13 There are three key ideas which are used in their algorithms: • A l l subepisodes of an episode are at least as frequent as the episode itself. • Episodes can be recognized efficiently by "sliding" a window on the input sequence. Typical ly two adjacent windows have a lot of overlap. • Every episode can be seen as a recursive combination of parallel and serial episodes. The algorithms consist of the following steps: 1. Generation of candidate episodes: They start with the set of a l l frequent episodes of size 1 and generate all frequent episodes. The crucial point in the candidate generation is the following: If an episode a is frequent in an event sequence s, then al l subepisodes of a are frequent. B y using this crucial fact, the collection of candidates is specified to consist of episodes such that all smaller subepisodes are frequent. This criterion safely prunes from consideration episodes that can not be frequent. 2. Recongnizing episodes in sequences: In this phase, the algorithms find out al l fre-quent episodes of these candidate episodes which are discovered in step 1. Two adjacent windows W ; , W t +i are typically very similar to each other. After recognizing episodes in W 2 , They make incremental updates in their data structures to achieve the shift of the window to obtain W ; + ] . The algorithms start by considering the empty window just before the input sequence. There are two typical episodes which were dealt with different methods. • Parallel episodes: For each candidate parallel episode a a counter a. events count is defined that indicates how many events of a are present in the window. When a.event-count equals to a is included in the window. Chapter 2. Background and Related Works 14 • Serial episodes: Serial candidate episodes are reconized in an event sequence by using state automata that accept the candidate episodes and ignore all other inputs. The idea is that there is an automaton for each serial episode a,and that there can be several instances of each automaton at the same time, so that the active states reflect the disjoint prefixes of a occuring in the window. A generalized version of the formulation of the problem and the algorithms to discover the frequent episodes is presented in [9]. 2.4 Summary As we mentioned in the last chapter, the sequential pattern analysis which is discussed in papers [4], [5], [8] and [9], has the following limitations: 1. The time isn't one of the attributes involved in the sequential patterns. 2. The time window is scanned throughout the entirely database. Patterns for sub-databases were not addressed. For example, one may interest the sequential pat-terns that involve only Sundays' information of a telecommunication database. 3. The length of t ime an event occurs was not addressed. Event length is a very crit ical attribute in many application, like, telecommunication, W W W , etc. 4. In both [4], [5], a transaction-type sequences of events were considered. In general, (e.g., the processes in the operating system), the events may occur in a very complex manner. In the next chapter, we will formulate a problem of mining sequenetial patterns which wi l l be a generalization of the frameworks in [4] and [5]. In our definition of mining sequential Chapter 2. Background and Related Works 15 patterns, we wi l l consider sequences of (generalized) events with four components: event type, starting time, ending time and length of the event. Chapter 3 Problem Formulation and Examples In the present chapter, we formulate the problem of mining sequential patterns we focus in this thesis. As the reader may have already noticed, the problem of discovering frequent episodes in a sequence of events was presented in [8] and [9]. Their patterns are directed acyclic graphs, where each vertex corresponds to a single event and an edge from event A to event B denotes that A occured before B. Their algorithm is designed to work with a single sequence of events, while we are interested in finding patterns that occur in many different data-sequences, which is similar to the frameworks of [4] and [5]. Therefore, our definitions and framework are more related to [4] and [5]. 3.1 Problem Formulation Before formulating the problem of sequential patterns, we introduce some notations and definitions for primary and generalized events, conceptual trees of events, t ime and length,subsequence, containment, support and confidence. 3.1.1 Basic Definitions Definition 3.1.1 1. Let £ be a class of event types, also called basic events. Let Q be a directed acyclic graph on the basic events. An edge in Q represents an is-a 16 Chapter 3. Problem Formulation and Examples 17 relationship, and Q represents a set of taxonomies. If there is an edge in Q from p to c, we call p a parent of c and c a child of p. We call y an ancestor of x (and x a descendant of y, denoted as x<y) if there is a (directed) path from y to x. 2. A primary event is a tuple (A,s,~/,,/) where A C £, s,t,l are natural numbers, and I < t — s. The standard interpretation is that s, t are the starting time and closing time, respectively and I is the length of the event. An event sequence is just a well-ordered (on starting time and then on ending time) sequence of events. Examples of Pr imary Events and Sequences. A typcial example of primary events in the telecommunication industry are the events of phone calls. For example, ((604)222 — 1111,1997 : Nov : 30 : 13 : 00,1997 : Nov : 30 : 14 : 02,62) represents that a pri-mary event of call to a phone number (604) 222-1111 at 1997:Nov:30:13:00 which last 62 minutes. If we start with a fixed piont, say midnight of Nov. 29, 1997, and repre-sents any time by the number of minutes of the time after that fixed point. We can represent 1997:Nov:30:13:00 as 2765. Therefore, the above event can be re-written as ((604)222-1111,2765,2827,62). A n example of sequence of primary events is: (((604)222-1111, 2765,2827, 62), ((604)222-2222, 2770, 2800, 30)). But the following sequence (((604)222-2222,2770,2800, 30), ((604)222-1111, 2765,2827,62)) isn't a sequential pattern since it isn't ordered by starting time. Time tree or hierarchy. We are given a database T> of events. In order to analyze the sequential patterns which are related to the time stamps, we need to completely understand the role time is playing. To discover the generalized sequential patterns, we wi l l treat time as an attribute. The time is a special attribute since there is a hierarchy of event times as well as a hierarchy of the gap between events. In the databases of our examples, each time stamp is a positive integer number which can be understood as Chapter 3. Problem Formulation and Examples 18 the number of minutes from the midnight of Nov. 29, 1997. In practice, one may be interested in the sequential patterns which are related to user-defined periods of time. A telecommunication company may be interested in the sequential patterns of a particular period such that events with certain length. For example, the company may like to compare calls with at least 1 hour on Sundays, with calls wi th at least half hour on weekdays or Saturdays before making a decision on a promotion campain. In this case we want to generalize the time stamps to which only values like Monday, • • •, Sunday are used. A time/length tree is an acyclic graph K, of sets of integers (one may use some meaningful words to denote them) such that if A is a parent of B then B C A and the collection of all leaves { T 1 ? • • • , Tk} is a finite disjoint cover of the set of natural numbers A^, i.e., Ti fl Tj = 0 (i j) and r]\ U • • .• U Tk = N (the set of natural, numbers). Here are some examples of time-trees. 1. K = {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, AM, PM, Sunday AM, Sunday PM, •••, Saturday AM, Saturday PM}. 2- K = ( P H 6), [^(mod 3)|n e N}. From now on, in our presentation we wi l l use a fixed time tree K. (we wi l l use the same tree for both start and end time) and a length tree C. The following definitions formalize the generalized notions of event and sequence. Definition 3.1.2 1. S -<T if either Given two notes S, T in a time tree K we write SeT,orSCT. Chapter 3. Problem Formulation and Examples 19 2. An (generalized) event has the form (A,S,T,L) such that S,T are in K and L is in C. Sometimes, we may use an integer n to denote the set [0,n] in C. 3. We say that (A, S, T, L) generalizes (a, 5 , t, k), (we write it as (a, 5 , t, k) ^ ( A , S, T, L)), ifa<A,s^S,t-<T,k^Land{A,S,T,L)^(a,s,t,k). • 4- A generalized event sequence is a sequence of generalized events ( ( A l 5 S i , T\, / i ) , 5 ( A n , S n , 2 n , liif)• A concept which is colsely related to that of a conceptual hierarchy is the notion of level. It is formalized by the following definition. Definition 3.1.3 1. Given a basic event e in a conceptual tree, the level of e, level(e), is the number of ancestors of e, including the root. 2. Given a generalized, event (A , S,J.',l), the level of (A.,S,T,l), lev el ((A, S,T,l)), is defined as max{level(A),level(S),level(T), lev el (I)}. 3. Given a sequence S, the level of S, level(S), is the maximal level of each individual event of S. From now on we adopt the generalized version of events and sequences. For example, in the telecommunication domain it is valid to talk about an event like (Vancouver, Sunday, Monday, 1 Also, we can write (A , S, S, I) as (A , S, /). Subsequence. We define the notion of subsequence as following. Given two (general-ized) event sequences A= {{Au Si, 7\, h), • • •, (An, Sn, Tn, ln)} and B = {(Bu S'1,T{,1'1), • • •, (Bm, S'm, T'm, l'm)}. B is a subsequence of A if there are i\ < • • • < im such that Chapter 3. Problem Formulation and Examples 20 Bk = A{k,S'k = Sik,Tl. = Tik and l'k = Uk for all k < m. One should notice that from the definition it may be true that for some events E\ and E2, none one is a subsequence of another but the generalized event of one is a subsequence of the generalized event of another. Further, we can slightly change our definition according to the goal of the data mining process. For example, we may define a weaker notion of 'subsequence' as same as above except that l'k C lik for all k. Containment. We are often interested in sequences of events that occur wi thin certain time l imits from each other. These limits are usually defined by specifying a minimum (mm) and maximum (max) gap that should be satisfied by adjacent events in a sequence. min is usually defined by a non-negative integer and max is usually defined by a posi-tive integer. When min and max gaps are used, the containment relation between two sequences is defined as following. Definition 3.1.4 1. A non-generalized event sequence S = ((Ai, si, t i , / i ) , • • • , (An, sn, t. contains a non-generalized event sequence B if (a) B = ((A^, , i , - , , ), • • •, (Aim, 5 , m , tim, lirn)); (b) min < — ^ . < max and ij < for all j = 1, • • • , m. 2. A non-generalized event sequence S contains a generalized event sequence B if B is an event sequence generalized from a sequence which is contained by S. Support. We are given a database V of event sequences. The support for a sequence S is defined as the total number of users whose data (raw data, non-generalized) sequence in V contains S. The database of event sequences is a collection of event sequences. Each event consists of customer id , basic event, starting and ending time of the event. We wil l Chapter 3. Problem Formulation and Examples 21 assume that the event sequences are grouped by customer id . The event sequence for each customer are ordered by starting time and then by ending time. M i n i m u m Support. To find relatively frequently occurring patterns and reasonably strong rules, a user may specify a threshold: minimum support. Notice that for finding multiple-level/tree sequential patterns, different minimum support can be specified at different levels/items. Given a time-tree fC, a length tree C, a taxonomy Q for events, a minimum support function f is a function that maps a, level of a conceptual hierarchy to a natural number. Since the level is also represented by a number between 1 and n , the maximal level in these conceptual trees, this function maps {1, 2, • • •, n} to natural numbers. A sequence of generalized events A is large if the support of A is no less than its cor-responding minimum support threshold f(level(A)). It is strong or frequent if, each ancestor of A, including A itself, is large. Confidence. The confidence of a pattern S\ S'2 is defined as the ratio of the support of S1S2 to the support of Si. M i n i m u m Confidence. To find reasonably strong rules, a user may specify a threshold: minimum confidence. For a multiple level conceptual trees, we can also specify a min imum confidence function g similar to the minimum spport function / . A pattern Si => S'2 is a sequential pattern i f SiS2 is strong and confidence of Si => S'2 excels the minimal threshold of confidence. Chapter 3. Problem Formulation and Examples 22 3.1.2 Problem Definition Problem Definition. Given a database V of sequences, a time-tree 1C, a length tree C, a taxonomy Q for events, user-specified min-gap and max-gap time constraints, a user-specified min imum support function, and a, user-specified min imum confidence function, the problem of mining sequential patterns is to find all sequential patterns in V (with respect to the time-tree and length tree). It turns out that the real difficulty is to find all strong sequences. Once this is done, the rest part of the process is pretty straightforward. Therefore, in some papers (see [4], [3]), the problem of sequential patterns is to find all frequent (or strong) sequences. 3.2 A n example In the current section we discuss an example of the problem we presented in the last sec-tion. The example uses a sample record of customers' phone calls of a telecommunication company. A typical record of a database of this kind consists of the following attributes: the telephone number (or id) of the caller, the area code of the receiver, time and length of the call. Table 3.1 shows a portion of the database containing only three customers. Three conceptual trees are used with the database of this example: a conceptual tree for the area codes, a conceptual tree for the starting and ending times, and a conceptual tree for the length of the events. The event conceptual tree is a hierachy of areas where receivers are located and is displayed in figure 3.2 (a). For example, both B . C . and Ontario are "children" of the North Amer ica node. B . C . has two area codes: 250 and 604. If a receiver's area code is 604, we know that it is part of B . C . , and then part of North America . The other two conceptual trees are represented in the same fashion and Chapter 3. Problem Formulation and Examples 23 Callers ID Area called Time Length (minutes) 1 250 Dec 1, 1997, 23:45 75 1 212 Dec 7, 1997, 15:00 5 1 416 Dec 9, 1997, 13:45 60 2 212 Nov 30, 1997, 13:00 6 2 250 Dec 1, 1997, 10:00 62 2 416 Dec 2, 1997, 15:00 70 3 604 Dec 1, 1997, 23:30 65 3 718 Dec 7, 1997, 15:00 6 3 604 Dec 10, 1997, 23:30 61 Table 3.1: A database of telephone calls are displayed in figures 3.2(b) and (c). In Figure 3.2 (c), each node represents a period of time. For example, O H and N O H represent office hours and non-office hour respectively. In Figure 3.2 (b), each node represents a length of time. For example, (0,10) represents the length (of event) which falls into the interval (0,10), i.e., less than 10 seconds. Theoreticly, we can assume that the starting time at which an event occurs is an integer number which can be interpreted as the number of minutes after a fixed point. Here we assume that the zero point is Nov 30th, 1997, 00:00, which is Sunday. Using that encoding, the first record of caller 1 can be written as (250, 2765,2840, 75) since December 1, 1997, 23:45 is 2765 minutes after midnight of Nov. 29, 1997. To make the example more readable, we also use normal representation with year, month, day and time. A task often performed by the discovery algorithm is the generation of primary events using the above conceptual trees. For example, the first record is a call to area code 250, starting on Monday during non-office hours, ending on Tusday during non-office hours and the length is between 60 (including) and 90 (excluding) minutes. The starting time is 2765 minutes after the zero point and end at 2840th minute. Therefore, the first record .-can be generalized as (1,250, Mon/NOH, Tus/NOH, [60,90), 2765,2840) (the min imal Chapter 3. Problem Formulation and Examples 25 Callers ID Area called Start End Length Start Time End Time 1 250 M o n / N O H T u s / N O H [60, 90) 2765 2840 1 212 Sun Sun [3,10) 10980 10985 1 416 T u s / O H T u s / O H [60,90) 13785 13845 2 212 Sun Sun [3,10) 780 786 2 250 M o n / O H M o n / O H [60,90) 2040 2102 2 416 T u s / O H T u s / O H [60, 90) 3780 3850 3 604 M o n / N O H T u s / N O H [60,90) 2750 2815 3 718 Sun Sun' [3,10) 10980 10986 3 604 W e d / N O H T h u / N O H [60, 90) 15810 15871 Table 3.2: A modified database generalized record) as well as (1, B.C., Mon, Tus, 60+, 2765, 2840), etc. Here we include the starting and ending times in the records. They represent two generalized events: (2bO,Mon/NOH,Tus/NOH,[<oO,90)) and (B.C., Mon,Tus,60+). Table 3.2 shows the minimal generalized events of Table 3.1. The notion of subsequence is t r iv ia l . For example, ((B.C., Mon, Tus, 60+), (416, Tus, Tus, 60+)) is a subsequence of ((B.C., Mon, Tus, 60+), (New York, Sun, Sun, (0,10)), (416, Tus, Tus, 60+)). In the present example, we want to discover all the frequent sequential patterns. We as-sume that the minimal support is 3, the maximum gap is 7 days (which is 10080 minutes) and there is no minimum gap constraint (i.e., the min imum gap is 0). B y the definition in the last section, we know that ((250, 2765,2840, 75), (212,10980,10985,5), (416,13785,13845,60)) contains {(B.C., Mon, Tus, 60+), (New York, Sun, Sun, (0,10)), (416, Tus, Tus, 60+)) but doesn't contain ((B.C., Mon, Tus, 60+), (416, Tus, Tus, 60+)) because the gap between the primary events (250,2765,2840,75) and (416,13785,13845,60) is 13785 - 2840 = 10945 which is bigger than 10080 minutes. Therefore, the generalized event ((B.C., Mon, Tus, 60+), (New York, Sun, Sun, (0,10)), (416, Tus, Tus, 60+)) is supported by caller 1. B y the same token, ((B.C., Mon, Tus, 60+), (New York, Sun, Sun, (0,10)), (604, Wed, Thus, 60+)) Chapter 3. Problem Formulation and Examples 26 is supported by caller 3. Therefore, ((B.C., Mori, Tus, 60+), (New York, Sun, Sun, (0,10)), (NorthAmerica, Weekdays, Weekdays, 60+)) is supported by both callers 1 &; 3. The same sequence isn't supported by caller 2, and therefore it isn't frequent since the min-imal support is 3. On the other hand, the sequence ((New York, Sun, Sun, (0,10)), (NorthAmerica, Weekdays, Weekdays, 60+)) is supported by al l three callers, and there-fore it is a frequent (strong) sequence. It is clear that the confidence of pattern (New York, Sun, Sun, (0,10)) =r- (NorthAm,erica,Weekdays, Weekdays, 60+) is 100% and there-fore it is a sequential pattern. In the next chapter, we wil l discuss how to find all frequent sequences efficiently. Chapter 4 Algorithms In the present chapter, we present several algorithms which are used to solve the problem of mining sequential patterns we ha,ve formulated in the previous chapter. In section 4.1, we first lay out the basic steps of these algorithms. The main difference among these algorithms wil l be the procedure to find all strong or frequent sequences. W i t h i n this procedure, a core element, containment checking, wi l l be common among these algorithms, but the methods for containment checking wi l l be different. In section 4.2, we develop a simple case to show key ideas on how we process the containment checking. In section 4.3, we give a more complicate version of the containment checking algorithm, which wi l l be used in our algorithms. In section 4.4, we present various algorithms and optimizations, for finding strong/frequent sequences. 4.1 Basic Steps Given a collection of data to be analyzed, a time-tree /C, a length tree £ , an event taxon-omy Q, user-specified min-ga,p and max-gap time constraints, a user-specified min imum support function, and a user-specified minimum confident function, the process to find all sequential patterns contains two steps: 1. F inding all sequences which are frequent enough. A sequence s is frequent enough 27 Chapter 4. Algorithms 28 if the number of customers whose data contain the sequence exceeds a threshold of support. This number is denoted as support(s) for future reference. We use a containment checking algorithm to find whether a sequence of data contains a sequence of events. The process to find all such sequences which is developed in section 4.4, wi l l vary depending on several considerations, including the number of database scans, memory size and C P U time. For example, in [2] and [3], Agrawal and Srikant presented some basic algorithms for finding al l frequent association rules. We ' l l bui ld a similar framework for event sequences in this chapter. In some cases our algorithms wil l be more general and better tuned than these prosented in [2] and [3]. 2. For each frequent sequence, we should check whether its confidence exceeds a min-imal threshold. The confidence of a sequence is defined as conditional probability. Here is a straighforward algorithm for this task (presented in [2] for the case of association rules). For every sequence ( a l 5 • • • , a n , a n + i , • • • , a m ) , the confidence of pattern ( a 1 ? •••,an) =4> ( a n + 1 , • • • , am) is the ratio of support((a\, • • • , am)) to support((ai, • • •, an)). If (a i , • • • , an, a „ + 1 , • • •, am) is strong and its confidence ex-ceeds the minimal threshold, it is a sequential pattern. In the rest of the chapter, we focus on the first part of this process. 4.2 Basic Algori thm for Containment Checking Before describing complete algorithms for solving the problem of Sequential Patterns described above, we first discuss several algorithms for containment checking. As we wi l l see in the next section, we wil l apply these algorithms to find whether a given sequence of events is supported by a customer's data. This is the same as checking whether a Chapter 4. Algorithms 29 given sequence is contained in the sequence of events for that to a customer. First we wi l l describe a simple version of the containment algorithm (CC — 1) which demonstrates the key ideas used in these algorithms. 4.2.1 Containment Checking Algori thm C C - 1 ^ We now describe the simplest algorithm of containment checking (CC-1) . Given two sequences X and Y, the algorithm wi l l determine whether Y contains X. This algorithm can be applied to any sequences with single-point events. Here we assume that X, Y are arrays of letters. The position of a letter represents the time of that letter (event) occurs. Input: Sequences X and Y; M i n i m u m and M a x i m u m gaps, min and max. We wi l l use Xi to,denote X[0] • • • X[i], the prefix of X with length i + 1. Output: 0 or 1. 1 represents that Y contains X, i.e., there are i0 < • • • < i\x\ such that X[k] — Y[ik] for all k and min < ik+i — ik < max for all k < \X\. 1. / * For each letter a in X, create a list named L[a]. L[a] is a sorted list of indexes j such that Y[j] = a. */ 1.1. For (i = 0; i < \Y\; i + +) { 1.1.1. If (Y[i] is in X) then, add i to L[Y[i]]\ } Chapter 4. Algorithms 30 /* For each prefix X{ of X, we create a list named E[Xi]. E[Xi] is a sorted list of indexes of X[i] in Y which witness that Xt is contained in Y, i.e., if m is in E[X{] meanst thai the substring of Y that ends at position m contains X{ and Y[m] is the last character of Xi: E[Xi] = {j\(3k0 < ••• < ki){\/u)[ki = jkX[u] = Y[ku}kmin < ku+1 - ku < max}}. The initial value of E[Xi] is the empty list */ 1.2. E[X0] = L[X[0}); 2. For (i = 1; i < \X\; % + +) { Merge two sorted arrays as a new sorted array MS{. If kth element is in E, flag(k) =e, otherwise flag(k)=l 2.1. MSi = Merge-SorliEiX^J^Xii}}); 2.2. For (k = 0;k < \MSz\;k + +) { 2.2.1. If (flag (k)=l) skip to next k; 2.2.2. If (flag (k)=e) { 2.2.2.1. For (u = k + 1; u < k + max; u ++) { 2.2.2.1.1. If (flag(u) = l)k{MS^k] + min < MS',[«] < MSi[k) + max) 2.2.2.1.1.1. A d d MSi[u] into E[Xt]; } / * end IF * / } / * end For-u * / Chapter 4. Algorithms 31 } / * end IF */ } / * end F o r - k * / 2.3. If E[Xi] == 0, return 0; Y doesn't contain X satisfying the gap-constraints. 2.4. If (i == \X\), return 1; X is a subsequence ofY satisfying the gap-constraints. } / * end For-i * / 4.2.2 A n Example We now apply the algorithm above for the following example. Suppose the sequence X is (ababd) and the sequence Y — (cbcaacadbadaccbdcaabdcdbacdbbacdccb). Let min imum and maximum gaps are 2 and 4, respectively. We want to determine that whether X is a subsequence of Y, satisfying the gap-constraints. According to the Algor i thm CC — 1 above, at Step 1 we wi l l define L[a], L[b] and [d] by scanning Y. It is clear that L[a] = {4 5 7 10 12 18 19 25 30}, L[b) = {2 6 9 15 20 24 28 29 35}, and L[d] = {8 11 16 21 23 27 32}. First , at Step 1.2, we define E[a] as L[a]. Now, for i from 1 to 4 we wi l l define -ELX;]. When i = 1, we know X[l] = b. Merge E[a] and L[b]we h.a,ve Chapter 4. Algorithms 32 M S i = {2 4 5 6 7 9 10 12 15 18 19 20 24 25 28 29 30 35}, with flag(k) = I if k G {0,3,5,8,11,13,14,16} and flag(k) = e otherwise. For ea.ch k ranging from 0 to 16, from Steps 2.2 and its substeps, we know that E[ab] — {6 9 15 20 28 29}. B y the similar steps for i = 2,3,4, we know that E[aba] = {10 12 18 19 30}, E[abab] = {15 20} and E[ababd] = {23}. Note that E[ababd] ^ 0. This implies that X is a subsequence of Y. In fact, we can trace the process back to know that X = (F[12]F[15]y[18]Y'[20]Y'[23]). Remark. One should notice that there is an advantage to use notation £{X;] instead of E[i]. When we have a set of candidates C we need to check the containment of each one within a given customer. B y using these notation, we don't need to recalculate some of the E values. For example, suppose both abd and abc are candidates. To check the containments of them within a customer Y, we can share the values of E[a] and i?[a6] in the Algor i thm CC — 1, for both strings. 4.2.3 Correctness of Algori thm C C - 1 We now prove that Algor i thm CC — 1 is correct for determining whether a sequence X is a subsequence of Y wi th the given gap constraints. First , we show two trival facts: L e m m a 4.2.1 For all a, i, 1. L[a] is sorted. 2. E[Xi] is sorted. 3. E[Xi] C L[X[i]\. Proof: It is obvious from the Algori thm CC — 1. Chapter 4. Algorithms 33 L e m m a 4.2.2 For each j and each x G X, if j G L[x] then Y[j] = x. Proof: From the Step 1 of Algor i thm C C - 1 , we only add i into L[F[i]] . Therefore, if j G L[x] then Y[j] = x. For the rset of the verification, we first prove the following theorem: Theorem 4.2.1 For all i < \X\. The following hold 1. For each j G 22LX;] there exist j0 < ji < • • • < ji = j such that X,- = (Y[j0\ • • • Y[ji\) and min < jk+i — jk < max for all k — 0, • • • , i — 1, i.e. X{ is a subsequence of Y satisfying all minimum and maximum gap-constraints. 2. For each j such that there exist jo < j\ < • • • < ji = j with X{ = {Y[jo\ • • • Y[ji\) and min < jk+y — jk < max for all k = 0, • • • , i — 1, then j (E i?hX,-] • Proof: We prove the theorem by induction on i. Let i = 0. B y Step 1.2, we know that E[X0] = L[X[0]]. It is clearly that i f ; € E[X0] then X[0] = Y[j]. Therefore, XQ = (X[0]) = (Y[j]) is a subsequence of Y which obviously satisfying al l minimum and maximum gap-constraints. Further, i f X[0] = Y[j], we know that j € E[X0] since E[X0] = L[X[0]]. Therefore, both 1) and 2) hold for i = 0. We now assume that the theorem, holds for all k < i. Let j G 25[X^]. Note that the only step the algorithm can add j into 22[X;] is Step 2.2.2.1.1.1. From Steps 2.2.2, 2.2.2.1, and 2.2.2.1.1, we know that j G 2,[X[i]] and there is a k such that k G E[Xi-i] and k + min < j < k + max. B y the induction hypthesis, we know that there are Jo < ji < • • • < ji-i = k such that X ^ X = (Y[j0\- • • and min < j t + l - ji < max. Chapter 4. Algorithms 34 Further, X[i] = Y[j] a,nd min < j — k < max. It is then clearly that the conclusion of the part 1) follows if we let ji = j. Now we assume that j is an integer such that there exist j0 < j\ < • • • < ji = j wi th X{ = (Y[jo] • • • Y[ji)) and min < jk+\ — jk < max for all k = 0, • • • , i — 1, j € E[i]. B y the induction hypothsis, we know that £ E[Xi-i]. It is clear that j G and ji-i + min <j< ji-i + max. B y the Steps 2.2.2, 2.2.2.1, 2.2.2.1.1. and 2.2.2.1.1.1, the j is added into Therefore, part 2) follows. This completes the proof of the induction step and then the theorem. Clearly, the correctness of Algor i thm CC-1 is immediately followed from the following theorem: Theorem 4.2.2 Let n = \X\. 1. If E[X{] = 0 for some i < n, then X is not a subsequence of Y satisfying the gap constraints (as return of Step 2.3). 2. If E[Xn] ^ 0, then X is a subsequence ofY satisfying the gap constraints (as return of Step 2.4). Proof: Suppose JS[A^] = 0 for some i < n. One should notice that by Step 2.2.1 we know -S[A"j] = 0 for al l j , i < j < n. We can let i be the least such integer. B y part 2) of the above theorem, we know that there are no integers jo < ji <•••• < ji such that Xi — (Y[jo) • • • ^ [ i i ] ) and min < jk+\ — jk < max for all k = 0, • • •, i — 1, otherwise ji G E[i]. Therefore, X{ and so X is not a subsequence of Y satisfying the gap constraints. Suppose -B[A r 7 l] 7^ 0. B y part 1) of the above theorem, we know that X is a subsequence of Y satisfying the gap constraints. This completes the proof of the correctness of Algor i thm Chapter 4. Algorithms 35 CC-1. 4.2.4 Complexity Analysis In the present section we discuss the complexity of Algor i thm CC — 1. Let p and n be the length of X and Y, respectively. During the Step 1, there are two for-loops, one from 0 to n and another from 0 to p (since we want to check whether Y[i] is in X at Step 1.1.1.). Therefore, the running time cost is 0(pn). We now let I be the max length of L[a] for all a occuring in X. The worst case here is when / = n (then lots of L[a] wi l l be empty). During Step 2 and its substeps, as we know the cost for Step 2.1 is 0(1 + 1). The cost for Step 2.2 and its substeps is 0(1). Therefore, the running time for Step 2 is 0(pl). The total running time of Algor i thm CC — 1 wi l l be 0(pn) + 0(pl) < 0(pn). 4.3 Generalized Algori thm of Containment Checking 4.3.1 Algori thm Bridge One should notice that the Algor i thm CC — 1 can only apply to single-point events sequences. In the following subsections, we will describe a generalized version of CC — 1 which can be applied to regular sequences and the Containment we defined in the last chapter. For the single point events, the gap between two events is the different between the two time-indexes. For the regular events, the gap between two events is the different between starting time of second event and the ending time of first event. App ly ing the Chapter 4. Algorithms 36 algorithm C C - 1 , starting from first event we can get the starting time for next events which satisfy the gap-constraints. But before applying the next step of C C - 1 , we have to get the ending times of all these events. This extra step is the main different between the algorithm CC-1 and the general algorithm of containment checking. Before describing the whole algorithm, we isolate and discuss a central piece of the Algor i thm CC — 1. Given two lists of numbers E and S, this part of the algorithm wi l l calculate all members s in S such that there is /. £ E with min < s — t < max. The algorithm Bridge is actually the portion of Steps (and substeps of) 2.1 and 2.2 in the Algor i thm CC — 1. Input: Sequences of numbers E and 5" and non-negative integers m m , max. Output: A set E* = Bridge(E, S, min, max) which contains all numbers s in S such that there is a t £ E with min < s — t < max. We may use Bridge(E, S) to denote Bridge(E,S, min , max) when the maximum and minimum gaps are clear. Algor i thm: Bridge(E, S) / * Merge two sorted arrays E, S as a new sorted array MS */ 1. MS = Merge.Sort(E,S)\ 2. For (k = 0;k < \MS\;k + +) { 2.1. If (flag (k)=s) skip to next k; 2.2. If (flag (k)=e) { 2.2.1. For {u = k + l\u<k + max;u + +) { 2.2.1.1. If [flag{u) = s)k(MS[k] + mm < MS[u] < MS[k] + max) Chapter 4. Algorithms 37 2.1.1.1.3. A d d u into E*; } / * end IF * / } / * end For-u */ } / * end IF * / } / * end For -k*/ 3. Return E*. 4.3.2 Algori thm CC - 2 As mentioned above, we now describe an algorithm for a general containment checking. To make the new algorithm CC — 2 below more readable, we define ConvSort as a separate procedure. Given a sorted array E of numbers and a sorted array L of pairs of numbers (s,t) (sorted by.s and then t), ConvSort wi l l yield a sorted array of numbers t such that there \s a s G E with (s, t) £ L . It is clear that the complexity of this algorithm wi l l be 0(\E\ + \L\log(\L\)). Input: Sequences X and Y, where X is a sequence of letters and Y is a sequence of (y,s,t) where y is a letter, and s,t are integers; M i n i m u m and maximum gaps, min and max. We wi l l use Xl to denote X[Q] • • • X[i], the prefix of X wi th length i + 1. We also denote Y[i] as ( j / i , S i , t i ) . Output: 0 or 1. 1 indicates that Y contains X, i.e., there a,re z q , • • • , z p _ i such that X = (Vio' •" 2/«p-i)> m m < _ < m a a : f ° r all j = 0, • • • ,p - 2. Chapter 4. Algorithms 38 Algor i thm: get -Containment(Y, X) 1. / * For each letter a, in X, create list names L[a] and S[a]. L[a] is a sorted double array of indexes such that yi = a. and S[a] is a (sorted) collection of all such Si */ 1.1. For (i = 0; i < \Y\; % + +) { 1.1.1. If (t/t- is in X) then 1.1.1.1. add (si,ti) to L[yi] and si to S[yi\; } /* For each prefix Xi of X, create a list name E\Xi\. E\Xf\ is a sorted array of numbers which witness the positions in Y at which a substring Xi that is contained in Y, terminates: E[Xi) - {j\(3k0 < • • • < ki)(Vu)[ki = jkX[u] = ykukmin < sku+1 - tku] < max}. The initial value of E[Xi] is empty set */ 1.2. E[X0] = ConvSort(N,L[X[0]}); 2. For (i= 1; i < \X\; i + +) { /* Apply Bridge for JE^.J] and S[X[i}]. */ 2.1. E*[Xi] = BridgeiEiXi-ilSiXli]]); 2.2. E[Xi) = ConvSort(E*[Xi],L[X[i\]); 2.3. If E[Xi] == 0, return 0; Chapter 4. Algorithms 39 /*Y doesn't contain X satisfying the gap-constraints. */ 2.4. If (t = = | A | ) , return 1; /*X is a subsequence ofY satisfying the gap-constraints. */ } I*end For-i */ There are two steps in algorithm get_containment(Y,X). The first step generates two lists. The first list is a collection of (s,t) for each a such that (a,s,t) occurs in Y. The second list is a list of all s for each a such that there is a t such that (a ,s , t ) occurs i n Y. The purpose of these lists is to caculate the gap between events. The second step consists of a finite loop. For each ini t ia l sequence of X, the algorithm checks whether the ini t ia l sequence is contained in Y. If the asnwer is yes, the algorithm also keeps track of positions of the init ial sequence in Y. It is tr ivial that when the in i t ia l sequence is X itself, the algorithm knows whether X is contained in Y. There is a difference between this algorithm and the algorithm CC — 1 we have discussed in the last section. In this algorithm, we are using a pair (s,t) to define the position of an event. As we discussed in the last chapter, the gap between two events (a,S\,ti) and (6,32,^2) is defined as s 2 — ti- The 75*[X,-] is a collection of s in S"[X[z]] such that there is t £ E[Xi-i] such that min < s — t < max. But the 75*[X;] isn't the final 75[X;] we need to start the next stage of the loop. We have to use a procedure ConvSort to get 75LY;]. In C C - 1 , the E*[Xi] and E[X{] are the same. 4.4 Algorithms for Finding Strong Sequences We now describe several general algorithms for solving the sequential patterns problem. We wi l l use a method wi th a combination of top-down and roll-up. The basic approach Chapter 4. Algorithms 40 is the conventional one which we wil l discuss first. The rest of the section presents the pseudo code for the general algorithm for mining Generalized Sequential Patterns with T i m e Constrains ( G S P T C ) and its optimizations. 4.4.1 Algorithm: G S P T C Input: Database a. time-tree /C, a length tree C, an event taxonomy Q, integers min and max (minimum and maximum gaps), and min imum support function / . Output: The strong sequences of events in V. 1. For (I = 1; / ^ 1 A (S[l - 1,1] ± 0 h I < max.level); I + +) { /* Get candidate sequences at level I */ 1.1. if (I = 1) then 1.1.1. C [ l , l ] = get-candidatesequences()] 1.2. else 1.2.1. C[l, 1] = get-candidatesequences(S[l /* Get all strong sequences at level I */ 1.3. S[l,l] = get-strong_1_sequences (D,l); 1.4. For (k = 2; (S[l, k - 1] ^ 0)&(Z = 1 V S[l l , A : ] ^ 0 ) ; f c + + ) { 1.4.1. i f (/ = 1) then C[/,fc] = get-candidatesequences(S[l,k — 1]); Chapter 4. Algorithms 41 1.4.2. else C[l, k] = get_candidate_sequences(S[l, k — 1], S[l — 1, &]); 1.4.3. For each customer s £ D { 1.4.3.1. Es = get_containment(C[l,k],s); 1.4.3.2 For each candidate c £ Es, do c.support + +; } / * end for each customer s */ 1.4.4. S[l,k] = {c £ C[/,/c]|c.support > / (c )} ; '. } / * end for k */ 1.5. S[l] = UkS[l,k]; } / * end for / * / The algorithm makes multiple passes over the database. The first pass determines (1) whether a given generalized event at the first level occurs in the user-specified sub-database, and (2) the support of the event if the answer of (1) is yes, that is, the number of customer' sequences that contain the event. After the first pass, the algorithm has determined which events at the first level in the database are frequent. A t each level, we first compute the strong sequences S[l, 1], the strong events (sequence with length 1) at level /. Each subsequent pass starts with a seed set: the frequent sequences found in the previous pass. The seed set is used to generate new potentially frequent event sequences, called candidate event sequences. We wi l l use both strong se-quences at level / with, length k — 1 (if A; > 1) and strong sequences at level / — 1 (if / > 1) with length k to generate the candidates at level / with length k (Steps 1.4.1 and Chapter 4. Algorithms 42 1.4.2). The keys to generate new candidates are: 1) each contiguous subsequence (which wi l l be defined below) of a strong sequence is strong, and 2) each sequence which is a generalization of a strong sequence is strong. A t the end of the pass, the algorithm de-termines which of the candidate sequences are actually frequent (strong). These frequent candidates become the seed set for the next pass. The structure of this basic algorithm indicates that the algorithm wi l l scan the database once for each level and length. The outer loop in the algorithm iterates among the level and the inner loop creates and checks sequences of any length within the same level unti l no more strong sequences can be found. F ina l ly the intermost loop iterates over the customer records in the database. The main goal of the optimizations is to reduce the number of databse passes required by the algorithm without increasing the memory requirements. However, before describing the optimizations of the basic algorithm, we describe how the sub-procedures mentioned in the main algorithms are used. The procedure get_strong_l-sequences (DB,k) com-putes the strong events at level k. The procedure get ^strong _/^sequences (DB,k) can be implemented in a straightforward manner. We need to be careful that there is a l imit of memory size. This is the main reason we use the top-down method to count the sup-ports. We wi l l discuss more of the algorithm that generalizes the candidates later on. The procedure get ^containment is used for containment checking. Given a collection of canadidates C[l, k] and a customer s, the procedure get-Containment(C{l, k],s) generates a subset of C[l,k] which contains the event sequences supported by the customer s. The algorithm used for this procedure is based on the algorithm CC — 2 which is discussed in the last section. Given two collections of event sequences S and P (we allow nul l value for one or both of S and P), the procedure get^candidaie^sequence(S, P) generates a set of candidates such that each candidate is an extention of sequences in S (if S isn't Chapter 4. Algorithms 43 null) and there is a event sequence in P (if P isn't null) which is a generalization of this candidate. The implementation of the procedure get ^candidate-sequence is based on the following important obersvations. Definition 4.4.1 Suppose B is a subsequence of an event sequence A = {(Ai, S\, T\, l\), • • • (An, Sn,Tn,ln)). B is a contiguous subsequence of A if one of the following holds: 1. B is obtained by dropping the first or last event of A. 2. B is a contiguous subsequence of C which is a contiguous subsequence of A. Given a sequence L. Let ~L and L~ denote the resulting sequences by dropping the first event of L and the last event of L respectively. Given two sets of strong sequences S[l,k — 1] and S[l—1, k] (one of them may be null) , the procedure gei_candidate-sequences(S[l, k — l],S[l — l,k}) generates the candidates C;,fc. This procedure consists of two phases: Join Phase. A event sequence is generated which is a join of two sequences L\,L^ for each pair Li,L2 € S[l,k — 1] such that ~L\ = • A temporary set C is created which contains event sequences which are generated during the Join Phase. In ca,se that k — 1, the S[l,k — 1] has null value and the temporary set C is the set of al l possible events at level I. Prune Phase. Delete candidate sequences from C that satisfy any of the following: 1. k > 1 and the candidate sequence has a contiguous (k — l)-subsequence which is not strong, or 2. / > 1 and one generalized sequence of the candidate sequence is not strong. Chapter 4. Algorithms 44 The resulting set after deleting sequences through the Prune Phase is the set C[l,k]. In the next section, we wi l l denote the resulting set of this procedure from a set S of strong sequences as C(S). The reader should note that the G S P T C is based on the traditional algorithms that deal wi th similar problems, see [2] a,nd A 4 . Most of the improvements discussed in these papers can be applied to our algorithm, but we will not discuss them here. The follow sections wi l l focus more on some new optimizations for the G S P T C algorithm. These optimizations include Diagonalization, Minimal Sequences, Pairwise Composition, Re-use Previous Computations and Prefix Sharing, etc.. 4.4.2 Diagonalization C [ i , i ] / s [ i , i ; C[l,2]/S[l,2] C[l,3]/S[l,3] levbl C[2,l]/S[2,l] C[2,2]/S[2,2] C[2,3]/S[2,3] C[3,l]/S[3,l] C[3,2]/S[3,2] C[3,3]/S[3,3] Figure 4.3: Level-Major Process Chapter 4. Algorithms 45 The G S P T C algorithm which was presented in the previous section generates and tests for frequent event sequences at one level at a time, starting at level 1. Wi th ing a level /, the algorithm will generate and check the candidate sequences of any possible length before it proceeds to the next level. Moreover, at any level, the algorithm wi l l create and test all the possible candidates of length k before it proceeds to sequences of length k -f 1. Also, for each testing the algorithm scans the da.ta.base at least once. This order of candidate generation and testing is called level-major order. See Figure 4.3 for more details. In Figures 4.3, 4.4, and 4.5, C[i, j] represents the candidates set of length j at level i. S[i,j] represents the set of strong sequences of length j at level i. Alternatively, we can create the candidate sequences in a length-major order. In this case, we process all possible candidates of length k before we proceed to length k -f- 1. Wi th ing a length we process the candidates level by level, starting at level 1. This process is shown in Figure 4.4. When a level-major order is used, the algorithm needs to store all the strong sequences of the current level before it proceeds to the next. This is because in order to generate candidates of length k at level / we need the strong sequences of length k — 1 and level / — 1. Usually, a number of tables is used to store these data. Each table stores the strong sequences of the same length. Similarly, when the length-major order is used, the algorithm needs to store the strong sequences of length k (of any level) before it proceeds to length k + 1. Independent of which method we use, the number of tables and sequences we need to store in the memory may be very large. Many of these sequences wi l l be discarded after the next iteration as being implied by the longer or more specific new strong sequences. Furthermore, in both level-major and length-major process, the number of database scans is at least the number of candidates [i,j] such that C[i, j] isn't empty. Since the cost of a database scan is very high, a better strategy wi th a similar ie Chapter 4. Algorithms length C[l,l]/S[l,l] C[l,2]/S[l,2] C[l,3]/S[l,3] C[2,2]/S[2,2] C[2,3]/S[2,3] C[3,l]/S[3,l] C[3,2]/S[3,2] C[3,3]/S[3,3] Figure 4.4: Length-Major Process Chapter 4. Algorithms 47 vwrrwe memory over-head is required. C[l,l]/S[l,l] diagonal C[1,2]/S[l,2] C[l,3]/S[l,3] C[2,l]/S[2,l] C[2,2]/S[2,2] C[2,3]/S[2,3] C[3,l]/S[3,l] C[3,2]/S[3,2] C[3,3]/S[3,3] Figure 4.5: Diagonalization Process In the following we present an algorithm such, that the maximal number of tables which are necessary to store the current strong sequences in memory and the maximal number of database scans is m such that mjl+k, where 1 and k are the largest level and length for which S[l, k] isn't empty. This method, significatly reduces the number of necessary tables to store strong sequences and the number of database scans. Here we use a table to store a collection of sequences which are at the same level and with the same length. Here is how the diagonalization algorithm works: it first generates the strong sequences of length 1 at level 1. It then generates candidate sequences C[l ,2] of length Chapter 4. Algorithms 48 2 at level 1 and candidate sequences C[2,1] of length 1 at level 2. The reader should note that only these two candidate sets are depended on £ [1 ,1 ] . The diagonalization algorithm then generates strong sequences £[1,2] of length 2 at level 1 and strong se-quences £[2,1] of length 1 at level 2 by a single database scan. From these two sets of strong sequences, the algorithm generates the candidate sets that are on the third diag-onalization: C [ l , 3], C'[2,2] and C[3,1]. This process is continued unti l there is no strong event sequences for the whole diagonal. See Figure 4.5 for details. Here is the diagonal algorithm: 1. For (/ = l ; £ [ / - l ] ^ 0 ; / + + ) { /* Get candidate sequences at level I */ 2. For (k = /; k > 0; k ) { 2.1. if (/ = 1) then C[l,k] — get-candidate-sequences(); 2.2. else if (k = I) then C[l,k] = get-candidate-sequences(S[l,k — 1]); 2.3. else i f (k = 1) then C[/,/c] = get_candidate-sequences(S[l — !,&]); 2.4. else C[l,k] = get_candidate_sequences(S[l,k — !],£[/ — !,&]); 2.5. For each customer s G D { 2.5.1. Es = get-.containment (C [I, k],s); Chapter 4. Algorithms 49 2.5.2. For each candidate c £ Es, do c.support-Jr-\-; } / * end for each customer s */ 2.6. S'f/jfc] = {c £ Ck\c.support > / (c )} ; } / * end for k */ 3. S[l] = UkS[l,k]; } / * end for / * / 4.4.3 M i n i m a l Sequences One of the key problems during implementation is that the current candidate sets contain heterogenuous sequences, that is, sequences with components at different levels. For instance, consider an event E = (e ,5 , T , L) at level 1. Support that at least one of e, S, T, L isn't at level 1. Then we know that there is an event E' such that level(E') = 1 and E ^ E'. Given E = (e,S,T,L) such that all e, S, T and L are at level 1, there are (2 4 — 2) = 14 level 1 events E' such that E ^ E'. If E is a strong sequence, so is E'. Furthermore, when the level(E) or length(E) is bigger, the number of these E' is growing at a very high rate. For this reason, when we generate candidates set from a set of strong sequences, the size of the candidate set becomes very big. The traditional method to generate a candidate set from strong sequences becomes very costly in this case. In this section, we present a partial solution to this problem. We say "part ial" because the complete solution in our mind on this direction isn't satisfied in theory. Due to the t ime constraints, we can't, invest more t ime on this direction. We expect that this method wi l l lead to a solution with good performance, and we propose further investigation in Chapter 4. Algorithms 50 the future. Given a set S of sequences. Let M ( S ) be the minimal subset of S such that for every sequence A £ S there is a sequence B £ M ( S ) such that B -< A. It is clear that M ( S ) consists of exactly all min imal sequences in S. We now define a function FJ on sets of event sequences. First , we define the forced join of two event sequences S i and S2 of same length (denoted as FJ(S\, S2)). If P is the min imal sequence such that ~ S i ^ P and S^ di P, FJ(Si,S2) is the sequence obtained by concatenating the first event in S i , with the sequence P and the last event in S2. Given a set of event sequences M of same length, the forced join of M (denoted as FJ(M)) is the collection of FJ(A,B) such that A , B £ M. It isn't difficult to show that the following theorem holds (see Figure 4.6). Theorem 4.4.1 Given an event sequence S, then M ( C ( S ) ) = FJ(M(S)). Proof: The proof is obtained from the commutativity of the following diagram. In the rest of this section, we describe an algorithm which is called " M i n i m a l Sequences". This algorithm provides a partial solution to our problem. As we discussed above, the size of candidates set is usually very large. Our partial solution allows us instead of calculating the supports of the whole set of candidates, to just calculate the supports of the minimal candidate sequences. As we already know, the size of a set of min imal candidate sequences is significantly smaller than the size of the whole candidate set. B y the theorm above, we know that we can generate the set of min imal candidate se-quences ( M ( C ( S ) ) ) from M ( S ) through the procedure of force join without calculating C ( S ) . In algorithm "Min ima l Sequences", we will replace every candidates set of length k > 1 by the set of its min imal candidate sequences. For example, if E ^ E' are both Chapter 4. Algorithms comparable? join (C) minimal (M) minimal (M) M ( S ) forced jo in (FJ) FJ(M(S)) =M(C(S)) Figure 4.6: M i n i m a l Sequences and Forced j o i n levb C'[1,2] = FJ(M(S[1,1])) S ' [ l ,2] C'[2,2] = FJ(M(S[2,1])) S'[2,2] Figure 4.7: M i n i m a l Sequences Process Chapter 4. Algorithms 52 candidate sequences where E is a min imal sequence, it is possible that E' is strong but E isn't. Therefore, we generate only part of the strong sequences. The performance is very good but this method does not generate all the answers. It can only be used to find some partial results in some situations in which very quick results are required. For example, we can apply this algorithm to data mining for real-time e-commerce applications in which the mining process is performed real-time and speed of the process is very crit ical . For example, an on-line book store may recommend several books based on a person's profile and shopping history. This recommendation list is partial and time-crit ical. In this case, we can use " M i n i m a l Sequences" algorithm to complete this task. The process of algorithm "Min ima l Sequences" is described in Figure 4.7. Furthermore, we still can calculate the complete solution through one of the two ways: 1. generating whole candidate set from FJ(M.'(S)). 2. finding the strong minimal sequences and then generating a collection of candidate sets which are not generilization of a strong minima,] sequence. There is an algorithm in our mind to generate the complete candidate set from FJ(M(S)) but it is very expensive in time. There are sti l l lots space we come continue to work on second method. 4.4.4 Other Optimizations In section 4.3, we ha,ve discussed a procedure, CC — 2, which is used to check, for a given (generalized) event sequence X and a customer's event sequence Y, whether Y contains X. We used this procedure within G S P T C where we need to count the support for a candidate. Since there may be more than one candidate in the candidate set, we can Chapter 4. Algorithms 53 use some optimizations to speed up the checking procedure. Firs t , we introduce some notations. From now on we fix Y to be a sequence of (y,s,t) where y is a letter and s,t are in-tegers. Further, we use min and max to represent the minimal and maximal supports for the mining problem. In the following, a,b represent letters and X represents a se-quence of letters. Following what we have used in Algor i thm CC — 2, let E(X) be E(X\x\) which was defined during the Algor i thm CC — 2 in Section 4.3.2. Intuitively, E(X) is a set of ending time of the last letter in X which witnesses that X is con-tained in Y. Let Py(a) to be the set of (s,i) such that (a,s,t) occurs in Y. In the following we wi l l drop the subscript Y if the context is clear. Given two sets Pi,P2 which are subsets of (N X N)<UJ, the set of al l finite sequences of pairs of integers, we define Pi N P2 to be the set S = ((si, t i ) , • • • , ( s n , £„)) such that there is an i wi th ( (s i , t i ) , • • •, (s,, U)) e Pi and ((s,-, £,•), • • • , (sn, tn)) 6 P2. Similarly, we define Pi © P2 as the set S = ( (s i , • • • , ( s n , tn)) such that there is an i with ( (^ i , ti), • • • , (s;, 6 P 1 ? { ( 5 , - + i , t i + 1 ) , • • •, (sn,tn)) e P2 and min < sl+i — U < max. Using on these operations and the definition of P(a), we ca,u recursively define P(X) for a sequence X of letters: P(ai • • • an+i) = P(ax • • • an) 0 P(an+i). The following theorem is obvious. Theorem 4.4.2 Given a sequence ai • • • an+i. 1. P(ai • • • an+i) = P(ai • • • an) N P(anan+1). 2. P(ai • • • an+i) = P{ai) © P(a2) © • • • © P(an+1). Chapter 4. Algorithms 54 3. E{X) = {t:{3cL,s)[{a,{S,t))eP{X)}. 4- Y contains ai • • • an+i if and only if Ps(ai • • • a„+i) / 0. We can now start, to describe several optimizations for the algorithms we have developed in the last few sections. As we have done before in Algori thm CC — 1 and Algor i thm G S P T C , we continue to treat the resource data for a customer as Y and a generlized event as a letter in previous description. We now have set the background to discuss in more detail these optimizations, and how they can be used by algorithms to enhance the performance. Our first two optimizations are generated from the first two properties of the above theorem. Pairwise Composition by Join. This algorithm is described as following: 1. First , for each level / we find the candidate set C[ / , 1] of length 1 we did as in G S P T C , 1. e., C [ l , l ] = get-candidatesequencesQ or C[l, 1] = gei-candidatesequences(S[l — 1,1])-2. Second, we use a strong version of the procesure getstrong-lsequences to get al l the strong events at each level, as well as to complete the set Py(a) for each customer Y and each strong event a. 3. Final ly , by part 2) of Theorem 4.4.2, we know that P(a,i • • • an+x) = P^a^) N P(a2as) N • • • [x] P(anan+i). Further, by part 4) of Theorem 4.4.2, a customer s supports ax • • • an+\ if and only if Ps(ai • • • an+i) isn't empty. Therefore, the rest of new G S P T C the algorithm can described as follows: 1.4. For (k = 2; (S[l, k - 1] ^ 0)&(Z = 1 V S[l - 1, k] + 0); k + +) { 1.4.1. if (I = 1) then Chapter 4. Algorithms 55 C[/,fc] = get_candidaie-Sequences(S[l,k — 1]); 1.4.2. else C[l, k] = get_candidale-sequences(S[l, h — 1], S[l — 1, k]); 1.4.3. For each customer s G D { 1.4.3.1. £ , = { c G C[/,/c] : P s (c) ^ 0}; 1.4.3.2 For each candidate c G i ? s , do c.support + -f; } / * end for each customer s */ 1.4.4. S[l,k] = {ceC[l,k]\c.support > / (c )} ; } / * end for fc * / 1.5.S[l] = UkS[l,k]; } / * end for / * / Comparing to the original G S P T C , there are two major changes: first, in the procedure that finds all strong events, we find all the detail information how these strong events are supported by the data. Second, instead of a procedure to check the containment for length > 1, we only need to check whether Ps(c) is empty to determine whether s supports c. Obviously, to check whether Ps(a,i • • • an) is empty is very t r ivial if we know all Ps(a,-) (i < n). The problem is whether and how we can store all information of Ps(a) for all customers s and all strong events a. If we can store all these information in memory or in disks, the performance of this algorithm wil l be very good. W i t h the prices of memory and hard disks are continuing to fall , this approach is acceptable for some time-critical Chapter 4. Algorithms 56 applications in which the database size is relatively small. These applications include e-commerce applications since most of companys' the'on line transactions are st i l l small. Incremental Reuse of Results of n — 1 for n. In this algorithm we use part 1) of Theorem 4.4.2: P(cii • • • an) = P(ai • • • a„_ i ) txl P ( a n _ 1 a n ) . This algorithm is similar to the previous one. The only difference is that we only apply part 1) of Theorem 4.4.2 for the candidates of lengths 3,4, etc. Therefore, the algorithm for finding strong events wi l l be exactly the same as original G S P T C (not as same as the previous one). Dur ing the procedure that finds all strong sequences of length 2, we wi l l calculate Pj(c) for each customer and each length 2 sequence c. The advantages and disadvantages for this algorithm would be the same as the previous one: once we can store all information of Ps(c) in memory or in disks, the preformance wi l l be good. We now describe three optimizations which are based on how we caculate Es(a) for each customer s and each candidate a in G S P T C . For a fixed level / and length k, in G S P T C we caculate Es(a) for all candidates a of length k at level / for each customer s. In G S P T C the procedure to caculate Es(a) is independent for a. Knowing that, when we calculate the Es[a/3], we calculate the Es[ce] first, and keep it for later to be used for all candidates that share the same prefix a with Es[af3]. This way can reduce the duplication of caculating Es(a) again. In the following algorithm we use this observation. P r e f i x Sharing. Also, note that sequences a/3 and cry where \a/3\ = \ct~f\, share a. In some sense, when we calculate E[a(3], we need to calculate the E[ct] first. When we calculate E[a^], we don't need to re-calculate the E[a] again, if we keep it from the previous calculation. Chapter 4. Algorithms 57 The structure of Algor i thm with Prefix Sharing wi l l be the same as the structure of G S P T C in section 4.4. The major difference wi l l be the application of containment algorithm CC — 2 (see 1.4.3.1 in G S P T C ) . Instead of applying CC — 2 for each individual candidate X, in this version we will apply CC — 2 to all candidates at the same time for each fix Y (customer). To do this we need to use a tree structure to represent the collection of candidates. We expect that this optimization will improve the preformance of G S P T C . This al-gorithm doesn't increase the requirements of memory size but reduces the number of applications of containment checking procedure which usually taking a lot of t ime. Since we implemented our G S P T C algorithm without using tree structures to represent the set of candidates, we can't implement this algorithm as a simple variant of our implemention of G S P T C . Due to the time limitations, we haven't test the preformance of Algor i thm with Prefix Sharing. We expect that this algorithm will have better results, in particular if it used together with the Diagonalization technique. In the following version, we optimize the containment checking (see 1.4.3.1 in G S P T C ) by applying similar techniques as those we used with the Prefix Sharing. B y examining the properties of E, you can discover that there are certain limitations of E(ot) for some a's. Based on these observations, we can reduce the size of the possible values of E(cx) even before we caculate them. Re-use Preview Computations. Because of frequency constraints, when a/5 is a candidate pattern, both a and (3 are frequent for all a, (3. Further, E(a(3) C E((3). Therefore, we can caculate E(a(3) wi thin E((3). In the next chapter, we wi l l present some of the implementation issues and compare the Chapter 4. Algorithms 58 results we have obtained by implementing and testing the algorithms, G S P T C , Diago-nalization, M i n i m a l Sequences and Pairwise Composition by Join. Chapter 5 Implementations and Results In this chapter, we describe the actual implementation of four sequential pattern data mining algorithms which we have discussed in the previous chapter and evaluate their performance based on our experimental results. 5.1 Sample Data and Experimental Environment The examples of sequential patterns we are interested in involve starting time, ending time, duration and event conceptual hierarchies. These patterns occur very often in stock market and computer system management, and in particular, in the telecommincations customer data. Initially, we believed that we could easily find some sort of data related to telecommunications customer service on-line or directly from the local telephone com-pany. It turned out that no such data were available to public. Since the customer business data is' highly sensitive, it is difficult to get these data even for only edcuational and academic use. We then turn our attention to the system logs of the U B C Computer Science Department. We managed to acquire aset of modified system telnet log files. We wrote a special program to convert this log file into the format of data which can be used directly by our programs. This special program re-organizes the log file so that 59 Chapter 5. Implementations and Results 60 1. Users and events are identified by non-negative numbers. 2. The date and't ime are in the standard format. 3. The dates and times of login and logout of the same event of the user were stored in an same record, together with the user id and basic event id . 4. The data records were ordered by user ids and then by starting time. Our experiments were conducted in a time-sharing S P R A C - 1 0 environment. The memory size was 128MB and the typical number of users was about five. We scheduled our experiments mainly at night so that the workload for other processes was relatively low and uniform. 5.2 Issues of Implementation 5.2.1 Data Structures We implemented the four of algorithms we have discussed in last chapter with C + + . One of the reasons to choose C + + as the implementation language is its classes and object-oriented programming mechanism. One major decision in the implementation was the structure of the classes, and what data structures should be used in representing basic events, time, event length and generalized events (with starting/ending times and length). After performing some analysis of the algorithms, we decided to implement our two basic structures of sequences and table as abstract data types. A generalized event is represented as an array of eight integers with a special meaning in our example. The first digit represents an basic event. For instance, 1 represents the event of "ppp". The next three digits represent the starting time when the basic event. These three digits Chapter 5. Implementations and Results 61 represent season, week and hours respectively. The next three integers represent the ending time of the event. The last integer represents the duration of the event. The sequence class consists of some basic and useful methods and operations on arrays of events. For example, we can check, given two sequences, whether they are comparable, and what is the join if they are. The table class consists of operations and methods for handing arrays of sequences. For the implementation of each algorithm, there are operations and methods related to the particular implementation. Different algorithms may implement different methods and operations using the data and methods of sequence and table classes. For example, the implementation for checking containment may be different from one algorithm to another. 5.2.2 Memorization One of the key problems during the implementation was memory shortage. Since we are dealing with four conceptal trees for the components of an event (basic events, starting time, ending t ime and length), the number of possible combinations of these components are huge, even on a small data set. Further, the number of possible candidates for length n + 1 which can be generated from sequences of length n is huge since each event consists of eight digits. As we wi l l see later, the algorithm that uses Composition is unsuccessful since in this algorithm the details of how a user supports a sequence is stored in memory, which is almost impossible once the data set is big. 5.3 Performance In the current section, we summarize the testing performance of al l four algorithms. One should notice that the algorithm " M i n i m a l Sequences" is not a complete solution Chapter 5. Implementations and Results 62 GSPTC Diagonalization Composition Minimal Sequences Data 1 93.29 67.81 virtual memory exceeded 17.27 Data 2 99.27 72.39 vir tual memory exceeded 18.96 Data 3 224.37 144,79 104.44 38.16 Table 5.3: C P U time with support = 0.8 to our problem. It starts with minimal strong events and produces only min imal strong sequences. We expect this algorithm to be much faster than others and we use it here as a reference. In fact, in some situation like v. commerce applications, this algorithm can be useful. Another reason is that we hope that this ini t ia l step wi l l lead to a new optimized algorithm. We have analyzed some possible solutions by using min imal strong sequences. But results of theoretical analysis of these solutions aren't good enough. Due to time and scope l imitat ion, this method has been left out for future work. In Table 5.6, we compare the number of candidates and number of min imal strong sequences we actually discovered at each stage. Our first set of experiments compares the various algorithms using same support but different data sets. In our tests, we used three datasets. We test all four algorithms with three data sets. Whi le we experimented with various databases, the results cited below are based on thres da.ta,bases of 500 - 1200 transactions. Table 5.3 includes running times when support is 0.8. Our second set of experiments compares the various algorithms using same data set and different supports. We include Table 5.4 to help illustrate our results. The rows correspond to running time for the corresponding support. The next set of experiments compares the various algorithms a,nd methods using a same data set and support. There are two main steps in the process of discovering strong Chapter 5. Implementations and Results 63 GSPTC Diagonalization Composition Minimal Sequences Support: 0.7 1184.48 1108.31 virtual memory exceeded 29.55 Support: 0.8 93.29 67.81 104,44 17.27 Support: 0.9 63.71 39.31 34,76 11.37 Table 5.4: C P U time with same dataset (Data 1) GSPTC Diagonalization Composition Minimal Sequences Candidates 69.56 59.87 N / A 2.57 Strong 1114.55 1048.43 N / A 26.96 Table 5.5: Times for particular methods with support = 0.7 sequences. The first step is to generate a candidate set. The second step is to deter-mine which candidates are strong. Table 5.5 includes the time of these steps for each algorithms. It shows that the majority of running time are used to determine which candidates are strong. Final ly, we compare the performance between a regular algorithm and the algorithm which is only interested in finding the minimal sequences. We showed how many can-didates we start with and how many min imal strong sequences have discovered at each level. Each entry is of the form a : b where a is the number of candidates, b is number of minimal strong sequences discovered by the corresponding algorithm, and c is number of strong sequences discovered by G S P T C . GSPTC Minimal Length 1 Length 2 Length 3 Length 1 Length 2 Length 3 Level 1 1023: 20 4624: 64 17576: 0 1023: 20 400: 16 64: 0 Level 2 7892:20 14884:4 0: 0 1885: 20 400: 1 0:0 Level 3 11364:3 2401:0 0:0 3940:3 9: 0 0:0 Table 5.6: Numbers of Candidates, M i n i m a l Sequences Chapter 5. Implementations and Results 64 5.4 Summary To summarize, among all the various algorithms (except " M i n i m a l Sequences") discussed in Chapter 4, Diagonalization is the most efficient. In many case, the speedup is 40% comparing with the standard algorithm. Algor i thm Composition is very efficient when the data set is small , but not in practice in general clue to the memory stortage. It is understandable that when we have large data sets, the number of users and number of their transactions that support a strong sequence is big. Therefore, the size of information we should keep in memory is big. The algorithm " M i n i m a l Sequences" is very fast but it only produces partial results. In cases in which we not need the complete set of results and the speed is very important, we may use this algorithm. A useful example is an on-line data mining application for e-commerce web sites. In this case, we may not need to find out all users' shopping patterns but every singlr pattern we use on the web site should be correct. Chapter 6 Conclusions and Future Works 6.1 Conclusions W i t h the rapid growth in size and number of available databases used in commerical, in-dustrial, administrative and other applications, it is necessary and interesting to examine how to extract knowledge automatically from huge amount of data. Data mining, which can be defined as an automatic process of extraction of impl ic i t , previously unknown, and potentially useful information from large database, has recently generated a high degree of interest. Sequential patterns are among the most popular data mining patterns discovered in large temporal database. A classical sequential pattern has the form of" X\ =>• Xi =£•••• =^ Xn, where < Xi, • • •, Xn > is a list of events that supported by at least minsup of the users. The motivation of mining sequential patterns is to find out what events are closely related to each other and in what order they appear. In the existing framework of sequential patterns problem, the user can get a sequential pattern of basic events. However, this classical sequential pattern mining model has the following limitations: 1. The time isn't one of the attributes involved in the sequential patterns. 2. Patterns for specific time periods were not addressed. For example, one may be 65 Chapter 6. Conclusions and Future Works 66 interested in the sequential patterns that involve only Sundays' information of a telecommunication database. 3. The length of time an event occurs was not addressed. Event length is a very crit ical attribute in many application, like, telecommunication, W W W , etc. 4. In both [4], [5], a transaction-type sequences of events were considered. In general, (e.g., the processes in the operating system), the events may occur in a very complex manner. In this thesis, we formulated a problem of sequential patterns which is a generalization of the frameworks in [4] and [5]. We developed several algorithms to solve the problem and implemented four of them. The resulting algorithm, called Diagonalization, is based on the immediate dependenceies of candidate sets and strong sequences. The Diaganal-ization has same memory requirements as G S P T C , the simple algorithm, has requires minimal database accesses. We found that Diagonalization clearly outruns all the other algorithms (except of M i n i m a l Sequences which does not produce a complete solution to our problem). 6.2 Future Works Whi le the work presented in this thesis is significant, there are several interesting issues have been left out for future work. They are discussed below: 1. The definition of min imal and maximal gaps: In Definition 3.1.4, we define a gap between two events (ei , sn,ty, ly) and (e 2, 62,"/-2, h) a s Chapter 6. Conclusions and Future Works 67 Our algorithm of containment checking is based on this definition. This means that the second event occurs only after the first event is finished. B u t one can consider the gap as the distance between the starting times (or ending times). Therefore, one can repla.ce the condition (b) in Definition 3.1.4 as one of the following (or both): • min < sv,, — Si - < max: • min < ti-,, — t{- < max. 2. The follow up of M i n i m a l Sequences algorithm: There are two ways we can modify the M i n i m a l Sequences algorithm to a full solution to the problem of mining sequen-t ial patterns. First , starting from the set of min imal candidates M(C(S)) (which is the same as SJ(M(S)) by Theorem 4.4.2) we generate a set of candidates C(S). One should notice that every sequence in C(S) is a generalization of a sequence in M(C(S)). Therefore, what we need to do is to find all generalized sequences of sequences in M ( C ( S ' ) ) at the proper level. The problem is that this process is very expensive in time. Second, one can first find the set S' of all strong sequences in M(C(S)) as what we did with "Min ima l Sequences". One can generate a set of candidates C which consists of all sequences in C(S) which are not a generalization of any sequence in 5". As one can image, this set would be much smaller than the set C(S) itself. 3. Implementations of other algorithms: Due to the time and scope l imitat ion of our thesis, we have not implemented several optimizations, including the algorithm of Prefix Sharing. We expect that the combination of Diagonalization and Prefix Sharing wi l l have better performance than Diagonalization itself. Chapter 6. Conclusions and Future Works 68 Whi le we cannot claim that the Diagonalization is necessarily the best approach to solving our problem, we believe that we have shown that this approach performs extremely well. The Diagonalization algorithm can be used for mining Association Rules as well. We expect that Diagonalization would optimize data mining algorithms for both Association Rules and Sequential Patterns if it is in a framework with conceptual hierarchies. Bibliography [I] R. Agrawal, T . Imielinski and A . Swami. Min ing association rules between sets of items in large databases. Proc. of ACM-SIGMOD Int. Conf. Management of Data, pages 207-216, 1993. [2] R . Agrawal and R. Srikant. Fast algorithms for mining association rules. Proc. of the Int. Conf. Very Large Data Bases, pages 487-499, 1994. [3] R. Srikant and R. Agrawal. Min ing generalized association rules. Proc. of the Int. Conf. Very Large Data Bases, pages 407-419, 1995. [4] R. Agrawal and R. Srikant. Min ing sequential patterns. Proc. of the Int'l Conference on Data Engineering, 1995. [5] R . Agrawal and R . Srikant. Min ing sequential patterns: Generalizations and per-formance improvements. Proc. of the Fifth Int'l Conference on Extending Database Technology, 1995. [6] J . Han, Y . Ca i , and N . Cercone. Knowledge discovery in databases: A n attribute-oriented approach. Proc. of Int'l Conf. on Very Large Data Bases, August 1992. [7] J . Han and Y . Fu. Discovery of multiple-level association rules from large databases. Proc. of Int'l Conf. on Very Large Data Bases, August 1995, pages 420-431. [8] H . Manni la , H . Toivonen, and A . I. Verkamo. Discovering frequent episodes in se-quences. First International Conference on Knowledge Discovery and Data Mining, 1995. [9] H . Manni l a and 1:1. Toivonen. Discovering generalized episodes using min imal occur-rences. Second International Conference on Knowledge Discovery and Data Mining, 1996. [10] H . Manni la , H . Toivonen, and A . 1. Verkamo. Discovering frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997. [II] A . Silberschatz, M . Stonebraker, and J . Ul lman . Database systems: Achievements and opportunities. SIGMOD Record, Comm. A C M , 34:94-109, 1991. 69 Bibliography 70 [12] A . Silberschatz, M . Stonebraker, and J . U l lman . Database research: Achievements and opportunities into the 21st century. SIGMOID Record, 25:52-63, March 1996.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Knowledge discovery of temporal databases
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Knowledge discovery of temporal databases Yi, Xiaoding 2000
pdf
Page Metadata
Item Metadata
Title | Knowledge discovery of temporal databases |
Creator |
Yi, Xiaoding |
Date Issued | 2000 |
Description | This thesis deals with a problem of mining sequential patterns from temporal databases. Data Mining is the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. A sequential pattern, is an ordered list (ordered by increasing event's time) of events. The main contribution of this thesis is the development of a general model of sequential patterns as well as algorithms for discovering such patterns. We investigate the generalized sequential pattern in which time isn't just part of the constraints but also a part of the discovered pattern. We have developed several algorithms to discover the generalized sequential patterns from a given temporal database. We have implemented four of these algorithms in C++. The results we obtained from applying these implementations to various datasets are matching our expectations. |
Extent | 2908634 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-07-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0051152 |
URI | http://hdl.handle.net/2429/10512 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2000-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2000-0322.pdf [ 2.77MB ]
- Metadata
- JSON: 831-1.0051152.json
- JSON-LD: 831-1.0051152-ld.json
- RDF/XML (Pretty): 831-1.0051152-rdf.xml
- RDF/JSON: 831-1.0051152-rdf.json
- Turtle: 831-1.0051152-turtle.txt
- N-Triples: 831-1.0051152-rdf-ntriples.txt
- Original Record: 831-1.0051152-source.json
- Full Text
- 831-1.0051152-fulltext.txt
- Citation
- 831-1.0051152.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051152/manifest