EFFICIENTLY DETERMINING A G G R E G A T E PROXIMITY RELATIONSHIPS IN SPATIAL DATA MINING B y Edwin M . Knor r B. M a t h . University of Waterloo A THESIS S U B M I T T E D IN PARTIAL F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F M A S T E R O F S C I E N C E in T H E F A C U L T Y O F G R A D U A T E STUDIES C O M P U T E R S C I E N C E We accept this thesis as conforming to the required standard T H E UNIVERSITY O F BRITISH C O L U M B I A August, 1995 © Edwin M . Knorr , 1995 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Computer Science The University of British Columbia 2366 Main Mall Vancouver, Canada V6T 1Z4 Date: Abstract This thesis deals with a nearest-neighbour problem. Specifically, we identify proximity relationships between a cluster of points and nearby features (polygons). Since points are often non-uniformly distributed within a cluster, and since the shapes and sizes of clusters and features may vary greatly, aggregate proximity is the level of expressive-ness we need. Additionally, we require that our algorithm be scalable and incremental. The main contribution of this thesis is the development of Algorithm C R H which uses encompassing circles, isothetic rectangles, and convex hulls to efficiently determine prox-imity relationships via successive convex approximations. Pointwise operations are then performed to compute the aggregate proximity statistics, and to rank the features. Our case study shows that an implementation of Algorithm C R H can examine over 50,000 features and their spatial relationships with a given cluster in less than 1 second of C P U time, delivering expressiveness and scalability. We also show that by using memoization techniques, Algorithm C R H provides for extremely efficient incremental processing. 11 Table of Contents Abstract ii List of Tables vi List of Figures vii Acknowledgement viii 1 Introduction 1 1.1 Data Mining 1 1.2 Spatial Data Mining 6 1.3 Problem Definition 9 1.4 Contributions 14 1.5 Outline of Thesis 15 2 Background and Related Work 17 2.1 Terminology 17 2.2 Types and Sources of Map Data 18 2.3 Cluster Generation 21 2.4 Spatial Database Systems 21 2.5 Spatial Indexing 22 2.5.1 Point Data Structures 23 2.5.2 Region Quadtrees 24 2.5.3 R-trees and Variants 25 in 2.5.4 Summary 26 2.6 Filter and Refine Strategies 27 2.7 Voronoi Diagrams 28 2.8 Alpha-Shapes 30 3 Evaluating Relevant Computational Geometry Algorithms 32 3.1 Distance Measurement Techniques 32 3.2 Usefulness of Polygon Intersection Tests 33 3.3 Efficiency 34 3.4 Matching Cluster-Feature Combinations with Effectiveness Techniques . 37 3.4.1 Results from the Cluster-Feature Combination Table 42 3.5 The Dependency Diagram 44 3.6 Filtering 47 4 Algorithm CRH 49 4.1 Use of Successive Approximations in Filtering 49 4.2 Thresholds 52 4.2.1 Reaching the Thresholds 54 4.3 Shape Enlargement 55 4.3.1 Linear Mode vs. 
Bisection Mode 56 4.4 Memoization 58 4.5 Computation of Aggregate Statistics 63 4.6 Summary of Processing 64 5 Case Study and Implementation 66 5.1 Acquisition of Data 67 5.1.1 Digitization 68 iv 5.1.2 Volume of Data 69 5.1.3 Choice of Clusters 70 5.2 Linear, Bisection, and Memoization Modes 71 5.3 Thresholds 72 5.4 Computation of Aggregate Statistics 73 5.5 Memoization 74 5.6 Performance 76 5.6.1 Scalability 77 5.6.2 Mix of Clusters and Features 81 5.6.3 Incrementality 85 6 Conclusions 87 6.1 Summary 87 6.2 Future Work 88 6.2.1 Abstract Features and Generalizations 88 6.2.2 Boundary Shape Matching 91 6.2.3 Other Computational Geometry Techniques 93 Bib l iography 95 Appendices 99 A Sample Outpu t 99 v List of Tables 3.1 Identifying Possible Cluster-Feature Relationships 38 3.2 Summary of Possible Relationships 39 5.3 The Effect of Bucket Sizes and Bucket Intervals on C P U Time during Circle Filtering, for an Input Dataset Containing 50,275 Features . . . . 76 5.4 Breakdown of C R H Processing Time (in seconds) for the Top 25 Features 80 5.5 Breakdown of C R H Processing Time (in seconds) for the Top 10 Features 80 5.6 Performance when Convex Hull Filtering is the Sole Filter 81 vi List of Figures 1.1 Model of Operation 9 1.2 Proximity Relationships 10 3.3 Shallow Non-Convex Cluster Close to Non-Convex Part of Feature . . . . 41 3.4 Dependency Diagram 45 3.5 Disjoint, Non-Convex Polygons 46 4.6 Encompassing Circle 50 4.7 Isothetic Rectangle 51 4.8 Convex Hull 51 4.9 Meeting the Threshold via Shape Enlargement 55 5.10 Ranking of Top 10 Features 73 5.11 Detailed Results for One of the Top 10 Features 74 5.12 Scalability of Algorithm C R H 77 5.13 Scalability of the Circle Filter 78 5.14 Example Using a 250 Hectare Cluster 82 5.15 Example Using a 40 Hectare Cluster 83 5.16 Example Using a 25 Hectare Cluster 84 6.17 Fissures in an Alpha-Shape 92 vii Acknowledgement I am indebted to Dr. Raymond Ng for his guidance throughout the course of my work, and for introducing me to some of the frontiers of database research. His comments concerning the initial drafts of this work were invaluable in helping me clearly define the scope, content, and structure of this thesis without getting lost in sea of detail. These technical writing skills will be of tremendous value to me in my future work. I would like to thank my second reader, Dr. David Kirkpatrick, for the time that he spent reading and reviewing my work. Dr. Kirkpatrick is well respected in the computational geometry community, and his comments and suggestions were very helpful. I would also like to thank Dr. Jack Snoeyink for introducing me to various compu-tational geometry literature and algorithms, and for providing the code for computing common tangent lines between disjoint polygons. Before starting this thesis, I had no idea what computational geometry was, or for that matter, what it was used for. A special thank-you goes to Michael McAllister and Vishwa Ranjan for their help in explaining various computational geometry and graphics concepts, and for introducing me to some of the software packages that were related to my work. Vishwa's on-going encouragement (beginning with my first few weeks at UBC) has been a blessing. I am grateful to Dr. Hans Schreier for allowing me to spend 4 days using UBC's GIS Laboratory in Resource Management (Environmental Studies). Thanks also to Yao Cui for giving me a crash course in digitization. 
Before beginning the Master of Science program, I spent a number of years as a Systems Programmer and a Database Analyst in large, corporate (mainframe) computing environments. The transition to academia and research, while difficult at times, has vm opened up a new window of opportunity for me. Although I have picked up many new skills, I acknowledge that there is still much to be learned. Indeed, the challenge continues. Finally, I would like to thank my Lord and Saviour, Jesus Christ, who made this Masters degree a possibility. ix Chapter 1 Introduction 1.1 Data Min ing Data mining, also called "knowledge discovery in databases" is a relatively new area of database research. Knowledge discovery is defined as the non-trivial extraction of im-plicit, previously unknown, and potentially useful information from data [Frawley et al, 1991]. The purpose of data mining is to discover relationships or patterns in data, with little human intervention. One typically associates the term "mining" with looking for valu-able minerals in a large body of ore. In data mining, the minerals are the patterns, and the ore is the huge data repository. Normally, the data being mined was originally collected for a different purpose than data mining, and was accumulated over a period of months or years. The patterns discovered are deemed to be interesting to a user, although the term "interesting" is highly subjective. As we shall see in the examples below, often a user already has a general question in mind (involving data relationships), but does not know which specific entities (if any) are related. The purpose of a data mining program is to find those relationships and present them to the user. It is clear that Database Management Systems (DBMS's) have a pivotal role to play in the access, integrity, and storage of data. Due to the exponential growth in the amount of data, and due to the new data types which have to be supported (e.g., satellite images, audio, video), it will be very difficult and time consuming for users to construct their own 1 Chapter 1. Introduction 2 queries to find interesting relationships among their data. This is especially true when there may be many patterns or anomalies in the data. Data mining programs attempt to remove the burden of having users explicitly define search criteria. Most DBMS's simply do not provide the capabilities to support the kinds of queries that may be asked in a data mining context. Furthermore, we often need to deal with approximations or uncertainty, either in the data itself, or in the kinds of statements that are generated. Data mining has attracted much corporate interest [Lewinson, 1994] [Frawley et al., 1991] For example, American Express targets certain subsets of its customers (e.g., based on income levels or on the frequency and amount of previous purchases) for specific promo-tions, so as to reduce the company's printing and mailing costs, and yet still maintain a reasonable response to the promotion. American Airlines uses data mining programs against its frequent flyer database to target customers for specific marketing promotions. Farm Journal mines its subscriber database to publish hundreds of customized issues for its diverse clientele. Insurance companies use data mining to analyze claim patterns. Banks use discovered patterns to enhance their predictions of loan defaults, and to speed up loan processing. 
Brokerage firms use data mining in technical stock analysis to try to forecast the range of stock fluctuations and the general trend of the market.

To clarify the concept of data mining, consider a well-known application, namely a supermarket's check-out system. The computerized bar code system is designed to identify the name and the price of the product being scanned, update the store's inventory (thereby facilitating reordering and shelf stocking), handle customer promotions (e.g., bonus points), handle volume discounts, etc. At first glance, it would seem that most of the data collected by such a system can be discarded fairly quickly since there can be tens of thousands of customers in a short period of time, and literally millions of grocery item purchases. The sheer volume of data suggests that no human would ever be interested in manually scrutinizing this data. Furthermore, organizations do not like storing large amounts of data for an extended period of time (unless there is some legal requirement or clear economic justification).

But the point is that much of this data can be quite valuable. Besides the immediate uses described in the preceding paragraph, this information can be used to provide executive summaries of sales, to highlight customer preferences, to figure out which items (or combinations of items) should be put on sale, or simply to acquire all kinds of marketing information. For example, a data mining system may be able to detect or identify interesting patterns such as the following:

• which items are frequently bought in combination (e.g., corn flakes and bananas; wieners, hot dog buns, mustard, and relish; hot salsa and antacid; pop, chips, and dip; cookies and peanut butter)

• which items are frequently bought by families (a family might be determined by the purchase of products that typically target children)

• which items are frequently purchased by people making small purchases (with "small" being defined by a dollar amount, or by the fact that a customer used the express check-out counter)

• how specials or promotions change the volume of sales for various items

• which items are more popular at certain times of the day, week, month, or year

Clever marketing strategies might be used to make sure that one item of a popular combination of items is always on sale, or that two popular items (say pop and snack foods) are located sufficiently far apart to ensure that customers must pass by other tempting items.

Note that a user might have a specific question in mind, such as, "Which items are frequently bought in combination?" This is an extremely difficult question to answer efficiently using a traditional DBMS, not only because of the problem in formulating the query, but because of the tremendous amount of processing required to evaluate the relationships between each of possibly tens of thousands of items. This is further complicated by the fact that an enormous number of transactions (collected over a period of many months or years) may have to be examined. Alternatively, a data mining system might use approximations, sampling, and confidence intervals in generating a response to the above query.
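To make the preceding point concrete, the following sketch (ours, not from the thesis; the function and variable names are invented for illustration) counts co-purchased item pairs over a random sample of transactions, in the spirit of the sampling and approximation idea just mentioned. Even this naive pair count must touch every transaction it samples, which hints at why such questions are expensive to answer exactly.

```python
from collections import Counter
from itertools import combinations
import random

def frequent_pairs(transactions, sample_rate=0.1, top=10, seed=0):
    """Count co-purchased item pairs on a random sample of transactions.

    `transactions` is a list of item-name lists; sampling trades a little
    accuracy for a large reduction in work.
    """
    rng = random.Random(seed)
    counts = Counter()
    for basket in transactions:
        if rng.random() > sample_rate:
            continue                      # transaction falls outside the sample
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts.most_common(top)

# Toy transaction log, purely for illustration:
log = [["corn flakes", "bananas", "milk"],
       ["pop", "chips", "dip"],
       ["corn flakes", "bananas"]]
print(frequent_pairs(log, sample_rate=1.0, top=3))
```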
Obviously, a data mining system depends quite heavily on the quality of the underlying data. After all, a user would like to have some degree of confidence in interpreting the results of the discovered knowledge. Various techniques have been proposed to deal with missing or corrupted data, including statistical techniques [Chan and Wong, 1991] [Holsheimer and Siebes, 1994] [Matheus et al., 1993].

Consider a data mining system which tries to find commonalities among patients having the same disease. The fact that some data is missing may in itself be quite relevant. For example, the absence of certain types of information such as hospitalization records, prescriptions, doctor visits, or laboratory tests may actually indicate that a patient was formerly in excellent health. This would be an important observation. On the other hand, if dietary habits are missing (everyone has to eat), then the data mining system has to have some way of handling the missing data without affecting the results, and hopefully without requiring manual intervention. We emphasize that application-dependent assumptions are often required in data mining applications.

While many of the uses of data mining do not directly affect specific individuals, we note that data mining has the potential for abuse, just as there would be with almost any computerized or manual system. There are some serious privacy concerns [O'Leary, 1991] [Rosenberg, 1992], especially when names are attached to the underlying data. For example, government organizations have used pattern matching techniques to go on "fishing expeditions" in citizens' private database records [Shattuck, 1984] [Kusserow, 1984]. Some of these types of database probes can associate specific people with insurance, tax, or welfare fraud. Similarly, names can be associated with pattern matches in just about any type of application (e.g., motor vehicle records, medical profiles, insurance claims, retail sales, financial records). Perhaps the most controversial aspect of data mining is the fact that interesting patterns can be brought to the attention of the user without explicitly providing search criteria, opening the door to a variety of possible surprises.

Data mining has tremendous potential, and is likely to be a valuable component of any strategic database system. The DBLearn system [Han et al., 1992] [Han et al., 1993] is a good example of a data mining system. A survey of various data mining systems can be found in Holsheimer and Siebes [Holsheimer and Siebes, 1994].

We conclude this section by noting some of the reasons why data mining is receiving a significant amount of attention in the database community:

• Users want to discover relationships or patterns in their data; however, users often cannot, or will not, formulate queries to extract the information. In traditional database systems, queries are written using a powerful and relatively easy-to-use language such as Structured Query Language (SQL).
Queries are written to return a specific set of data in accordance with search criteria carefully specified by the end-user or the applications programmer. In data mining, the idea is to design systems that are capable of efficiently detecting or identifying patterns, with minimal human intervention. • The amount of data in existing databases is already quite large, and we expect an exponential growth in the amount of data to be acquired and stored in the near Chapter 1. Introduction 6 future. This means that we need to create tools to efficiently analyze this data. • Currently, much corporate data remains untapped. This data could be viewed as a strategic asset that corporations can utilize to their competitive advantage. In other words, the profit motivation is there for users of data mining software (and obviously for the developers of those products as well). 1.2 Spatial Data Min ing A l l of the ideas described in Section 1.1 deal with non-spatial data (i.e., standard al-phanumeric data, with character strings and numeric values as attributes). Let us extend this idea one step further by including the spatial aspect. Spatial data mining is the branch of data mining that deals with "location" data, such as that represented in maps, photographs, satellite images, etc. In spatial data mining, we are interested in determining spatial relationships that exist between certain objects. For example, consider a map that contains a set of points, lines, and regions defining natural or manmade features such as rivers, lakes, parks, golf courses, schools, bridges, etc. Suppose further that one or more clusters of houses were identified on this map, representing specific residential areas. In this example, many of the houses in a particular cluster may be close to a park, beach, school, golf course, etc. Proximity information is one of the most important spatial relationships that exist among clusters and features. A spatial data mining system could be constructed to efficiently report which features are close (or closest) to a given cluster. Besides proximity, there are numerous spatial relationships that could be detected by a spatial data mining system. For example, if part of a cluster's boundary closely matches the pattern of part of the boundary of a distant feature, then this information may be worth reporting to the user. Consider a cluster of houses on a hill that slopes Chapter 1. Introduction 7 up from, and overlooks, a sheltered bay. Even though the houses may be quite far away from the shoreline, the relationship between the cluster and the bay could be quite significant from the user's point of view. Similarly, if part of a cluster hugs the boundary of an irregularly shaped golf course, then not only could the data mining system report the proximity information, but it could also report that some of the cluster's boundary actually matches some of the golf course's boundary. Ideally, the system should report the amount of "boundary matching" in quantitative terms, perhaps as a percentage of the perimeter of either the cluster or the feature. Additionally, we could encounter maps containing "abstract" features where a feature is defined by a qualitative — and often subjective — measure. These features may be represented in a manner similar to contour lines or isobars. 
In a real estate scenario, abstract features may describe a measure of a location's view or its "liveability" aspects (e.g., peacefulness, crime level, traffic congestion, access to amenities, commute time to the central business core).

Since a cluster does not necessarily have a uniform distribution of houses, another spatial relationship is the skewness or the density of a cluster. For example, a system might report that "91% of the houses in the cluster lie near the cluster's perimeter", or that "The density of the cluster is 20 houses per hectare".

Features can also be generalized into hierarchies [Han et al., 1992] [Lu et al., 1993]. For example, recreational facilities might include natural and manmade features. Man-made features might be grouped into indoor facilities (e.g., swimming pools, skating rinks, coliseums) and into outdoor facilities (e.g., playing fields, parks, golf courses). Weights could be attached to certain types of features. For example, depending on the user, recreational facilities such as parks, beaches, and golf courses might be ranked higher than industrial complexes. The concept of feature generalization is further discussed in Section 6.2.1.

Obviously, there are many ways of grouping features, and there are many possible combinations of weights. Many of these are subjective and will vary from user to user. Although we desire as little human intervention as possible in data mining, a case can be made for users to define preferences and the types of relationships that they are particularly interested in. This is appropriate since what is "interesting" to one user may be of no interest to another.

Of course, spatial and non-spatial data can co-exist in the same database. The DBLearn system has been extended to correlate non-spatial data with spatial data [Lu et al., 1993]. This is useful because many neighbourhoods, for example, have distinct characteristics. A given neighbourhood may consist primarily of families, singles, students, older adults, a certain ethnic group, a certain income class, or one of a host of other characteristics. It can be quite useful to associate these attributes with spatial entities.

Spatial data mining has many applications, such as in Geographic Information Systems (GIS's), oil and gas exploration, municipal planning, real estate, retailing, 2-dimensional picture analysis, etc. In this section, we have described various types of spatial relationships. Note that all of the spatial relationships described so far could be programmed into a data mining system. Obviously, the more relationships that a spatial data mining system can handle, the greater the value of such a system. We realize that such a general system is well beyond the scope of this thesis, and we confine our research to a reasonably well-defined problem (namely aggregate proximity), defined in the following section.

[Figure 1.1: Model of Operation. Diagram labels: Library of Maps, Features (Spatial Data), Clusters, Aggregate Proximity Relationships.]

1.3 Problem Definition

In this thesis, we deal with a GIS type of application, that is, an application involving map datasets. The problem that this thesis addresses is defined as follows. Given a cluster of data points (each with x and y co-ordinates), and a set of features (i.e., distinct polygons, whose vertices are defined by x and y co-ordinates), determine proximity relationships between the cluster and the features, and then rank and report the results.
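To make the input side of this problem statement concrete, here is one way the cluster and the feature set might be represented in code. This is an illustrative sketch of ours: the feature names echo examples used elsewhere in the thesis, but the co-ordinate values are invented and are not taken from the thesis's dataset.

```python
# One cluster: the (x, y) co-ordinates of each house, in map units (e.g., UTM).
cluster = [(489310.0, 5453020.0), (489355.0, 5453100.0), (489400.0, 5453065.0)]

# The feature pool: each polygon is a name plus its boundary vertices in order
# (the last vertex is implicitly joined back to the first to close the boundary).
features = {
    "Queen Elizabeth Park": [(489500.0, 5453000.0), (489800.0, 5453000.0),
                             (489800.0, 5453300.0), (489500.0, 5453300.0)],
    "Garbage Dump":         [(488900.0, 5452700.0), (489000.0, 5452700.0),
                             (489000.0, 5452800.0), (488900.0, 5452800.0)],
}
```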
Specifically, we deal with the problem of finding the aggregate proximity between a cluster of houses (points) and the boundary of various features. Aggregate proximity means that all of the points in the cluster are considered before arriving at a conclusion. Figure 1.1 provides a high-level data model for this process. We would like our data mining system to find the closest features to the majority of houses in the cluster. Specifically, we would like to select the top n features (where n is defined by the user) which minimize the sum of the distances from each house to each feature, ranking those n features from nearest to furthest (in terms of aggregate proximity). The houses may have been selected based on their size, age, current real estate market value, or demographic/census information — perhaps by an SQL query, an interactive front-end graphics program, a cluster analysis program, or some manual Chapter 1. Introduction 10 SCHOOL GARBAGE DUMP Figure 1.2: Proximity Relationships process. How the clusters were determined is not important, but their existence is. In order to describe our problem as clearly as possible, an example is appropriate. Consider Figure 1.2. Note that a number of features (as defined by their polygons) are close to the housing cluster (in terms of boundary to boundary distance). A house is denoted by an "x", and we assume that the houses specified are the only houses in the (single) cluster. In other words, we assume that the empty space within the housing cluster is vacant land. The features that are closest to the majority of the houses in the cluster will normally be of much more interest to the user. This is especially true when trying to give possible reasons for the existence of the cluster in terms of nearby features. This is why we consider aggregate proximity. For example, in Figure 1.2, the closest feature to the Chapter 1. Introduction 11 cluster is the garbage dump, but it is unlikely that that is the dominant reason for the cluster's existence. In fact, we note that very few houses are located near the dump. The fire hall is also very close to the cluster, yet there are not very many houses close to it. It is apparent from the diagram that the vast majority of the houses are located close to the golf course, lake, mall, and sports complex. The school appears to be the feature that is furthest from the cluster, yet more houses are located closer to the school than to the fire hall or the dump. The statements generated by the data mining program need to reflect this information, ranking the features accordingly. Of course, it would still be appropriate to cite the dump as being one of the features of interest, since the user likely wants to be informed about this. It is the data mining program's job to identify nearby features in terms of their aggregate proximity to the cluster. The user is presented with this information and can then accept or discard any part of that information. The program does not make any judgment about whether or not a golf course is more important than a mall, whether or not a sports complex is more important than a school, and so on. That is up to the user when interpreting the results. Since data mining typically involves a large amount of data, common sense tells us that it will be too time consuming to compare every point in the cluster with every feature in the dataset. We will be satisfied with possibly trading off some accuracy for the sake of efficiency. 
Given the choice of examining a large number of features using pointwise operations, and of examining a much smaller set of features that are likely to contain most of (or perhaps all of) the actual top n features, we would be satisfied with the latter approach. Under this approach, we have no guarantee that the top n features returned will actually be the n closest features in terms of aggregate proximity. Data mining often involves approximations and uncertainties, and we are prepared to accept some uncertainty in the interest of efficiency. In addressing our aggregate proximity problem, we define 3 specific goals that form Chapter 1. Introduction 12 the nucleus of this thesis: • expressiveness • scalability • increment ality Consider expressiveness. Although some level of expressiveness is inherent in the problem definition, we want to make sure that the output statements explicitly convey quantitative information. A statement such as "Most of the houses in the cluster are close to a golf course" is not really very interesting for a number of reasons. First of all, the term "most" does not convey a meaningful figure, since any figure between 51% and 100% would apply to this term. Secondly, "close" is a relative term. A numeric value in metres or kilometres is more appropriate. Thirdly, no mention is made of which golf course. There may be many golf courses in the area. The feature needs to be named specifically. For example, it would be much more interesting to note that "58% of the houses in the cluster are within 250 metres of South Hi l l Elementary School", or that "43% of the houses in the cluster are within 2 kilometres of the University of British Columbia". We desire expressiveness in a form similar to the following example: • 22% of the houses in the cluster are within 250 metres of Queen Elizabeth Park. • 46% of the houses in the cluster are within 500 metres of Queen Elizabeth Park. • 89% of the houses in the cluster are within 1 kilometre of Queen Elizabeth Park. • 100% of the houses in the cluster are within 2 kilometres of Queen Elizabeth Park. Chapter 1. Introduction 13 Note that this example (for one particular cluster-feature combination) explicitly lists the percentage of houses (hence, aggregate proximity), an arbitrary set of distance intervals,1 and the name of the relevant feature. Scalability simply refers to being able to handle large quantities of data without serious degradation in performance. Ideally, we would like to see approximately linear (or better) asymptotic growth. It is clear that quadratic (or worse) complexity is unsuitable for tens of thousands of features. Data mining, by nature, involves large datasets, so it is essential that we deliver scalability. Incrementality is also important because a user may wish to perform several consecu-tive runs on the same input to see more (or fewer) features listed in the output. It simply does not make any sense to reload and reprocess all of the data if most of it has already been examined (and will still form part of the answer set). In order to confirm that we can meet our goals of expressiveness, scalability, and incrementality, an implementation of a spatial data mining system was performed. This thesis deals with a specific application, namely a real estate or GIS scenario; however, it should not matter that the points are houses, it should not matter that a given polygon is a lake, and it should not matter who the intended audience is. 
These data mining techniques should be transferable to other applications. This thesis will attempt to generalize many of the concepts; however, concrete examples will be provided to make the statements relevant and informative. 1For this example, we chose 4 values that seemed reasonable in an urban setting. It makes sense to provide a spectrum of values (ranges), so that a user can see how the percentage of points varies with respect to distance. Of course, the values or attributes are application dependent. For example, a forestry application may have a cluster composed of trees, and the distance intervals are likely to be much larger. Percentiles could be used to provide some degree of application independence. Chapter 1. Introduction 14 1.4 Contributions As indicated in Sections 1.1 and 1.2, we view data mining as being a valuable extension to existing DBMS's , especially given the exponential rate of growth in the volume of existing and future data. It is clear that tools need to be developed to efficiently perform spatial data mining. Those tools need to be expressive in the results they return, scalable in order to efficiently handle large volumes of data, and incremental so that "reruns" with parameter changes can be supported efficiently. In particular, we deal with the problem of efficiently determining aggregate proximity relationships. Here is a preview of the work described in later chapters of this thesis: • We analyze the suitability of computational geometry techniques for our problem, and determine that no single distance algorithm or polygon intersection technique provides the answer. Instead, we argue that a multiple filtering approach is most appropriate because of its efficiency. • We develop Algorithm C R H (based on encompassing circles, isothetic rectangles, and convex hulls) to efficiently determine aggregate proximity relationships via successive convex approximations (and subsequent pointwise operations). • We use the concept of thresholds in Algorithm C R H to provide quantity control. This ensures that a minimum number of features are retained at any given stage of filtering. Additionally, memoization is performed to store: — the distance between the convex approximation (boundary) of the cluster and the convex approximation of each feature (a bucket approach will be used to store these values) — the result of hull intersection tests Chapter 1. Introduction 15 — the computed aggregate distance statistics The purpose of memoization is to avoid multiple passes through the data, and to facilitate incremental runs (should a user decide to change the thresholds after seeing the results of the program). • We describe an implementation involving clusters and features in the city of Van-couver. The performance results show that Algorithm C R H is scalable and in-cremental. In particular, our implementation examines over 50,000 features and determines (and ranks) the top 10 features — all in less than 1 second of C P U time. Finally, we note that although a number of computational geometry techniques are used and analyzed in this work, we do not attempt to come up with any new com-putational geometry algorithms from a theoretical perspective. Instead, we simply use computational geometry as one of the tools to generate the kind of performance results that database researchers are interested in. 1.5 Outline of Thesis Chapter 2 defines related work and gives a number of definitions used in the context of our work. 
Arguments are given for and against various computational geometry techniques. Spatial indexing is also discussed here. Chapter 3 elaborates on some of the topics discussed in Chapter 2, and identifies possible approaches to meeting our objectives. In Chapter 4, we describe Algorithm CRH and show how it meets our goals of expressiveness, scalability, and incrementality. A case study involving clusters and features in the city of Vancouver is presented in Chapter 5, thereby supporting the claims of Chapters 3 and 4. Finally, our conclusions are stated in Chapter 6, along with a number of suggestions for future work.

Chapter 2

Background and Related Work

This chapter provides the preliminaries necessary to define our research. We also discuss spatial indexing, Voronoi Diagrams, and alpha-shapes. Although these topics are certainly relevant to spatial database systems, we show that for the task at hand, none of these techniques provides the kind of expressiveness, scalability, and incrementality that we desire.

2.1 Terminology

To avoid ambiguity in the topics addressed by this thesis, we begin by defining the following terms. All definitions are for 2 dimensions only. Formal definitions are given in computational geometry texts [O'Rourke, 1994] [Preparata and Shamos, 1985].

Point A point is a unique 2-dimensional location represented by the ordered pair (x, y).

Cluster This is a set of points.

Convex Set This is the closed boundary of a set of points such that a line segment connecting any 2 points in the set always lies on or interior to the boundary.

Convex Hull This is the unique convex shape that has the least area of all of the convex shapes that enclose a set of points. Informally, if each point in the set were represented as a nail sticking out of a piece of wood, the convex hull could be determined by wrapping an elastic band around all of the nails.

Closed Curve This represents the boundary of a region in the plane. The shape defined by a closed curve may be convex or non-convex. We assume that the curve is defined by a finite number of line segments (edges) connected end-to-end.¹ The line segments form a cycle, and we assume that non-adjacent line segments do not intersect.

Polygon A polygon is represented by a closed curve consisting of k edges and k vertices. We define a polygon as the union of that boundary and the region interior to the boundary.

Medoid This is the most centrally located point within a cluster. Let S be a set of points. The medoid is defined as the point $m \in S$ minimizing $\sum_{p \in S} dist(m, p)$, where $dist(m, p)$ represents the distance between points m and p.

Centroid The centroid of a finite set of points is the arithmetic mean of those points. Specifically, if each of n points is represented in the form $(x_i, y_i)$, then the centroid is defined as $\left( \frac{1}{n} \sum_{i=1}^{n} x_i, \; \frac{1}{n} \sum_{i=1}^{n} y_i \right)$.

¹ A circle or an ellipse is actually a closed curve even though it is not composed of a finite collection of line segments. We would approximate such a curve by a series of vertices defining the endpoints of consecutive line segments.

2.2 Types and Sources of Map Data

One of our first concerns is to define the type of maps that we will be dealing with. One possibility is to acquire paper maps (of which there are many) and then generate digital versions of those maps.
For example, the City of Vancouver (and various municipal, provincial, and federal departments) may have numerous maps describing schools, parks, apartment buildings, shops, golf courses, bridges, cemeteries, heritage sites, beaches, roads, sewers, soils, vegetation, rainfall, temperature, population, airports, etc. A second possibility is to acquire the data in digital form from one or more GIS's. Although a large amount of GIS data has been accumulated, our experience has shown that much geographic data is still in paper form. Of the digital data available, much of it is proprietary, and there are many data formats used by such systems. Some GIS's deal with smooth curves, whereas others deal with vertices and edges. Furthermore, some GIS's will download the endpoints (vertices) of line segments, but will not separate adjacent features that share one or more line segments as part of their respective boundaries. For example, if 2 features to be downloaded have any line segment(s) in common, and we wish to obtain a separate set of vertices for each feature, then we must download each feature separately (i.e., perform 2 executions of the download process). If we do not perform multiple downloads, then we will have to manually separate the single set of vertices that would result (taking care to make sure that each feature has a complete set of vertices). Although there are many GIS's on the market, few provide true interoperability with respect to data formats or data access. Peuquet and Marble describe many of the pop-ular systems [Peuquet and Marble, 1990]. Each GIS may have certain strengths. For example, some systems may have been optimized to draw maps, to perform polygon intersections, or to handle certain types of well-defined, user-specified spatial queries [Laurini and Thompson, 1992]. Many maps that are accessible by computer are stored in free format whereby a feature is merely represented by a collection of x and y co-ordinates. Our techniques need to deal with map data represented in this way. In our work, we do not cater to any Chapter 2. Background and Related Work 20 specific GIS. In fact, since much map data still exists outside a GIS, and since the data from one GIS may not necessarily be usable on another GIS, we will deal with ASCII data, which is the common denominator among most systems.2 The map that we digitized for our case study (to be presented in Chapter 5) was a Universal Transverse Mercator (UTM) projection. Informally, a U T M unit on a map equates to an interval of approximately 1 metre on the ground (latitude or longitude). We emphasize the word "approximately" for a number of reasons: • A l l projections have inaccuracies since a curved portion of the Earth's surface has to be projected onto a flat plane. • The Earth is 3-dimensional. We are dealing with (x,y) co-ordinates, rather than (x, y, z) co-ordinates, so inaccuracies are bound to exist in our distance calculations. • Distances are based on a given reference point. Maps that use different reference points may not yield the same U T M values, although a change of local reference points is likely to yield very minor differences. U T M co-ordinates are a convenient way of describing locations, and for performing ap-proximate distance calculations. For example, the point in Vancouver having U T M loca-tion (489300,5453000) is easily understood to be 99 metres away from (489399,5453000). 
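As a small worked example of this arithmetic (a sketch of ours, not code from the thesis), the approximate ground distance between two UTM points is simply their Euclidean distance in UTM units:

```python
from math import hypot

def utm_distance(p, q):
    """Approximate ground distance (in metres) between two UTM points,
    treating one UTM unit as roughly one metre on the ground."""
    return hypot(p[0] - q[0], p[1] - q[1])

# The example from the text: two points differing by 99 UTM units in x.
print(utm_distance((489300, 5453000), (489399, 5453000)))            # 99.0
# A diagonal pair, to show it is ordinary Euclidean distance:
print(round(utm_distance((489300, 5453000), (489330, 5453040)), 1))  # 50.0
```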
The digitization process (to be described in Section 5.1.1) allowed us to generate (x,y) co-ordinates to within one nanometre. For the reasons given above, accuracy to a nanometre is meaningless. Also, the accuracy of digitization depends heavily on the care of the person performing the digitization (e.g., steadiness of the hand, making allowances for the width of lines (roads), etc.) as well as the care taken in interpreting 2 Actually, a large amount of topographic data for our preliminary research was stored in E B C D I C format on an M T S mainframe. We converted it to ASCII in order to make it usable on a U N I X system. Chapter 2. Background and Related Work 21 the map. Finally, the reader should note that the construction of the underlying map may be subject to some error. For example, how does one determine the exact boundary of a lake, hill , river, etc., especially when erosion takes place or when a more accurate benchmark is discovered that may alter previous location values? 2.3 Cluster Generation Although there are many ways of generating clusters, one way is through the statistical technique called cluster analysis [Kaufman and Rousseeuw, 1990]. An application of this principle to spatial data mining is found in the C L A R A N S algorithm [Ng and Han, 1994]. In their paper, Ng and Han provide a Vancouver real estate scenario in which 3 distinct clusters were found: one around Queen Elizabeth Park, one along the city's southwestern shoreline, and one in the downtown core. (The clusters were comprised of housing units falling into well-defined size and price ranges.) How the clusters were generated is beyond the scope of this thesis. We assume that a cluster has been provided to the data mining program as a set of points — the points being defined by (x,y) co-ordinates. The cluster points are very relevant to this thesis because we deal with aggregate proximity relationships. 2.4 Spatial Database Systems We have already been introduced to GIS's; however, it is important to realize that GIS's are complex applications with diverse functionality [Peuquet and Marble, 1990] that use spatial database systems for their data management functions. GIS's are perhaps the prime motivator for research into spatial DBMS's . A spatial D B M S differs from a traditional D B M S in the following ways [Guting, 1994]: • it supports spatial data types (e.g., points, lines, and regions) Chapter 2. Background and Related Work 22 • it supports spatial queries (e.g., "area > 5") using techniques such as plane-sweeps, approximations (e.g., minimum bounding boxes), and stored unary function values • it has efficient algorithms for performing spatial joins • it provides spatial indexes for: — points — e.g., grid files [Nievergelt et al, 1984] and k-d trees [Bentley, 1975] — regions — e.g., R-trees [Guttman, 1984], quadtrees [Hanan, 1990a] [Hanan, 1990b], and filter and refine strategies [Brinkhoff et al, 1993] This thesis does not deal with all of the above characteristics of spatial database systems since we are not concerned with the development of a spatial D B M S . Instead, we focus on one particular aspect of spatial analysis, namely the determination of ag-gregate proximity relationships between features (regions) and the points in a particular cluster (region). The determination of aggregate proximity relationships is not currently supported by any spatial D B M S . 
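To fix ideas before surveying these techniques, here is a deliberately naive, brute-force sketch of the aggregate proximity computation itself. This is not Algorithm CRH, which is designed precisely to avoid this all-pairs work; the function names and the nearest-vertex simplification are ours, for illustration only.

```python
from math import hypot

def point_to_feature(p, vertices):
    """Crude point-to-feature distance: distance to the nearest boundary
    vertex. (The thesis measures distance to the boundary itself; nearest-vertex
    is used here only to keep the sketch short.)"""
    return min(hypot(p[0] - x, p[1] - y) for (x, y) in vertices)

def rank_by_aggregate_proximity(cluster, features, n=10):
    """Rank features by the sum of point-to-feature distances over all cluster
    points, nearest first.  The work grows with
    (cluster points) x (features) x (vertices per feature)."""
    totals = {name: sum(point_to_feature(p, verts) for p in cluster)
              for name, verts in features.items()}
    return sorted(totals, key=totals.get)[:n]

def distance_bands(cluster, vertices, bands=(250, 500, 1000, 2000)):
    """For one feature, report the percentage of cluster points within each
    distance band -- the expressiveness format described in Section 1.3."""
    dists = [point_to_feature(p, vertices) for p in cluster]
    return {d: 100.0 * sum(x <= d for x in dists) / len(dists) for d in bands}
```

The cost of this baseline over tens of thousands of features is exactly the scalability problem that the filtering approach of the following chapters is meant to address.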
The following sections discuss some of the popular techniques used to address issues concerning spatial data storage and point/region queries. We aim to see which of these techniques, if any, can support aggregate proximity requests and satisfy the goals of expressiveness, scalability, and incrementality. 2.5 Spatial Indexing Since we are dealing with spatial data, and since GIS applications typically require spatial indexing techniques [Laurini and Thompson, 1992], it makes sense to see if spatial indexes provide the answer to the objectives described in Section 1.3. Chapter 2. Background and Related Work 23 2.5.1 Point Data Structures We begin our examination of spatial indexing techniques by briefly considering some point data structures. A k-dimensional binary search tree (k-d tree) [Bentley, 1975] recursively partitions the data space into orthogonal subdivisions.3 At each level of a k-d tree, a test can be made to determine the direction of branching. For a 2-d tree, the test involves either the x direction or the y direction. The directions alternate at each level. For example, x co-ordinate values can be tested at any even level (beginning with the root at level 0), and y co-ordinates can be tested at any odd level. The cost of constructing a k-d tree is O(nlogn), where n is the number of points. The worst-case cost for a range search in a complete 2-d tree is 0(y/n) [Lee and Wong, 1977]. Point quadtrees [Finkel and Bentley, 1974], like k-d trees, are recursive hierarchical structures. Point quadtrees require 0(nlogn) construction time (where n is the number of points), and they have a worst-case range search cost of 0(y/n) (for a complete 2-d point quadtree) [Lee and Wong, 1977]. A k-d tree often outperforms a point quadtree because a k-d tree requires fewer pointers per node and has a reduced branching factor [Hanan, 1990a]. Range-trees were designed to efficiently determine all points that lie within a given range of a point [Hanan, 1990a]. In contrast to point quadtrees and k-d trees, range-trees produce the fastest query search times (O((log n) 2)) [Preparata and Shamos, 1985]; however, they require O(nlogn) construction time, and have O(nlogn) storage require-ments. Although some of the literature about point data structures emphasizes efficient region searches, those region searches are actually range queries that simply return a set of points 3 The orthogonality restriction can be relaxed in variants of the k-d tree. Chapter 2. Background and Related Work 24 falling into some specific range (e.g., all points within a radius r of a given point). While point data structures have value in many spatial applications, they are not well suited to our aggregate proximity problem. We need to deal with regions, rather than points. 2.5.2 Region Quadtrees Samet provides a good survey of spatial indexing techniques, especially the various types of quadtrees [Hanan, 1990a] [Hanan, 1990b]. In this section, we limit our discussion to quadtrees that can support regions. Quadtrees are hierarchical data structures that recursively split a 2k x 2k image (area) into 4 squares (northwest, northeast, southeast, and southwest quadrants). Consider a square region that is arbitrarily placed in the image, and which is to be represented in a quadtree. This region may be represented by a single node or perhaps by many nodes, depending on its position in the image. 
The recursive decomposition of a region continues until the part of the region that falls into a given quadrant (at some level in the quadtree) totally occupies that quadrant.4 Quadrants that are totally disjoint from the region are excluded from further consideration. In general, an arbitrarily placed polygon can require hundreds, if not thousands, of nodes in its quadtree representation. Each node requires storage space for its pointers (up to 7 such fields). Since we expect to deal with tens of thousands of polygons in our problem, we realize that a very large amount of storage may be required to index our map. Also, since the cost of constructing a quadtree is 0(n log re), where re is the total number of nodes in the quadtree [Hanan, 1990a], we have concerns about scalability. Furthermore, finding the nearest m neighbours in a quadtree is not a trivial task because the nodes in the quadtree are not necessarily stored in an order conducive to 4There are 22k pixels in a 2k x 2k image. The lowest level in the quadtree is level zero, at which point a node corresponds to a single pixel. No further decomposition is possible. By "totally occupies", we mean that there is no pixel in the quadrant that does not contain part of the region. Chapter 2. Background and Related Work 25 range searches. For example, two features that are adjacent to one another on a paper map may be widely separated (physically) in a quadtree. 2.5.3 R-trees and Variants R-trees [Guttman, 1984] and their variants are hierarchical structures that deal with approximations of spatial objects using rectangles. Spatial objects such as regions in 2-d space cannot be satisfactorily handled by per-forming indexing on points (individual tuples). R-trees were specifically developed to sup-port the indexing of spatial data in a database system [Guttman, 1984] [Giiting, 1994]. An R-tree is a height-balanced multiway tree. Each isothetic rectangle contained in a leaf node approximates a particular spatial object in the database. Each internal node contains an isothetic rectangle that encompasses all of the rectangles in the next level of the hierarchy. The problem with R-trees is that many rectangles may overlap (intersect). As more and more rectangles intersect, searching becomes less efficient since a very large number of nodes may need to be examined in the data structure. (Like B-trees, searching begins at the root node; however, many more subtrees (paths) may need to be visited.) R+-trees [Stonebraker et al, 1986] [Sellis et a/., 1987] correct this problem by clipping rectangles which intersect at the same internal node level. By clipping all such rectangles (other than those at the leaf level), retrieval time is improved. Unfortunately, many rectangles may be duplicated at the leaf level. The height of the tree is likely to increase, and furthermore, some of the good performance characteristics associated with B-trees (especially with respect to secondary storage access) may be lost since many index pages may now be less than half full (unless costly insertion and deletion routines are invoked [Hanan, 1990a]). Of course, if the index pages reside in main memory, then sparse index nodes are less of Chapter 2. Background and Related Work 26 a concern. To solve some of the performance problems associated with R-trees and R +-trees, R*-trees [Beckmann et a/., 1990] were developed. R*-trees can support both point and region data. 
They can outperform R-trees (and their variants) largely due to the "forced reinsertion" of (possibly many) entries during an insert operation. In other words, reor-ganization is dynamic. R-trees suffer from the order in which the rectangles are inserted. "Bad" rectangles early on in the building phase can seriously affect the structure of the tree, thereby hurting retrieval times [Beckmann et al., 1990]. R*-trees get around this data distribution problem by reducing the area, perimeter, and overlap of rectangles (in internal nodes). Unfortunately, all of the variants of R-trees still have an 0(n2) worst-case intersection time, where n is the number of rectangles [Hanan, 1990a]. This poses a scalability con-cern since we expect to handle tens of thousands of features in our aggregate proximity problem. 2.5.4 Summary Having considered k-d trees, point quadtrees, range-trees, region quadtrees, and R-trees, we have some concerns about their applicability to our problem. As mentioned in Section 2.5.1, although point data structures nicely support range queries involving points, they are of little value to us when dealing with regions (features). In Section 2.5.2, we noted that region quadtrees consume a large amount of storage, and that it is not clear how efficiently range searching among regions can be performed (or how efficiently aggregate proximity calculations (involving points) can be done). Section 2.5.3 indicated that R-trees and their variants suffer from an 0(n2) worst-case range search. Scalability in a concern in all of the spatial indexes that we have examined, not Chapter 2. Background and Related Work 27 only with respect to possible search times, but with respect to the construction of the underlying data structures. If the spatial indexes (e.g., region quadtrees or R-trees) already exist, then it makes sense to try to use them. Unfortunately, there are literally thousands of maps that our spatial data mining system can draw from. We do not expect that spatial indexes exist for all of these, and we do not believe that it is prudent to construct (and store) a spatial index for each such map. After all, we expect many maps to be represented in free format (ASCII flat files), whereby the boundary of each feature is denoted by a collection of points. Even if a spatial index exists for a particular map at another site (say site B), the index will be of no use to us at site A if: • the sites have different database systems. This is a legitimate concern since het-erogeneous maps exist on heterogeneous GIS's. • the data downloaded is dependent on site B's storage. It is very likely that the index will have to be constructed from scratch at site A . Finally, we note that spatial indexes are useful when the same map is used many times; however, in spatial data mining, we do not know how often a given map will be reused (if at all). Spatial data mining is performed infrequently (probably not even weekly), and we believe that the cost of building and maintaining spatial indexes is not justified. 2.6 Filter and Refine Strategies In the first stage of filter and refine processing [Brinkhoff et al, 1993], the entire search space is filtered and a list of regions (a superset of the answer set) is returned. Brinkhoff et al. approximate a region by using 1 of 6 convex structures: isothetic rectangles, Chapter 2. 
Background and Related Work 28 minimum bounding rectangles, minimum bounding circles, minimum bounding ellipses, convex hulls, and minimum bounding n-corners.5 The particular bounding structure chosen depends on the application. Based on their studies, Brinkhoff et al. indicate that the minimum bounding ellipse or the minimum bounding rectangle is the "best" choice when storage requirements and "approximation quality" (accuracy) are equally weighted. Whenever a spatial object is inserted or updated in the (main) database, an approximation is also constructed and stored in a database. The second stage is the refinement stage, where the exact representation of each object is examined. This is the most expensive stage of processing, and this can be a problem if the number of candidates returned from stage one is even moderately large (e.g., 50-100). Scalability is not clear, especially if n-corner polygons are used, or if tens of thousands of features need to be examined.6 Furthermore, as indicated in Section 2.5.4, unless there is a need to frequently reuse the feature pool, it is undesirable to construct and maintain any kind of spatial index (or in this case, one or more sets of approximations for each spatial object). 2.7 Voronoi Diagrams A good overview of Voronoi Diagrams and their important dual structure called the De-launay Triangulation (which encapsulates similar information but in a different form) can be found in computational geometry texts [O'Rourke, 1994] [Preparata and Shamos, 1985], 5The authors experiment with 4-corner and 5-corner polygons, and mention that they used the convex hull as the starting point to compute these minimum bounding n-corners. They compute the minimum bounding ellipse by using a randomized algorithm that has an expected linear time complexity. 6Their paper does not give timing results. Instead, they present performance results in terms of block accesses on secondary storage. Chapter 2. Background and Related Work 29 or in specialized survey papers [Aurenhammer, 1991] [Okabe et ai, 1992b]. A more tech-nical and thorough treatment is given by Okabi et al. [Okabe et al, 1992a]. Voronoi Dia-grams have numerous applications in computer science including associative file searching (nearest neighbour problems) and cluster analysis. Given a set S of n points in the plane, a Voronoi Diagram partitions the plane into n regions such that each point pi £ S is associated with the region Ri of the plane closest to that point. This allows us to determine the nearest point p £ S from any point q in the plane7 in O(logn) time. The cost of constructing the Voronoi Diagram is O(nlog n). This application can be generalized to an order-k Voronoi Diagram to allow us to find the k nearest points pi,P2, • • • ,Pk {pi £ S) from any point q in the plane. An order-& Voronoi Diagram can be constructed in 0(k2nlogn) time (where n is the num-ber of points in S), and the k nearest neighbours can be found in O(logn + k) time [Preparata and Shamos, 1985].8 The storage required is 0{k2n) [Aurenhammer, 1991]. Instead of generating a Voronoi Diagram using a set of points, let us now consider using a set of areas (regions). Given a set S of n disjoint features (e.g., parks) in the plane, a Voronoi Diagram can be constructed to partition the plane into n regions such that each feature F{ £ 5" is associated with the region Ri of the plane that is closest to it. This particular application would allow us to determine the nearest feature from any point q in the plane. 
Furthermore, if we use an order-k Voronoi Diagram involving areas, then this would provide a way of solving our aggregate proximity problem. For example, since we are satisfied with an approximation of the m closest features overall (in terms of aggregate proximity), we can simply find the nearest m features to each point in the cluster, add up the number of times that a particular feature was identified, and then rank the features based on the number of "votes" that they received.

[Footnote 7: This situation describes the well-known Post Office Problem.]
[Footnote 8: The entire family of order-k Voronoi Diagrams (k = 1, 2, ..., n-1) can be obtained in O(n^3) time [Preparata and Shamos, 1985].]

Unfortunately, the time and space complexities for an order-k Voronoi Diagram are unacceptable if tens of thousands of features are involved, and if the value of k is even moderately large (e.g., 50-100). Furthermore, since the user may increase the value of k substantially during incremental processing, we need to provide for efficient incrementality. While Voronoi Diagrams are very useful for certain spatial database queries, we conclude that our spatial data mining problem is not well suited to a Voronoi framework.

2.8 Alpha-Shapes

Given a dense cluster of points, most humans can easily visualize a (possibly non-convex) polygon whose shape "matches" the cluster. While the shape of such a polygon may be intuitive to humans, it is undefined to machines. An alpha-shape [Edelsbrunner et al., 1983] [Kirkpatrick and Radke, 1985] can closely hug the "shape" of a non-convex cluster. While a cluster has a unique convex hull, there is a spectrum of alpha-shapes to consider. Each alpha-shape is defined by a different alpha-value, and hence each alpha-shape has a different "closeness of fit". Edelsbrunner et al. define the convex hull of a set of points to occur at α = 0. The minimum bounding circle results when α = 1/r, where r is the radius of the set of points. Larger α-values produce cruder shapes for a given set of points. Conversely, negative α-values provide finer resolutions. An alpha-shape is computable in O(n log n) time, where n is the number of points in the set.

Although alpha-shapes are not used in our work, we believe that it is appropriate to mention them at this point because they can describe the shape of the cluster that we are dealing with. In this thesis, we strictly deal with convex approximations for the boundary of the cluster. Extensions to this work may include the use of alpha-shapes, not only for generating a polygon for the cluster, but to facilitate boundary matching between the cluster and the features. We defer further discussion of alpha-shapes to Section 6.2.2.

Chapter 3
Evaluating Relevant Computational Geometry Algorithms

Although Chapter 2 introduced related work and explored various computational geometry techniques, we use this chapter to extend the discussion and to solidify our understanding of how clusters and features need to be handled. This will provide the framework for Algorithm CRH which is formally documented in Chapter 4.

3.1 Distance Measurement Techniques

The determination of aggregate proximity requires point-to-feature calculations for each point in a cluster. There are various types of measurements that can be performed.
For example, we could calculate the distance from a point to: • the centroid of a feature • the boundary of a feature • a vertex of a feature We could also characterize a cluster by a single point, namely its medoid. Of course, simply using the medoid will not yield an aggregate proximity value; however, we note that a medoid-to-feature distance may be useful for approximations. When dealing with urban areas, a case can be made for using Manhattan distances to determine the walking distance to a park, beach, university, etc.; however, the difference 32 Chapter 3. Evaluating Relevant Computational Geometry Algorithms 33 between Manhattan and Euclidean distances is not likely to be significant. We restrict our discussion to Euclidean distances. 3.2 Usefulness of Polygon Intersection Tests Eventually we have to perform individual point-to-feature calculations; however, in this section, we explore the idea of treating the cluster and the features as individual collec-tions of points (i.e., polygons) to see how appropriate region-to-region operations (e.g., intersection, separation) might be for the task at hand. At the very least, we will be able to see how polygon intersection tests can be used to determine possible proximity relationships. We begin our discussion by noting that clusters and features fall into various cate-gories: 1. A feature is inside the cluster. For example, a cluster may surround a small lake. 2. The cluster is inside a feature. For example, a cluster may appear in the middle of a forest. 3. A number of features are adjacent to one another. Consider a contiguous region containing a beach, a forest, a university, and a golf course. Houses may appear near any or all of these features. We will treat each of the adjacent features separately. 4. The features are non-adjacent; however, a cluster may be in close proximity to several features. 5. The cluster is not very close to any particular feature. The first and second categories can easily be handled by intersection tests. In each case, there is a 100% overlap for 1 of the 2 polygons because 1 polygon is contained Chapter 3. Evaluating Relevant Computational Geometry Algorithms 34 within another. In fact, for the second category, we note that all of the houses in the cluster are contained within the boundary of the feature, thereby eliminating the need for any point-to-feature distance calculations for this particular case (because the distance is simply defined to be zero). The third and fourth categories can fail a polygon intersection test. For example, when a cluster borders a straight road, and a park is on the other side of the road, then there will be no intersection to report. Fortunately, it is possible to detect an intersection if we introduce a buffer zone (e.g., by enlarging the cluster). Then, if part of the feature intersects the enlarged cluster, we will be able to detect a possible spatial relationship. Similarly, we can approximate the cluster and the features by convex polygons such as circles, rectangles, or convex hulls. A high percentage of overlap between a cluster's approximation and a feature's approximation may suggest that there is a proximity relationship, a small percentage of overlap may suggest the possibility of a proximity relationship, and no overlap may suggest the absence of a proximity relationship. The fifth category will likely fail a polygon intersection test. A small buffer zone is unlikely to help. 
We may have to enlarge the cluster (or each feature) substantially in order to find any features which intersect the cluster.

We conclude that polygon intersection tests have some merit, even though there are bound to be some exceptions. In the following sections, we examine the trade-offs between the effectiveness and efficiency of various approximations (and intersection tests), especially in comparison to the pointwise distance techniques mentioned in Section 3.1.

3.3 Efficiency

We now consider the efficiency of the distance and intersection techniques to see how suitable they are in a spatial data mining context. Efficiency is determined by time complexity. Although a lot of urban features are quite simple (e.g., rectangular regions for most schools and many parks), it is difficult to come up with a concrete comparison of efficiency, especially since many features and clusters vary greatly in shape and size. The time complexity for computing each of our region approximations is as follows, where n represents the number of points:

• Encompassing Circle — O(n)
• Isothetic Rectangle — O(n)
• Convex Hull — O(n log n) [1]

Although there are many other types of regions to consider, such as alpha-shapes, minimum bounding boxes, minimum bounding ellipses, minimum bounding circles, etc., we narrow our focus to the "simpler" geometric constructs. This is not to say that other geometric constructs will not be of value. An exhaustive evaluation of the various possible techniques is a lot of work, and it is not clear that any of the less efficient techniques will offer any significant advantage over the 3 region approximations proposed above.

In the following efficiency rankings, we assume that there are n points in the cluster and n_H points in its hull, and that there are m points in a feature and m_H points in its hull. Sometimes the medoid is available without incurring any expense, which is the case with a clustering algorithm like CLARANS [Ng and Han, 1994]; otherwise, it would have to be computed (the time complexity is O(n^2), where n is the number of points in the cluster).

[Footnote 1: In our work, each feature is a simple polygon (i.e., non-consecutive edges do not share a vertex), whose vertices are already ordered. The convex hull of a simple polygon with n vertices can be constructed in O(n) time [Preparata and Shamos, 1985]. A cluster, on the other hand, is composed of many points, not just its boundary points.]

Here are the cluster-feature distance and intersection techniques, provided in an approximate order of decreasing efficiency. We assume that the circles, rectangles, and hulls are known.

1. Constant Complexity
• circle intersection
• rectangle intersection
• distance from a single cluster point to a feature's centroid — if the centroid is known

2. Logarithmic Complexity
• distance from a single cluster point to a convex feature's boundary — O(log m_H)
• convex hull intersection — O(log n_H + log m_H)
• distance from a convex cluster boundary to a convex feature boundary — O(log n_H + log m_H)

3. Linear Complexity
• distance from a single cluster point to a feature's centroid — if the centroid is not known
• distance from a single cluster point to a non-convex feature's boundary
4. Higher Complexity
• distance from all cluster points to a convex feature's boundary — O(n log m_H)
• distance from all cluster points to a non-convex feature's boundary — O(nm)

We ignore complexity constants, although we acknowledge that they may be relevant depending on the mix of features. For example, m and m_H can vary greatly, but will often be quite small. This means that a substantial amount of code could be invoked just to deal with 2 relatively simple 4-point polygons for some of the logarithmic cases. Optimizations can be performed during implementation to test for some of these cases.
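As an illustration of the constant-complexity tests at the top of the ranking above, the following sketch (with our own illustrative types, not the thesis's code) shows why each test amounts to a handful of comparisons: two encompassing circles intersect when the distance between their centres does not exceed the sum of their radii, and two isothetic rectangles intersect when their extents overlap on both axes.

    #include <math.h>
    #include <stdio.h>

    typedef struct { double cx, cy, r; } Circle;                /* encompassing circle */
    typedef struct { double xmin, ymin, xmax, ymax; } Rect;     /* isothetic rectangle */

    /* Constant time: one distance computation and one comparison. */
    int circles_intersect(Circle a, Circle b) {
        return hypot(a.cx - b.cx, a.cy - b.cy) <= a.r + b.r;
    }

    /* Constant time: the extents must overlap on both the x and y axes. */
    int rects_intersect(Rect a, Rect b) {
        return a.xmin <= b.xmax && b.xmin <= a.xmax &&
               a.ymin <= b.ymax && b.ymin <= a.ymax;
    }

    int main(void) {
        Circle cluster = { 0.0, 0.0, 500.0 }, park = { 900.0, 0.0, 300.0 };
        Rect   c = { -400, -400, 400, 400 },  p = { 450, -100, 700, 150 };
        printf("circles intersect: %d, rectangles intersect: %d\n",
               circles_intersect(cluster, park), rects_intersect(c, p));
        return 0;
    }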
3.4 Matching Cluster-Feature Combinations with Effectiveness Techniques

In this section, we choose 8 representative techniques (5 distance techniques and 3 polygon intersection techniques), and perform an analysis to see how well they apply to various cluster-feature combinations. These 8 techniques are those we have introduced in Sections 3.1 and 3.2. They are by no means exhaustive; however, they will provide some insight into how to approach our problem. Table 3.1 lists various situations (i.e., combinations of clusters and features, including general shapes and sizes), and it indicates which techniques will be able to identify the "interesting" situations, that is, situations in which the cluster is close to the feature [2]. Efficiency is not considered. If size or shape is not mentioned, it can be assumed that those attributes are unimportant for that particular situation, or that there are simply too many possible boundary shapes to categorize. An example of the latter situation occurs when a non-convex cluster is facing a nearby non-convex feature.

We have deliberately avoided attaching numerical values to the results shown in Table 3.1 since the situations cannot reasonably be defined in absolute terms. We view Table 3.1 as a starting point to see what general types of situations can be encountered, and which techniques are best suited to those situations.

[Footnote 2: We assume that the absence of a (close) proximity relationship implies that the situation is not interesting from a data mining perspective. We also realize that in some cases, the absence of a relationship is what the user is interested in (e.g., maximum distance from a prison or a garbage dump). In this thesis, we assume that close proximity is desired, and we therefore confine our results to this case.]

    Cluster-Feature Combination (Situation)                 Technique
                                                        1  2  3  4  5  6  7  8
    C and F are Far Apart                               n  n  n  n  n  n  n  n
    C and F Border Each Other                           p  y  y  y  y  p  p  p
    C and F are Small and Close                         y  y  n  n  n  y  y  y
    C is Small, F is Large; Close                       y  y  p  p  p  n  y  n
    C is Large, F is Small; Close                       p  p  p  p  p  p  n  p
    Shallow Convex C Close to Non-Convex Part of F      y  y  p  p  p  p  y  p
    Large Convex C Close to Non-Convex Part of F        p  p  p  p  p  p  n  p
    Shallow Non-Convex C Close to Non-Convex Part of F  y  y  p  p  p  p  y  p
    Large Non-Convex C Close to Non-Convex Part of F    p  p  p  p  p  p  n  p
    Shallow Convex C Close to Convex Part of F          y  y  n  n  p  p  y  p
    Large Convex C Close to Convex Part of F            p  p  n  n  p  p  n  p
    Shallow Non-Convex C Close to Convex Part of F      y  y  p  p  p  p  y  p
    Large Non-Convex C Close to Convex Part of F        p  p  p  p  p  p  n  p
    Large C Mostly Encloses Large F; Close              p  y  y  y  y  n  p  n
    Large C Mostly Encloses Small F; Close              p  y  y  y  y  p  p  p
    Large C is Mostly Enclosed by Large F; Close        p  y  y  y  y  p  n  p
    Large C is Mostly Enclosed by Shallow F; Close      p  y  y  y  y  y  n  p
    Shallow C Mostly Encloses Large F; Close            y  y  y  y  y  n  y  n
    Shallow C Mostly Encloses Small F; Close            y  y  y  y  y  y  y  y
    Small C is Mostly Enclosed by Large F; Close        y  y  y  y  y  p  y  p
    Small C is Mostly Enclosed by Shallow F; Close      y  y  y  y  y  y  y  y

Table 3.1: Identifying Possible Cluster-Feature Relationships

    Summary of Results             Technique
                           1   2   3   4   5   6   7   8
    Number of y votes     10  15   9   9   9   4  10   3
    Number of n votes      1   1   4   4   2   4   8   4
    Number of p votes     10   5   8   8  10  13   3  14

Table 3.2: Summary of Possible Relationships

The technique numbers in the heading of Table 3.1 correspond to the following enumeration:

1. Cluster Point to Feature Boundary (using all points in the cluster)
2. Cluster Boundary to Feature Boundary
3. Convex Hull Intersection (using a small buffer zone)
4. Circle Intersection
5. Rectangle Intersection
6. Cluster Point to Feature Centroid
7. Cluster Medoid to Feature Boundary
8. Cluster Medoid to Feature Centroid

In the body of the table, "y" (yes) means that the technique is likely to find that a fair percentage (e.g., > 25%) of the points in the cluster are sufficiently close to the feature (or in the case of a polygon intersection test, that the polygons are likely to intersect). The term "close" implies a relatively small distance (e.g., 50-250 metres in an urban scenario), but not bordering the cluster/feature. The fact that the cluster and the feature are at least a little bit apart is needed to describe the situation where the buffer zone between 2 convex hulls will not help in detecting any intersection. The buffer zone is assumed to be less than 50 metres for this study. Again, we do not claim that the figures given here apply to all applications, or for that matter, that they are even the best parameters for an urban setting. Our purpose is simply to gain some insight into how various techniques respond to various situations.

Conversely, "n" (no) means that the particular situation is not likely to have many points close to the feature (or that the polygons are unlikely to intersect). The third choice, "p", means that the situation is possibly interesting, but there are simply too many possible boundary shapes to consider, thereby making classification difficult. For brevity, "cluster" is abbreviated as "C", and "feature" is abbreviated as "F".
In order to clarify the contents of the table, a detailed example is appropriate. Consider the situation "Shallow Non-Convex C Close to Non-Convex Part of F" (the 8th situation in Table 3.1). Figure 3.3 provides an example of this situation. For this particular cluster-feature combination, Table 3.1 indicates that Technique 1 (all cluster points to feature boundary) will likely find that a fair percentage of the houses in such a cluster will be sufficiently close to the feature to warrant interest (hence the "y"). Note also that Technique 3 (convex hull intersection using a small buffer zone) may or may not detect an intersection — there are simply too many possibilities to consider; hence, we assign a value of "p" to the entry to indicate a possible relationship. In Figure 3.3(a), we note that this particular instance will result in an intersection of convex hulls; however, we cannot make a general claim since a different non-convex feature may not result in an intersection (e.g., Figure 3.3(b)). Having considered various techniques, it is important to decide which of these tech-niques will be of the most value to our work, and will provide the most qualitative Chapter 3. Evaluating Relevant Computational Geometry Algorithms 41 (a) Hulls will intersect (b) Hulls will not intersect Figure 3.3: Shallow Non-Convex Cluster Close to Non-Convex Part of Feature information. For example, it is likely that we will need to calculate the minimum dis-tance between a cluster and a feature. Unfortunately, merely determining the minimum distance between a cluster and a feature may not be sufficient to identify anything other than a trivial relationship. Consider a narrow cluster and a narrow feature both having an endpoint fairly close to each other, and then extending in opposite directions. Aside from the endpoints at closest approach, there is virtually no relationship between the cluster and the feature. Clearly, there has to be more to a relationship than a single pair of points being close to each other. The techniques that provide the highest degree of effectiveness rely on a distance calculation for each point in the cluster; however, some types of distance calculations can produce misleading information. For example, consider the distance from a house to the centroid of a feature. If the feature is large, the distances are likely to be large for every house in the cluster. Conceivably, this means that a house may have a "better match" with a small cemetery partway across town than with a huge nature park lying directly in front of the house. This suggests that a point to (feature) boundary measurement would be more appropriate. In the following section, we take a closer look at the results of Table 3.1. Our goal is to see what we can learn from the table, in order to gain some ideas about how to Chapter 3. Evaluating Relevant Computational Geometry Algorithms 42 approach our problem. 3.4.1 Results from the Cluster-Feature Combination Table Table 3.2 summarizes the results of Table 3.1. Table 3.2 suggests that Technique 2 (the cluster boundary to feature boundary distance) is likely to identify the most promising candidates (i.e., closest features) because it has the most "y" votes. 
Of course, this assumes that each of the 21 cluster-feature combinations is equally likely to occur.3 Since Section 3.3 indicated that the distance calculation between a convex cluster boundary and a convex feature is of logarithmic complexity, it appears that the use of Technique 2 might be a quick way of determining a good set of candidates (providing the polygons are either convex to begin with, or can be converted to convex approximations). Table 3.2 also shows that Technique 7 (the cluster medoid to feature boundary case) has the largest number of "n" values. This suggests that Technique 7 might be of value in quickly determining which features are likely to be too far away to be of interest; con-sequently, there may be an opportunity to substantially reduce the pool of candidates quickly, so that there will be many fewer candidates to examine using another technique. For instance, of the 21 generic cluster-feature combinations in Table 3.1, 8 such combi-nations would quickly be discarded by Technique 7 simply because the calculations are bound to yield relatively large distances. Once 8 of the 21 combinations are discarded, this leaves 13 combinations on which to perform a "better" technique, albeit at a higher cost. Section 3.3 indicates that Technique 7 has logarithmic complexity if we are sat-isfied with calculating the distance between a cluster point (e.g., medoid) and a convex boundary. This means that we would be able to reduce the list of candidates in a rea-sonably efficient manner. Of course, the problem with this approach is that it is possible 3 W e do not claim that this will necessarily be the case for any application, even for the urban setting in our example. Chapter 3. Evaluating Relevant Computational Geometry Algorithms 43 that those 8 "rejected" combinations do bear an interesting relationship. For example, if a very large cluster borders a feature, then this is an important relationship; however, Technique 7 might have discarded it because of the large distance that is likely to exist between this cluster's medoid and the feature's boundary. What then can be concluded from Tables 3.1 and 3.2, and the earlier discussion on efficiency in Section 3.3? • Tables 3.1 and 3.2 indicate that there are simply too many combinations that fall into the category of "possibly interesting". This suggests that it is difficult, if not impossible, to come up with general guidelines about when to accept or reject a given feature. • If we are satisfied with approximations, then we can approximate a cluster and a feature by 2 convex polygons, and perform a relatively fast distance calculation (e.g., Technique 2) to determine the separation between them. Then, we can pick the closest features, and perhaps examine them more closely if there are too many candidates. • If we can detect a large number of unsuitable candidates using a low cost algorithm, then we can greatly reduce the prospect pool and perhaps use another technique thereafter. For example, Technique 7 is of low cost and will certainly reduce the prospect pool, but it suffers from premature eliminations. In summary, it appears that no single technique is going to meet our goals of ex-pressiveness, scalability, and incrementality. Although the results of Tables 3.1 and 3.2 are hardly conclusive, it does appear that a combination of techniques would be useful in solving our problem, whereby the list of candidates is reduced after each of several Chapter 3. Evaluating Relevant Computational Geometry Algorithms 44 techniques. 
Furthermore, if we are satisfied with approximations, then we can gain in efficiency. Finally, we note that there are going to be exceptional cases, whatever approach is decided upon.

3.5 The Dependency Diagram

Now that we have examined the cluster-feature combination table, we explore the complementary roles of pointwise operations and polygon intersections. Figure 3.4 highlights many of the dependencies hidden by Tables 3.1 and 3.2. We use Figure 3.4 to emphasize that an efficient solution to our problem depends on 2 characteristics, namely aggregate proximity (points) and degree of overlap (polygons) — both of which are represented as separate branches in the diagram. Aggregate proximity will tell us how close each point in a cluster is to a particular feature. Of course, we would like to take every reasonable measure to avoid applying pointwise operations to a large number of features, especially when most features are likely to be so far away from the cluster that it makes little sense to consider them.

Let us now turn our attention to polygon intersection techniques. First of all, we note that the size and shape of a cluster or feature is likely to affect the degree of overlap. For example, even though the 2 non-convex polygons shown in Figure 3.5(a) are disjoint, all convex approximations of the 2 polygons will have a high degree of overlap. Figure 3.5(b) also shows 2 disjoint non-convex polygons; however, here most of the tightly fitting convex approximations (e.g., a minimum bounding box) will not intersect at all.

The primary motivation for the use of approximations is scalability. We want to be able to deal with tens of thousands of features in a very short period of time. Point-to-point operations are relatively costly to compute. When dealing with just a few features and a small number of points, strictly using point-to-point operations may be justifiable since this would provide us with an accurate top n list of features; however, scalability is also important since data mining, by nature, typically takes a large volume of data as input.

[Figure 3.4: Dependency Diagram (branches: Effectiveness — Aggregate Proximity, Degree of Overlap)]
[Figure 3.5: Disjoint, Non-Convex Polygons. (a) All convex approximations intersect. (b) Many convex approximations will not intersect.]

As mentioned in Section 1.3, the use of approximations does not guarantee that the (approximated) features which intersect the (approximated) cluster are necessarily the best features in terms of aggregate proximity. Nevertheless, we are willing to trade-off some accuracy for efficiency. Specifically, we would like to consider the use of approximations and subsequent polygon intersection tests to efficiently determine which features are likely to be closest to the cluster. Then, we can examine those candidates in greater detail. (The use of exact shapes (convex or non-convex) rather than approximations can actually be a disadvantage in terms of efficiency because clusters and features are normally disjoint, and if we are using polygon intersection tests, then we are likely to find very few intersections.)

Forcing convexity on a non-convex polygon can be accomplished through the use of circles, rectangles, minimum bounding boxes, convex hulls, ellipses, etc.
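As a small illustration of how cheaply two such convex approximations can be built (a sketch under our own naming, not the thesis code): the isothetic rectangle comes directly from the coordinate extremes, and an encompassing circle can be centred on the centroid with a radius just large enough to cover every point. The circle built this way is encompassing but generally not the minimum bounding circle.

    #include <math.h>

    typedef struct { double x, y; } Point;
    typedef struct { double xmin, ymin, xmax, ymax; } Rect;   /* isothetic rectangle */
    typedef struct { double cx, cy, r; } Circle;              /* encompassing circle */

    /* O(n): the isothetic rectangle is just the coordinate extremes. */
    Rect isothetic_rectangle(const Point *pts, int n) {
        Rect r = { pts[0].x, pts[0].y, pts[0].x, pts[0].y };
        for (int i = 1; i < n; i++) {
            if (pts[i].x < r.xmin) r.xmin = pts[i].x;
            if (pts[i].x > r.xmax) r.xmax = pts[i].x;
            if (pts[i].y < r.ymin) r.ymin = pts[i].y;
            if (pts[i].y > r.ymax) r.ymax = pts[i].y;
        }
        return r;
    }

    /* O(n): centre the circle on the centroid and take the largest
       centroid-to-point distance as the radius, so every point is covered. */
    Circle encompassing_circle(const Point *pts, int n) {
        Circle c = { 0.0, 0.0, 0.0 };
        for (int i = 0; i < n; i++) { c.cx += pts[i].x; c.cy += pts[i].y; }
        c.cx /= n;  c.cy /= n;
        for (int i = 0; i < n; i++) {
            double d = hypot(pts[i].x - c.cx, pts[i].y - c.cy);
            if (d > c.r) c.r = d;
        }
        return c;
    }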
Of the convex figures mentioned, an encompassing circle typically has the largest error, but it also has a small complexity. Isothetic rectangles have less error than circles. Convex hulls have no error for convex features, and they have the smallest error among convex approximations Chapter 3. Evaluating Relevant Computational Geometry Algorithms 47 of a non-convex polygon. We define "error" as the amount of empty space that lies between the boundary of the approximating convex polygon and the boundary of the original polygon. Areas which generally have more error, such as circles, are more likely to intersect. Sometimes, the absence of an intersection suggests that the combination being investi-gated can be discarded. This implies that some error is actually beneficial to narrowing down our list of candidates. In this regard, a relatively crude approximation (such as an encompassing circle) 4 can be used as a "first cut", followed by more expensive approxi-mations as the number of remaining candidates shrinks. 3.6 Filtering Section 3.4.1 has shown that no technique on its own is going to efficiently give us the "best" features. It is clear that to compute aggregate proximity, we need to perform individual distance calculations. We also learned in Sections 3.4.1 and 3.5 that approx-imations will be of value, and that pointwise and polygon operations are both of value. Furthermore, Section 3.3 indicated that some of the polygon intersection operations are quite inexpensive, and hence highly conducive to scalability. These observations lead us to adopting a filtering approach whereby we apply in-expensive filters first, and try to eliminate as many unpromising features as early as possible. This means that fewer and fewer candidates will be left on which to perform more accurate (and hence more expensive) processing. This type of scenario has been well-studied in computer graphics, and we believe that this approach will also be of value to our spatial data mining problem. It makes sense to discard as many "irrelevant" cluster-feature combinations as soon 4 I t is easy to approximate a cluster by an encompassing circle by simply making sure that the radius of the circle is sufficient to enclose all of the points in the cluster. Chapter 3. Evaluating Relevant Computational Geometry Algorithms 48 as possible. Circle intersections can be determined quickly, so if an encompassing circle of a cluster fails to intersect the encompassing circles of neighbouring features, then this would provide a quick way of reducing the feature pool. In other words, this would greatly reduce the number of features that a more computationally intensive technique may have to analyze later on. Pointwise operations (aggregate proximity calculations) should be the last set of calculations to be performed. In performing filtering, we need to consider the following questions: • What techniques should be used? • What is the order of the techniques in the filtering process? • Is it possible to reuse the calculations of a given technique — perhaps incrementally for reruns if the user wants to see more (or fewer) features in the final list? • What happens if there are relatively few features? There is not much point in going through several levels of filtering for just a few features. What might work well for 50,000 features may not be best for 500 features. We answer these questions when providing Algorithm C R H in Chapter 4. 
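To make the "first cut" idea of Section 3.6 concrete, here is a minimal sketch (illustrative structures and names, not the thesis implementation, which uses buckets rather than a sort) of one filtering pass: every feature whose encompassing circle intersects the cluster's circle is kept, and because intersections may be scarce, the nearest non-intersecting features are also retained up to a threshold.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { double cx, cy, r; } Circle;
    typedef struct {
        Circle circle;       /* encompassing circle of the feature                  */
        double separation;   /* gap to the cluster's circle; 0 means they intersect */
    } FeatureEntry;

    static int by_separation(const void *a, const void *b) {
        double da = ((const FeatureEntry *)a)->separation;
        double db = ((const FeatureEntry *)b)->separation;
        return (da > db) - (da < db);
    }

    /* One "first cut" over the feature pool: record each feature's circle
       separation from the cluster, then keep every intersecting feature plus
       the nearest non-intersecting ones up to `threshold`.  Returns the number
       kept; the survivors occupy the front of the (sorted) array.  Sorting is
       used here only for brevity. */
    int circle_first_cut(FeatureEntry *pool, int n, Circle cluster, int threshold) {
        for (int i = 0; i < n; i++) {
            double gap = hypot(cluster.cx - pool[i].circle.cx,
                               cluster.cy - pool[i].circle.cy)
                         - (cluster.r + pool[i].circle.r);
            pool[i].separation = gap > 0.0 ? gap : 0.0;
        }
        qsort(pool, n, sizeof(FeatureEntry), by_separation);
        int kept = 0;
        while (kept < n && (pool[kept].separation == 0.0 || kept < threshold))
            kept++;
        return kept;
    }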
Chapter 4
Algorithm CRH

In Chapter 3, we evaluated various computational geometry techniques en route to achieving our goals of expressiveness, scalability, and incrementality. We now focus on transforming our observations into an algorithm that meets these goals. A complete case study with performance results will be presented in Chapter 5.

4.1 Use of Successive Approximations in Filtering

In Section 3.5 we noted the trade-offs between accuracy and efficiency. The most efficient algorithms are those which involve approximations such as encompassing circles and isothetic rectangles. We now describe a filtering approach that uses successive approximations to help us meet our goals.

Consider a cluster represented by some irregular polygon, such as that shown in Figure 4.6. An encompassing circle is likely to have a fair bit of "white space" in it. We define "white space" as being the area lying between the circumference of the circle and the given polygon (representing the cluster). The introduction of white space will not adversely affect our results because most features are likely to be so far away from the cluster that we would like to discard those features from consideration as soon as possible. To do so, we can construct an encompassing circle for the cluster and for each feature, and test for the intersection of circles. If an intersection occurs, we will retain the particular feature being tested, in a candidate list. (Intersection can easily be computed by comparing the sum of the radii of the circles to the distance between their centres.) If no intersection occurs, we will retain the separation distance. This is important because there may not be enough cluster-feature intersections. (This is especially true as the amount of white space in subsequent filters (described below) decreases.)

[Figure 4.6: Encompassing Circle — "C" marks the centroid of the cluster and is also the centre of the circle; "x" represents a point.]

An isothetic rectangle (see Figure 4.7), while not minimally bounding, typically has a smaller amount of white space and can reduce the number of candidates (features) even further. Again, separation distances will be retained in case there are too few candidates. Finally, a convex hull (see Figure 4.8) has the least amount of white space of the 3 geometric features in the set {circle, rectangle, hull}, and it is necessarily minimally bounding. Hull intersection is likely to reduce the list of candidates even further. The actual test for hull intersection can be a binary one because there is no need to compute the degree of overlap. In the absence of intersection, it is necessary to compute the minimum distance between the hulls.

[Figure 4.7: Isothetic Rectangle]
[Figure 4.8: Convex Hull — "x" is a point; the dotted line is the convex hull.]

Algorithm CRH will use these 3 geometric constructs in a multiple filtering approach whereby the filters are set up in an increasing order of accuracy, but in a decreasing order of efficiency. The circle filter can quickly eliminate a very large number of features that are not promising, and pass the remainder to the next filter, namely the rectangle filter.
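Both the circle and rectangle filters also record a separation distance whenever no intersection occurs. A minimal sketch of those two separation computations follows (our own types and formulas; the thesis's exact routines are not reproduced here): the circle separation is the centre distance minus the radii, and for two isothetic rectangles the minimum boundary distance follows from the per-axis gaps.

    #include <math.h>

    typedef struct { double cx, cy, r; } Circle;
    typedef struct { double xmin, ymin, xmax, ymax; } Rect;

    /* Separation between two circle boundaries; 0 means they intersect. */
    double circle_separation(Circle a, Circle b) {
        double d = hypot(a.cx - b.cx, a.cy - b.cy) - (a.r + b.r);
        return d > 0.0 ? d : 0.0;
    }

    /* Minimum distance between the boundaries of two isothetic rectangles;
       0 means they intersect.  A per-axis gap is 0 when the extents overlap
       on that axis, so disjoint rectangles get a corner-to-corner or
       edge-to-edge distance as appropriate. */
    double rect_separation(Rect a, Rect b) {
        double gx = fmax(0.0, fmax(b.xmin - a.xmax, a.xmin - b.xmax));
        double gy = fmax(0.0, fmax(b.ymin - a.ymax, a.ymin - b.ymax));
        return hypot(gx, gy);
    }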
The rectangle filter can eliminate some more features, and finally pass the remaining features to the hull filter. Finally, for every feature that remains, Algorithm CRH calculates the aggregate proximity of points in the cluster to the convex boundary of the feature. Those features are then ranked based on their aggregate proximity values.

4.2 Thresholds

Thresholds are used to impose quantity control on the filtering process. We do not have any control over the number of intersections that occur, so we simply pass all intersecting features to the next filter; however, we must avoid passing too few candidates to the next filter. For example, if only 3 features intersect the cluster during the rectangle test, then that does not mean that there are only 3 features in the entire dataset that are of interest. It would be wise to pass more features (say twice as many as the desired target) to the hull filter and then let the hull filter reduce the number of features to the final threshold.

Even after performing the convex hull intersection test, it is likely that only a small number of features (perhaps zero) intersect the cluster. This should not surprise us since we noted earlier that the convex hulls are highly dependent on the size and shape of the underlying non-convex polygon. Also, because a convex hull is minimally bounding, the amount of white space is normally smaller than that for a circle or rectangle. In fact, in an urban setting, it is quite likely that neighbouring features (convex or non-convex) do not intersect the cluster; consequently, we cannot rely solely on polygon intersection tests to determine proximity relationships. Instead, we need to order the candidates according to how far away each candidate is from the cluster (in terms of boundary separation).

The number of features retained at any stage of processing is called a threshold. Although a default final threshold is provided, the user has the option of overriding this value. Within the final threshold, ranking takes place, so that the user can quickly see which features are ranked highest in terms of aggregate proximity.

It is possible for one filter to rank the features in a significantly different order than that of another filter. For example, suppose that feature F_1 is a long, narrow island located relatively far away from the cluster of houses, and that feature F_2 is a small shopping centre that is located fairly close to the cluster. Given the elongated shape of the island, the large circle approximating it may intersect the cluster's circle, whereas the small circle approximating the shopping centre may not. Thus, the island is ranked higher than the shopping centre, according to the circle filter. However, in the next filter, rectangles are used to represent both the island and the shopping centre, and now the features are assigned more realistic ranks.

Two conclusions can be drawn from these observations:

1. Due to the large amount of white space, the circle filter is likely to include some features that should (and will) be discarded later on in the filtering process.

2. If the thresholds are too small, some very promising features, such as the shopping centre in our example, are likely to be excluded very early in the filtering process. Such features cannot be reclaimed later, since they are eliminated from further consideration as soon as any filter rejects them.

Algorithm CRH uses circle, rectangle, and hull thresholds such as th_c = 30, th_r = 20, and th_h = 10.
As will be discussed later, the user has the ability to change the thresholds for an incremental run (i.e., an efficient "rerun"). Regardless of the final threshold, the thresholds obey the order: th_c > th_r > th_h.

The final threshold is the hull threshold. This is the target number of features that Algorithm CRH aims to return to the user at the end of its processing. If there are more features than th_h with the highest rank (due to ties), then Algorithm CRH reports all of those features (i.e., features F_(th_h - m), ..., F_(th_h), ..., F_(th_h + k)).

4.2.1 Reaching the Thresholds

It is impossible to determine a priori the number of features that intersect a cluster during a given stage of filtering. As will be demonstrated in the case study documented in Chapter 5, the number of intersecting features for a given filter can vary widely. In some cases, more than enough intersections occur; however, in other cases, very few intersections occur. This suggests a number of implementations for Algorithm CRH:

• The first policy involves shape enlargements. While we could simply select the closest n features to the cluster, shape enlargements may be more appealing when the cost of computing, storing, and sorting distances (to determine the closest n features) greatly exceeds the cost of binary polygon intersection tests. Using this policy, we enlarge the shape of the convex cluster, and test for polygon intersections using an iterative approach in which the cluster gets larger. Since the cost of performing a binary intersection test is small, this approach could perform quite well if the number of neighbouring features is large (since few iterations would be required).

• The second policy involves calculating and recording the separation distances during a single pass of a filter. This is referred to as memoization. Memoization will be wasteful in the case where the number of intersections equals or exceeds the threshold for the filter; however, it will be beneficial when not enough intersections occur.

Algorithm CRH will support both of these policies; however, the default policy will be memoization. We discuss both of these policies in the following sections.

4.3 Shape Enlargement

    compute m = the number of features that intersect the cluster;
    while (m < th) {
        uniformly enlarge the shape of the cluster;
        recompute m, using the enlarged cluster;
    }

Figure 4.9: Meeting the Threshold via Shape Enlargement

Shape enlargement is not necessary if the number of features m that intersect a given cluster has already exceeded the threshold th associated with the particular filter because all of those features are simply passed on to the next filter. On the other hand, if m < th, we need to include more features. The program skeleton shown in Figure 4.9 aims to find enough features to equal or exceed the threshold by enlarging the cluster. In each iteration, the cluster is enlarged further until there are sufficiently many features that intersect the enlarged cluster.
We say that a shape S is enlarged to S_e if all of the points that are originally in S are also in S_e, and the area of S is less than that of S_e. One of the advantages of the above approach is that shape enlargement is easy to compute. Given a circle with radius r, this circle can be enlarged by simply increasing the radius while keeping the same centre. Similarly, given a rectangle whose diagonals are of length 2l, it can be enlarged by increasing the length to 2l_new, while keeping both the centroid and the slope of the diagonal intact. The following equations show how a new vertex (x_new, y_new) can be obtained by using the old vertex (x_old, y_old) and the centroid (x_cen, y_cen):

    (x_new - x_cen)^2 + (y_new - y_cen)^2 = (l_new)^2

    (y_new - y_cen) = [(y_old - y_cen) / (x_old - x_cen)] * (x_new - x_cen)

The first equation forces the new vertex to be of distance l_new from the centroid, and the second equation ensures that the slope of the new diagonal does not change. Note that the centroid remains unchanged. Using the same equations, a convex hull can be enlarged by increasing the distance between its centroid and every vertex, while keeping both the centroid and the slope of the line between the centroid and the vertex the same.
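The vertex update implied by the two equations above amounts to scaling every centroid-to-vertex offset by the common factor l_new / l_old, which keeps the centroid fixed and preserves each slope. A minimal sketch (our own naming; not the thesis code):

    #include <stdio.h>

    typedef struct { double x, y; } Point;

    /* Push each vertex of a convex polygon away from the centroid by a
       common factor, keeping the centroid and the centroid-to-vertex slopes
       unchanged -- the constraints expressed by the two equations above
       (factor = l_new / l_old). */
    void enlarge_about_centroid(Point *verts, int n, Point centroid, double factor) {
        for (int i = 0; i < n; i++) {
            verts[i].x = centroid.x + factor * (verts[i].x - centroid.x);
            verts[i].y = centroid.y + factor * (verts[i].y - centroid.y);
        }
    }

    int main(void) {
        /* A unit-radius square centred at the origin, enlarged by 10%. */
        Point square[4] = { {-1,-1}, {1,-1}, {1,1}, {-1,1} };
        Point c = { 0.0, 0.0 };
        enlarge_about_centroid(square, 4, c, 1.1);
        for (int i = 0; i < 4; i++)
            printf("(%.2f, %.2f)\n", square[i].x, square[i].y);
        return 0;
    }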
4.3.1 Linear Mode vs. Bisection Mode

The shape enlargement algorithm outlined in Section 4.3 is fine when the number of iterations of the loop in Figure 4.9 is very small. Experience has shown that the number of iterations can vary widely, and that it is not uncommon for 100 iterations to occur, depending on the factor by which the feature is enlarged. If the increment is large, fewer iterations are required, but it is likely that many more features are included than the desired threshold. On the other hand, if the increment is small, many iterations may be required, but it is quite likely that the number of features selected matches the threshold exactly (or is slightly above the threshold). Having too many features is undesirable because it means that more features are passed on to more expensive filters (or following the hull filter, the aggregate distance calculations). Having too few features implies that a greater number of iterations are required, and each iteration can be quite costly since it amounts to making another complete pass through the feature pool. When dealing with a large number of features and a relatively small amount of main memory (say 16 MB, which has to be shared with system software including the windowing system), our experience has shown that paging activity can add many minutes to the elapsed time.

A straightforward approach is to enlarge the shape in each iteration by either a fixed amount (that bears no relationship to the size of the cluster), or by an incremental multiplier (that equals, say, 10% of the radius of the cluster). If the shape is a circle, then the new radius r_new is either r_old + δ or r_old * m, for some δ and m. This is called the linear mode of shape enlargement. For our data mining program, we chose a multiplier m that takes on the successive values 1.1, 1.2, 1.3, ..., where 1.0 equals the radius of the circle encompassing the original cluster. To summarize, the convex polygon (circle, rectangle, or hull) of the cluster is compared to the respective convex polygon of each feature. The cluster's convex polygon is enlarged in each iteration until the number of intersecting features equals or exceeds the threshold for the particular filter (circle, rectangle, or hull). Many iterations may be required, especially if the top n features are relatively far away from the cluster.

The second mode of shape enlargement involves bisection, which performs a logarithmic number of iterations. Initially, the range of bisection is defined by the cluster boundary and the largest or smallest x and y co-ordinates of the feature dataset. In other words, the maximum range is the maximum distance between the boundary of the cluster and one of the four corners of the map. In each subsequent iteration, one-half of the remaining distance interval is discarded, depending on whether too many or too few intersections occurred previously. Bisection continues until the appropriate multiplier is found, in which case either the threshold th is met exactly, or no further bisection is possible (given a tolerance ε). At this point, m features are returned, where m is the smallest number of features found by the bisection mode such that m > th.

The bisection mode typically outperforms the linear mode since fewer iterations are required. Only in the case where the required number of features (defined by a threshold th) is very close to the cluster will the linear method be superior. Of the 2 modes, bisection is preferred since it can reach a given threshold relatively quickly, regardless of the distribution of features on a map. Section 5.6.1 shows experimental results comparing the two modes.

4.4 Memoization

The problem with the shape enlargement policy (linear or bisection) is that it may require a large number of iterations to reach the threshold. Ideally, we would like to make only 1 pass through the features for 2 reasons:

• If the number of features is large (which we would expect in a data mining application) and if the number of iterations is large, then a very large number of comparisons/computations will have to be made.

• If the number of features is large and the amount of main memory in the machine is not sufficiently large to store all of the features, then an enormous amount of paging activity may take place.

Since one of the objectives of this thesis is to come up with an approach that supports scalability, it is clear that shape enlargement is less than desirable. Ideally, we desire a single pass through the features, whereby the distance between the cluster and each feature is recorded. These distances can be used to determine whether or not to accept a feature as a candidate during the current filter.

The idea behind memoization is to perform the necessary calculations once and store the results for fast lookup. Even if the policy of shape enlargement were used, we could still use memoization to avoid the need to perform intersection tests between a cluster and a feature during every iteration. For example, if an intersection occurred between a cluster and a feature, then a flag could be set to avoid a future intersection test when the cluster is enlarged. Unfortunately, this only helps with the few features that intersect a given cluster. What about the possibly tens of thousands of features in the first filter that do not intersect the cluster? Under this approach, most features will have to be tested again and again. A better approach is to memoize the distance between a cluster and each feature. Then, instead of performing a distance calculation during every iteration, the distance would be computed once, and the distance would indicate how far apart the convex polygons are.
If the 2 polygons intersect, then we assign a value of zero for the separation distance.

During the circle filter, it is easy to compute the distances. Since the radius of a cluster's encompassing circle will be known and since the radius of a feature's encompassing circle will be known, the sum of these two distances can be compared to the distance between the 2 circles' centres. This will give us an upper bound on the separation between the cluster and the feature. If the distance between the centres is less than or equal to the sum of the radii, then this implies an intersection. If the 2 circles do not intersect, then we can store the separation and subsequently pick those features that have the least separation.

If the shapes are rectangles, then the minimum distance between the boundaries of the two rectangles is calculated and stored. The minimum distance need not be the minimum distance between a vertex of the cluster and a vertex of the feature; however, the minimum distance must be the minimum distance between a vertex of one such rectangle and an edge of the other. This computation is very fast.

If the shapes are convex hulls, then the computation is somewhat more complex. First of all, we perform a test to determine if the hulls intersect. If no intersection is detected, then we compute the hull separation. We begin with an array representation for each of the hulls (to allow us to perform binary searching on the vertices [Edelsbrunner, 1985]), and we determine the outer common tangents of these 2 polygons [1]. Specifically, we use the algorithm of Kirkpatrick and Snoeyink to compute common tangents without a separating line [Kirkpatrick and Snoeyink, 1993a], in time logarithmic to the number of vertices in the polygons. This identifies 2 chains of hull vertices: one for the cluster's hull and another for the feature's hull. Intuitively, these 2 chains of consecutive line segments "face" each other in the sense that a vertex in either chain can "see" at least one vertex in the other chain [2]. We then compute the separation distance between the 2 disjoint convex polygons as a fixed-point problem [Kirkpatrick and Snoeyink, 1993b], performing a binary search to identify the final 2 vertices (one in each polygon). Since the minimum distance between 2 polygons need not be a vertex-to-vertex distance, we perform a final computation to find the minimum distance between a vertex and a neighbouring edge.

Memoized distances can be reused. This is relevant to the implementation not only because it avoids the need for multiple passes through the list of features, but because it supports incrementality. For example, suppose a user asks for the top 10 features, but after seeing the results, he/she decides to request the top 20 features. Since many of the distance computations have already been done, these values need not be recomputed. If an intersection test already concluded that the cluster intersects the feature's polygon, this fact is also memoized, thereby eliminating the need to perform redundant intersection tests.

[Footnote 1: Other approaches exist to determine the minimum distance between 2 convex polygons without the use of tangent lines. One such example involves the Dobkin-Kirkpatrick hierarchical representation of polygons and polyhedra [Dobkin and Kirkpatrick, 1985] [Dobkin and Kirkpatrick, 1990]. Assuming that O(n) preprocessing has been done, their results yield an O(log m + log n) solution to the 2-dimensional separation problem (where m and n represent the number of vertices in each respective polygon).]
[Footnote 2: If x is a vertex in polygon A, and y is a vertex in polygon B, then we say that x and y "see" each other if the line segment joining x and y does not intersect the interior of either polygon A or polygon B. This definition ignores any objects that may lie between polygon A and polygon B.]
Only new distances need to be computed. Furthermore, once the relatively costly aggregate distance statistics have been computed, those statistics will not change (although their relative rank might change). It makes sense to store these calculations as well. As will be shown in Chapter 5, memoization leads to highly efficient incremental processing.

To summarize, memoization is performed after computing each of the following:

• cluster-feature separation distances during the circle, rectangle, and hull filters
• hull intersection
• aggregate distance statistics for a particular cluster-feature pair

The price to be paid for memoization is a few extra bytes of storage per feature (to record the distances, an intersection flag, and a pointer to a dynamically allocated data area that retains the computed aggregate statistics).

To promote efficiency and reuse in the cluster-feature separation calculations, a bucket approach will be used to hold the candidates. Pointers to the original features will be stored, along with each cluster-to-feature separation. There will be j buckets that will be responsible for groups of k-metre intervals (of separation). The interpretation is as follows: the first bucket holds pointers to all features that intersect the given cluster, the second bucket holds pointers to all other features whose boundaries lie within k metres of the cluster, the third bucket holds pointers to all other features lying within 2k metres of the cluster, ..., and the last bucket holds pointers to all other features which do not lie within (j - 2)k metres of the cluster.

With respect to hull intersection tests, once the program determines that the feature's hull intersects the cluster's hull, a flag that accompanies that feature's node (in the linked list of features) is set to avoid future intersection tests. This flag is also useful for the linear and bisection modes; however, in the case of bisection, the flag has to be reset if the new multiplier is less than the previous multiplier.

With respect to aggregate distance statistics, a pointer field has been reserved in each feature's node to point to a 6-field aggregate distance structure (which is created only if necessary). The idea is to avoid redundant aggregate distance calculations at the expense of a small amount of storage.

Before leaving the topic of memoization, we ask ourselves if it is really necessary to compute each of the cluster-feature distances. Perhaps it is possible to optimize our results using only a partial set of distances (or a set of less precise distances). Shasha and Wang approximate distances between objects using a triangle inequality [Shasha and Wang, 1990]. Their idea is to try to optimize processing by reducing the number of distance calculations that have to be done to find the n objects that are the most similar to a given target (according to some distance metric).
Unfortunately, their algorithm is O(n^3) and is inappropriate when dealing with tens of thousands of features.

As will be shown in the case study presented in Chapter 5, computing each cluster-feature distance still provides excellent performance using Algorithm CRH. We do not claim that computing and storing each such distance is optimal or necessary because situations can arise whereby the maximum (or minimum) x co-ordinate of the feature is so far away from the minimum (or maximum) x co-ordinate of the cluster, that there would be no point in computing and storing the actual distance between the cluster and the feature. In this case, we would also have to assume that the user is unlikely to ask for a huge final threshold. For completeness, we compute every cluster-feature distance in the initial filter.

4.5 Computation of Aggregate Statistics

The approach taken for each of the candidates that have successfully passed through the hull filter is as follows:

• For each point p in the cluster, compute the minimum distance d_p from that point to a hull vertex. Store that distance.

• For all the minimums (there will be one minimum for each cluster point p), keep track of the smallest overall "minimum" and the largest overall "minimum", as well as the number of "minimum" values which fall below certain pre-defined values r_i, obeying the order r_1 < r_2 < ... < r_j, j ∈ Z+. For example, the r_i's for j = 4 might obey the order 250 metres, 500 metres, 1000 metres, and 2000 metres.

• Rank the features using the following algorithm. More accurate or expensive algorithms can be used, but since we are dealing with approximations anyway, this algorithm is a relatively simple way of assigning a greater weight to those cluster points that are nearer the feature. Assume j ∈ Z+ and that DIST_j is expressed as a percentage of the total points in the cluster which fall within distance r_j of the feature's hull vertices.

  - For each of the final cluster-feature pairs, compute:

        score = sum_{i=1}^{j} ( i * DIST_{j-i+1} )

  - Then, sort all of the scores for the cluster-feature pairs into descending order.

• Report the features in descending order of score. This will allow the user to quickly see which cluster-feature combinations have been identified by the data mining program as being the closest in terms of aggregate proximity. A perfect score is 100 * sum_{i=1}^{j} i, which simplifies to 50j(j+1), in which case 100% of the points are within (approximately) r_1 metres of the feature's hull. For example, if j = 4, then 1000 is a perfect score.

4.6 Summary of Processing

Because of the efficiencies obtained through convex polygon approximation, filtering, and memoization, we believe that Algorithm CRH is a very satisfactory way of determining aggregate proximity relationships in a spatial data mining context. Here is the high-level pseudo-code for a spatial data mining program which incorporates Algorithm CRH:

1. Load the features and the cluster.
2. Perform circle filtering to yield the nearest th_c (circle threshold) candidates. Memoize the distances for all of the features processed by the circle filter.
3. Perform rectangle filtering on the th_c features to yield a smaller number of features: th_r. Memoize the distances (for the features encountered by the rectangle filter).
4. Perform hull filtering on the th_r features to yield a smaller number of features: th_h. Memoize the distances.
5. Take all of the cluster points and calculate the minimum distance from each cluster point to a hull vertex. Memoize the distances.
6. Rank the th_h features according to their aggregate proximity, assigning greater weights to cluster points that are nearer the feature.
7. Report the results.
8. Ask the user if a new threshold should be used. If so, repeat steps (2) through (8), using the memoized results.

We present a case study in the following chapter.
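To illustrate the ranking rule of Section 4.5, here is a minimal sketch (illustrative names and toy data, not the thesis implementation) using j = 4 with the example cut-offs of 250, 500, 1000, and 2000 metres: DIST_j is the percentage of cluster points whose minimum point-to-hull distance falls within r_j, and the score weights the tighter cut-offs more heavily.

    #include <stdio.h>

    #define J 4   /* number of distance cut-offs */

    /* score = sum over i of i * DIST_{j-i+1}, where DIST_j is the percentage
       of cluster points within r_j of the feature's hull.  With j = 4 the
       innermost cut-off gets weight 4, and a perfect score is 50*j*(j+1) = 1000. */
    double aggregate_score(const double *min_dist, int npoints, const double *cutoff) {
        double dist_pct[J];
        for (int k = 0; k < J; k++) {
            int within = 0;
            for (int p = 0; p < npoints; p++)
                if (min_dist[p] <= cutoff[k]) within++;
            dist_pct[k] = 100.0 * within / npoints;
        }
        double score = 0.0;
        for (int i = 1; i <= J; i++)
            score += i * dist_pct[J - i];   /* dist_pct[J-i] holds DIST_{j-i+1} */
        return score;
    }

    int main(void) {
        double cutoff[J] = { 250.0, 500.0, 1000.0, 2000.0 };   /* metres */
        /* Minimum point-to-hull distances for a toy 8-point cluster. */
        double min_dist[] = { 100, 150, 220, 400, 600, 900, 1200, 2500 };
        printf("score = %.1f (perfect = %d)\n",
               aggregate_score(min_dist, 8, cutoff), 50 * J * (J + 1));
        return 0;
    }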
Take all of the cluster points and calculate the minimum distance from each cluster point to a hull vertex. Memoize the distances. 6. Rank the thh features according to their aggregate proximity, assigning greater weights to cluster points that are nearer the feature. Chapter 4. Algorithm CRH 65 7. Report the results. 8. Ask the user if a new threshold should be used. If so, repeat steps (2) through (8), using the memoized results. We present a case study in the following chapter. Chapter 5 Case Study and Implementation This chapter describes the actual implementation of a spatial data mining program, using a GIS (or real estate) example that deals with features and neighbourhoods in the city of Vancouver. Assumptions are documented, shortcomings are noted, and justification is provided for the techniques employed. Incidents that resulted in frustration (or bottle-necks that occurred) are also described, along with how those problems were resolved or circumvented.1 An overview of the data model was provided in Figure 1.1. Features have been provided in a single dataset. The cluster is in a dataset of its own. No preprocessing is required. The only interaction provided with a user occurs when the program asks the user if a rerun should occur (using a different final threshold). This interaction occurs at the end of a complete set of processing. The program expects the user to supply the names of the files containing: • the features • the cluster There is no need for any graphical interface, although data can easily be written to a file that can in turn be plotted using a graphics tool such as "xgraph". The program can be customized by future users to suit the particular domain of the user. For example, a 1 For example, some of our initial frustrations involved the lack of robust, publicly available compu-tational geometry code. We spent a lot of time searching for code. In some cases, we eventually found suitable code; in other cases, we had to write our own code. 66 Chapter 5. Case Study and Implementation 67 GIS user may wish to support a particular GIS data format or graphics package. Our implementation is a generic one. Its development is described in the following sections. Performance results follow. 5.1 Acquisit ion of Data The testbed involves houses and features in the city of Vancouver. Initially, we believed that we could easily find an abundance of on-line data about Vancouver (via the re-sources of the University of British Columbia), especially since our university is located in Vancouver. Unfortunately, very little local (or non-local) data was available. UBC's Map/Data Library expects to acquire digital data (i.e., on-line maps), but not in the near future. The Data Library has one set of maps on the Vancouver area. It was produced by the Topographical Surveys Division of the Surveys and Mapping Branch of Energy, Mines, and Resources Canada. The data resided on an MTS (Michigan Terminal System) mainframe in E B C D I C format. The data was successfully extracted and downloaded via a F O R T R A N program; however, closer analysis of the data indicated that many of the natural or manmade features that we were hoping would be included, were not present. After a number of interviews with individuals in government, industry, and various departments within the university, it was clear that the only digital map data of Vancou-ver that was of any interest was proprietary. 
This meant that money would be needed to acquire the data (and the related software), and that conversion efforts would be required to turn the data into a form suitable for the data mining program. Furthermore, it was apparent that not all of the data desired (e.g., cemeteries, golf courses, schools, shopping centres, key buildings) was necessarily included in these proprietary packages. Chapter 5. Case Study and Implementation 68 5.1.1 Digitization After considering all of the above, a decision was made to generate our own data from a good set of paper maps. The City of Vancouver's Engineering Department was contacted, and one 1:25000 (scale) area map and twelve 1:5000 maps were purchased for $5 each. Over a period of 4 days, the 1:25000 map was digitized using UBC's GIS Laboratory in Resource Management in Environmental Studies. We used standard digitizing equipment to extract all of the elementary schools, high schools, parks, golf courses, cemeteries, bridges, beaches, and shorelines, along with a number of significant building complexes in the city of Vancouver. The digitization was carefully performed to generate data files consisting of points (vertices) that represent the features. There were 326 features that were digitized. Digitization was chosen over scanning or camera techniques because of the amount of warping that would be introduced by these latter two techniques. In fact, it was necessary to go to an on-campus laboratory that had digitization equipment because the Computer Science department (and the administrative arm of UBC's computing services) no longer had such equipment. Digitization is still actively being used in aerial photography, soil science, geography, forestry, etc., but it is a time-consuming process. Human skill is required to identify and accurately extract interesting components of maps, photographs, etc. After consultation with a number of experts in these fields, it was concluded that digitization was preferred over scanning or camera techniques, especially after it was pointed out that a great deal of human time would be required to filter/adjust warped data. Furthermore, since maps containing even a modest amount of detail are typically much larger than an 8 | x 11 inch sheet of paper, it would be necessary to split the map into pieces, capture the images, and then piece the intermediate results back into one contiguous unit. In summary, it made sense to simply digitize the maps because we would Chapter 5. Case Study and Implementation 69 get much more accurate data with the same effort. Following digitization, the features were downloaded in ASCII format to floppy disks, and subsequently uploaded to a U N I X file system. 5.1.2 Volume of Data Since one of our objectives was to test scalability with tens of thousands of features, we decided to save an enormous amount of effort by duplicating the pool of features (326) that was produced in the digitization process mentioned above. We desired 50,000 or more features. Since we assume that the existing pool of features represents a good mix of shapes and sizes of features, it seemed reasonable to duplicate a large subset of the 326 features, provided that the subset does not influence the outcome of the results. For example, care had to be taken so that any feature that wound up in the top n list was not duplicated; otherwise, that feature would appear multiple times in the output (and would prematurely terminate processing since the final threshold would be reached very quickly). 
In particular, we ran our first test against the 326-feature dataset, and removed the top 75 ranked features, leaving behind 251 features. These 251 features were duplicated 10, 100, and 200 times, providing us with large datasets of features. The original 75 features were then added back into each dataset.2 Algorithm C R H will still have to examine all of the features in a given dataset; however, only the best 75 features (as before) would be extracted by the first filter. This method of generating feature datasets for scalability testing was repeated for other clusters as well. 2 In order to generate a top 25 list during scalability testing, we had to reserve 75 features (and avoid duplicating them) because that is how many features would be selected by the circle filter. (The default threshold for the circle filter is 3 times the value of the final/hull threshold.) Chapter 5. Case Study and Implementation 70 5.1.3 Choice of Clusters A cluster can be generated in a number of ways. For example, the cluster may have resulted from an SQL query in a database system, or it may simply be a set of pre-defined points in which each point belongs to the same, arbitrary neighbourhood in the city. Furthermore, a set of points can be partitioned into one or more clusters using an algorithm such as C L A R A N S described earlier. Since it does not matter how the points are generated, and since the data mining program can ultimately have any front-end interface, we simply created various clusters that possessed the following characteristics (for a decent set of tests). We tested clusters of various: • shapes (regular and irregular) • sizes (large and small areas) • densities (sparse and dense number of houses) • distributions (uniform and non-uniform) • locations (near the geographic centre of Vancouver, near the shoreline, near irregu-lar features having a large number of vertices, near a large number of neighbouring features, etc.) It was interesting to test various clusters. For example, one cluster tested was a neigh-bourhood in which the author of this thesis grew up. When performing runs involving this neighbourhood, the data mining program generated legitimate hits on features that the author was not familiar with, simply because he rarely ventured into those places. (Children tend to stay close to home, and close to where their friends are.) Maps con-firmed that these hits were legitimate, and that other features that were frequented by Chapter 5. Case Study and Implementation 71 the author really should not be ranked higher (and were not ranked higher) than the fea-tures that were selected by the program. This particular cluster consisted of 71 houses and spanned about 10 city blocks in an L-shaped pattern. Its convex hull intersected the convex hull of 2 features. 5.2 Linear, Bisection, and Memoization Modes The code is modular and supports 3 modes of operation, depending on a constant that can changed prior to compilation. The 3 modes are shape enlargement (linear), shape enlargement (bisection), and memoization. These modes were described in Sections 4.3, 4.3.1, and 4.4. The default mode is memoization. (We support all 3 modes to facilitate testing and to compare results.) A l l 3 modes use C R H filtering, and the same top n list is expected in each case. 
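For illustration only, the filtering pipeline that all three modes share might be wired together as sketched below. The routine names and signatures are assumptions, not the thesis code; the actual program operates on linked lists of features and memoizes distances as each filter runs (Section 4.4).

    /* Hypothetical sketch of the CRH filtering pipeline shared by the
     * linear, bisection, and memoization modes.                          */
    typedef struct feature feature_t;

    /* Each filter keeps at most 'threshold' candidates, ordered by
     * increasing cluster-feature separation, and returns how many remain. */
    extern int  circle_filter(feature_t **cands, int n, int threshold);
    extern int  rectangle_filter(feature_t **cands, int n, int threshold);
    extern int  hull_filter(feature_t **cands, int n, int threshold);
    extern void rank_by_aggregate_proximity(feature_t **cands, int n);

    /* th_c >= th_r >= th_h are the circle, rectangle, and hull thresholds. */
    void crh_pipeline(feature_t **features, int n_features,
                      int th_c, int th_r, int th_h)
    {
        int n = circle_filter(features, n_features, th_c);
        n = rectangle_filter(features, n, th_r);
        n = hull_filter(features, n, th_h);
        rank_by_aggregate_proximity(features, n);
    }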
In general, a user does not know the density and proximity of the set of features around a cluster; however, if a user knew that many features were very close to the cluster, then the user could request the linear mode, because linear mode may outperform memoization in that situation. Since we assume that the user does not know the spatial characteristics, and since memoization virtually always outperforms the other modes, memoization is the preferred approach. Even in those cases where memoization does not outperform the other methods, the difference in run times is bound to be so small that it does not warrant changing the constant and recompiling.

Here are the parameters defined for the linear and bisection modes of operation:

• Linear Mode: The increment is 10% of the radius of the encompassing circle (or 10% of the length of the diagonal between the centroid and the furthest vertex, in the rectangle or hull case).

• Bisection Mode: As described in Section 4.3.1, bisection discards half of the current search range in each iteration, until either the threshold is met exactly or a given tolerance is reached. A tolerance of 0.025 units was found to be satisfactory for all of the test cases which we ran. (We ran numerous tests and found that higher tolerances sometimes resulted in more candidates being returned than desired. For example, if th_c = 75, it would not be unusual to get 80 or more candidates when the higher tolerance was reached; reducing the tolerance to 0.025 brought the number of candidates closer to, or equal to, 75. A tolerance much smaller than 0.025 may cause extra iterations, but will not necessarily allow us to meet the threshold: regardless of how close the tolerance is to zero, there are cases where the threshold still cannot be met exactly, e.g., when the next 2 candidates have the same separation after th_c - 1 candidates have been reached.)

5.3 Thresholds

The thresholds for the filters are arranged in decreasing order. The default circle threshold is 30, the default rectangle threshold is 20, and the default hull threshold is 10. If a user provides the final threshold th_h rather than letting it default to 10, then the program computes th_r as twice th_h, and th_c as three times th_h. The program can accept a user-defined threshold in two ways:

• On the command line prior to program execution. For example, if the user wants the top 25 choices, the following line can be supplied:

    mining feature_file cluster_file 25

• Following the execution of the program for a given cluster, the user is prompted to enter a new threshold, or 0 to terminate the program. Until the user enters 0, the program reuses its previous results in order to generate results for the new threshold.

    After stage 4, there are 10 candidate features ranked as follows:
    John Oliver High School with a weighted score of 1000
    South Memorial Park with a weighted score of 903
    Mountain View Cemetery with a weighted score of 891
    Alexander MacKenzie School with a weighted score of 780
    MacDonald Park with a weighted score of 369
    Cartier Park with a weighted score of 300
    John Henderson School with a weighted score of 214
    Sunset Park with a weighted score of 206
    Gray's Park with a weighted score of 191
    McBride Annex School with a weighted score of 151

Figure 5.10: Ranking of Top 10 Features
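Returning to the threshold handling of Section 5.3, a minimal sketch of how the defaults, the command line, and the rerun prompt could fit together is shown below. Argument and error handling are simplified, and mine_one_pass() is a hypothetical stand-in for the filtering and ranking of Chapter 4; this is not the thesis code.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical driver: the final/hull threshold th_h comes from the
     * command line (default 10), the rectangle and circle thresholds are
     * derived as 2*th_h and 3*th_h, and after each run the user may enter
     * a new threshold (0 terminates).  Memoized results make reruns cheap. */
    extern void mine_one_pass(const char *feature_file,
                              const char *cluster_file,
                              int th_c, int th_r, int th_h);

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: mining feature_file cluster_file [threshold]\n");
            return 1;
        }
        const char *feature_file = argv[1];
        const char *cluster_file = argv[2];
        int th_h = (argc > 3) ? atoi(argv[3]) : 10;   /* e.g. "mining f c 25" */

        while (th_h > 0) {
            mine_one_pass(feature_file, cluster_file, 3 * th_h, 2 * th_h, th_h);
            printf("Previous hull threshold was %d\n", th_h);
            printf("Enter new threshold, or 0 to terminate.\n");
            if (scanf("%d", &th_h) != 1)
                break;                /* reruns reuse the memoized results */
        }
        return 0;
    }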
5.4 Computation of Aggregate Statistics

The pre-defined ranges (distances) for our case study were chosen to be 250 metres, 500 metres, 1000 metres, and 2000 metres. These values were defined as constants in the program (DIST1, DIST2, DIST3, and DIST4 respectively), and they can easily be changed before compilation. A perfect score would be 1000, which corresponds to the case where 100% of the houses are located within 250 metres of the feature. The computation of aggregate statistics was formally defined in Section 4.5.

Figure 5.10 is an example of the output that was produced for a cluster of 71 houses in south-central Vancouver. By default, only the final rankings are given, thereby producing a concise top 10 list. It is certainly understandable that a user may wish to have more details than just the executive summary shown in Figure 5.10. For this reason, an additional parameter can be defined before compilation: a constant in the program allows for the presentation of detailed results, namely the percentage of houses that fall within each DIST_i range, as well as the minimum and maximum house-to-feature distances. (As indicated in Section 4.5, we define "maximum" to be the largest of the n "minimum" values, where n is the number of houses in the cluster.) Figure 5.11 shows this additional output, and Appendix A shows the output for one complete execution of the program, including detailed aggregate statistics. Ideally, all of the parameters that a user may be interested in changing, including bucket size, bisection tolerance, level of detail, etc., would be provided as run-time flags, thereby eliminating the need to recompile the program. For the research described in this thesis, the implementation is complete; however, a nicer front-end, run-time options, and perhaps even graphical output could all be added to the program.

    The cluster is being compared to Alexander MacKenzie School...
    Maximum house-to-boundary distance: 509.4 metres
    Minimum house-to-boundary distance: 48.3 metres
    47.1% of the houses are within 250 metres of Alexander MacKenzie School.
    97.1% of the houses are within 500 metres of Alexander MacKenzie School.
    100.0% of the houses are within 1000 metres of Alexander MacKenzie School.

Figure 5.11: Detailed Results for One of the Top 10 Features

5.5 Memoization

As discussed in Section 4.4, memoization is performed to avoid redundant calculations and hence achieve scalability and efficient incrementality. To show how useful the memoized aggregate statistics are, consider the following incremental situation. The aggregate distance calculations when going from a threshold of 10, to a threshold of 100, to a threshold of 99 (all in the same run) took 0.02 seconds, 0.18 seconds, and 0.00 seconds respectively. The 0.00 figure reflects the fact that all 99 features had their aggregate statistics calculated previously, so only a lookup operation was required.

The number of buckets for the memoization routines was chosen to be 100, and the distance interval was chosen to be 100 metres (both values being defined as constants).
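As a hedged illustration of what the per-feature memo record and the bucket rule of Section 4.4 might look like, consider the sketch below. Field names, constants, and the exact layout are assumptions, not the thesis code; the thesis simply records the separation distances, an intersection flag, and a pointer to a lazily allocated block of aggregate statistics alongside each feature node.

    /* Illustrative memoization record and bucket-index rule. */
    #define NUM_BUCKETS      100      /* j buckets                        */
    #define BUCKET_INTERVAL  100.0    /* k metres per bucket              */

    typedef struct {
        double min_dist;      /* smallest house-to-hull distance (metres) */
        double max_dist;      /* largest of the per-house minimums        */
        double pct_within[4]; /* % of houses within DIST1..DIST4          */
    } agg_stats_t;

    typedef struct feature {
        double separation;    /* memoized cluster-feature separation      */
        int    intersects;    /* flag: hulls intersect, skip future tests */
        agg_stats_t *agg;     /* NULL until statistics are first computed */
        struct feature *next; /* linked list of features in a bucket      */
    } feature_t;

    /* Bucket 0 holds intersecting features; bucket i (1 <= i <= j-2) holds
     * the remaining features within i*k metres; the last bucket catches
     * everything else.                                                    */
    int bucket_index(const feature_t *f)
    {
        if (f->intersects)
            return 0;
        int i = (int)(f->separation / BUCKET_INTERVAL) + 1;
        return (i < NUM_BUCKETS - 1) ? i : NUM_BUCKETS - 1;
    }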
Recall from Section 4.4 that this means that the first bucket holds pointers to all features that intersect the given cluster, the second bucket holds pointers to all other features whose boundaries lie within 100 metres of the cluster, the third bucket holds pointers to all other features lying within 200 metres of the cluster, and so on, with the last bucket holding pointers to all of the features which did not appear in the first 99 buckets. For many of our test cases, only the first few buckets needed to be examined to find the top 10 features (or perhaps 10 or 15 buckets to find the top 25 features).

We found that it did not matter very much whether 1000 buckets, 500 buckets, or 50 buckets were used. Similarly, the distance intervals did not matter very much. Granted, the candidates would appear in different buckets, but the CPU times were virtually identical. Our experiments revealed that only a poor choice of buckets or intervals would significantly affect CPU time, such as when using:

• a very large interval (with any number of buckets)

• a very small interval (with a small number of buckets)

(During our experiments, we noted a 20-30% increase in CPU time for a "poor choice" of bucket parameters. While we call this a "significant" increase, many users will not perceive any difference in response time, given that the CPU times are already quite small, i.e., less than 1 second.)

In the first scenario, a large number of candidates wound up in the second bucket (recall that the first bucket is devoted to intersections), and this meant that the entire second bucket had to be examined. In the second scenario, the last bucket (bucket x) contained the vast majority of the features, because the first x - 1 buckets contained an insufficient number of features to meet the top n list desired by the user; in this case, it was the last bucket that had to be examined. Note that not all of the entries in a bucket actually have to be sorted; in fact, sorting a huge linked list can be costly. For efficiency, we copied the memoized distances from the linked list into an array, and then performed selection (via partitioning) on the array to find the k-th smallest value, where k is the number of entries needed from the bucket to meet the threshold. Once the k-th smallest value was determined, we simply made a second pass through the linked list, extracting all candidates whose memoized distances were less than or equal to the k-th smallest value.

Table 5.3 shows the effect of changing the bucket sizes or intervals for one particular cluster and set of features.

    Number of    Interval    CPU Time    Bucket     Features
    Buckets      (metres)    (seconds)   Sorted     in Bucket
      1000           10        0.73       none        N/A
       100          100        0.73       none        N/A
        10          100        0.92       10th      50249
        10          150        0.73        8th          4
        10          200        0.75       none        N/A
        10         1000        0.73       none        N/A
        10         2000        0.72        2nd         69
        10         5000        0.83        2nd      29869
        10        10000        0.90        2nd      50069

Table 5.3: The Effect of Bucket Sizes and Bucket Intervals on CPU Time during Circle Filtering, for an Input Dataset Containing 50,275 Features

In all of the tests which we ran, we found that a default of 100 buckets at 100 metre intervals worked well. The leeway in specifying memoization parameters is a blessing, because it indicates that the user does not have to be concerned with tuning.
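The selection step described above (finding the k-th smallest memoized distance in a bucket without sorting the whole bucket) can be sketched as follows. This is a hedged illustration using a standard quickselect over a copy of the distances, not the thesis code; helper names are assumptions.

    /* Returns the k-th smallest value (0-based k) of d[0..n-1].
     * Average-case O(n); the array is permuted in place.                 */
    static void swap_d(double *a, double *b) { double t = *a; *a = *b; *b = t; }

    double kth_smallest(double d[], int n, int k)
    {
        int lo = 0, hi = n - 1;
        while (lo < hi) {
            /* Partition the range around its last element. */
            double pivot = d[hi];
            int store = lo, i;
            for (i = lo; i < hi; i++)
                if (d[i] < pivot)
                    swap_d(&d[i], &d[store++]);
            swap_d(&d[store], &d[hi]);     /* pivot is now in final place */
            if (k == store) return d[k];
            if (k < store)  hi = store - 1;
            else            lo = store + 1;
        }
        return d[lo];
    }

A second pass over the bucket's linked list then keeps every candidate whose memoized distance is less than or equal to the value returned.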
5.6 Performance

The performance results shown in the following graphs and tables are based on a time-sharing Sun SPARCstation-10 machine (SuperSPARC/SuperCache) with 64 MB of main memory. The operating system is SunOS Release 4.1.3. The CPU statistics shown are accurate to approximately 17 milliseconds, and are quoted in hundredths of seconds.

Figure 5.12: Scalability of Algorithm CRH (CPU time in seconds versus number of features, for the linear, bisection, and memoization modes)

5.6.1 Scalability

This set of experiments evaluates the scalability of Algorithm CRH using the linear, bisection, and memoization modes. The thresholds associated with the circle, rectangle, and hull filters were set to 75, 50, and 25 respectively, yielding a top 25 list. (Unless indicated otherwise, all of the graphs and tables in this chapter reflect a top 25 list.) The number of features tested varied from 326 to 50,275.

Figure 5.13: Scalability of the Circle Filter (CPU time in seconds versus number of features, for all 3 modes)

In Figures 5.12 and 5.13, the x axes of the two graphs represent the number of features, and the y axes give the CPU time used in filtering. Figure 5.12 demonstrates how well this implementation of Algorithm CRH scales: over 50,000 features can be examined in approximately 1 second of CPU time. Figure 5.13 shows that the scalability of Algorithm CRH is essentially determined by the circle filter, which is not surprising since it is the first filter in the sequence, and the encompassing circle routine is quite efficient. Close scrutiny of the two graphs shows how similar the performance statistics are. By the time the first filter completes, we are left with a relatively small number of candidates to test, so even if subsequent filters require substantially more processing time per feature, the overall CPU time can still be quite respectable.

To confirm that Algorithm CRH is scalable, Tables 5.4 and 5.5 show the breakdown of time spent in each of the filters (plus the aggregate statistics calculations) for a varying number of features. Table 5.4 shows the results for a top 25 list, whereas Table 5.5 shows a separate run for a top 10 list. We provide both tables to contrast the differences in CPU times attributable to hull filtering and aggregate distance calculations. (To compute the convex hull for the cluster and for each of the features that were passed to the hull filter, we performed dynamic calls to the "qhull" (quick hull) program. This robust program was developed at the University of Minnesota's Computational Geometry Center. We had to change some of the qhull source code to enable us to perform the dynamic calls from our data mining program.) These statistics also show the importance of quantity control (i.e., thresholds). If we were to include more filters that are very accurate, but relatively expensive to compute, we would still expect the total run-time to be well under control, provided, of course, that the number of features examined by such a filter is not great.

As indicated in Sections 4.3.1 and 5.2, there are a few situations in which the linear and bisection modes can outperform memoization. These are situations where the density of features is very high, and the number of passes required by the two shape enlargement modes is very small. Of course, the number of passes can also be quite large.
    Number of    Circle    Rectangle    Hull    Aggregate    Total
    Features                                    Statistics
        326        0.00       0.00      0.20       0.07       0.27
      2,585        0.03       0.00      0.18       0.05       0.26
     25,175        0.35       0.00      0.20       0.05       0.60
     50,275        0.75       0.00      0.22       0.05       1.02

Table 5.4: Breakdown of CRH Processing Time (in seconds) for the Top 25 Features

    Number of    Circle    Rectangle    Hull    Aggregate    Total
    Features                                    Statistics
        326        0.00       0.00      0.08       0.02       0.10
      2,585        0.03       0.00      0.07       0.02       0.12
     25,175        0.33       0.00      0.10       0.00       0.43
     50,275        0.73       0.00      0.08       0.02       0.83

Table 5.5: Breakdown of CRH Processing Time (in seconds) for the Top 10 Features

Thus, as far as consistency is concerned, memoization is the best mode in that it guarantees no extra passes, is more precise, and is almost always more efficient. If the circle filter and the rectangle filter were not used, Table 5.6 indicates how much time it would take just to perform the hull filtering. Note the number of passes (iterations) that are required for the various modes. (The results in Table 5.6 reflect a top 10 list.) Figure 5.12 and Tables 5.4 and 5.5 confirm the efficiency of Algorithm CRH's multiple filtering approach in processing a large number of features. We conclude that Algorithm CRH delivers the desired scalability.

    Mode           Features    Iterations    CPU Time
                   Examined       Made       (seconds)
    Linear             326          50          1.27
    Linear           2,585          50         10.07
    Linear          25,175          50         99.10
    Linear          50,275          50        195.14
    Bisection          326           6          1.85
    Bisection        2,585           6         14.33
    Bisection       25,175           6        141.49
    Bisection       50,275           6        293.94
    Memoization        326           1          1.33
    Memoization      2,585           1         10.62
    Memoization     25,175           1        104.35
    Memoization     50,275           1        212.52

Table 5.6: Performance when Convex Hull Filtering is the Sole Filter

5.6.2 Mix of Clusters and Features

At this point, the reader may be wondering if the performance results are due to the choice of cluster (which in turn would determine the types of features that would be analyzed and ranked). After all, we realize that:

• The size of the cluster (in terms of area) could be small or large. This would affect the number of iterations required by the linear or bisection mode.

• The neighbourhood around the cluster (e.g., up to 3 kilometres from the boundary of the cluster) could be densely or sparsely populated with features.

• The features could have many or few vertices, and similarly their hulls could have many or few vertices. This would affect the time required to perform the distance calculations and to construct the convex polygons.

• The number of points in the cluster could vary greatly. This would affect the amount of time required to calculate the aggregate statistics.

Figure 5.14: Example Using a 250 Hectare Cluster (CPU time in seconds versus number of features, for the linear, bisection, and memoization modes)

The results which we saw in Figure 5.12 are from an L-shaped neighbourhood occupying approximately 7 hectares (0.07 square kilometres). There are 71 houses in this dense cluster. The convex hull consists of 5 points, and intersects 2 features: a high school and a cemetery. Most of the features located within a few kilometres of this neighbourhood are rectangular in shape. This area of south-central Vancouver is nowhere near the coastline.
Figure 5.15: Example Using a 40 Hectare Cluster (CPU time in seconds versus number of features, for the linear, bisection, and memoization modes)

Figure 5.14 is an example of a moderately large neighbourhood of approximately 250 hectares. There are 177 houses in this relatively sparse, irregularly-shaped cluster. There are 17 points in the cluster's convex hull, and the hull encloses 2 features (a high school and a community centre/park). There are a lot of irregular features (including beaches) within a few kilometres of the cluster's boundary. Note that because of the size of the cluster and the fact that the cluster has many nearby features, the linear mode outperformed the bisection mode; however, memoization still outperformed the linear mode.

Figure 5.16: Example Using a 25 Hectare Cluster (CPU time in seconds versus number of features, for the linear, bisection, and memoization modes)

Figure 5.15 is an example of a small, rectangular neighbourhood. There are 54 houses in this cluster (approximately 40 hectares), and there are many rectangular features within a few kilometres of the cluster's boundary. The cluster's convex hull consists of 4 points and does not intersect any neighbouring feature, although several features are very close to the cluster. Since this cluster is smaller than the cluster shown in Figure 5.14, the time required for the linear mode is considerably larger.

Figure 5.16 is an example of a small, irregularly shaped neighbourhood. There are 49 houses in this relatively dense cluster spanning approximately 25 hectares. Its convex hull intersects no feature, and there are a lot of irregular features within a few kilometres of the cluster's boundary. Memoization easily outperforms bisection, which in turn easily outperforms the linear mode. This ordering is typically the ordering that Algorithm CRH yields for an arbitrary cluster in an arbitrary area of town.

5.6.3 Incrementality

So far, we have seen that Algorithm CRH with memoization is efficient and scalable. In the following set of experiments, we increased and decreased the thresholds to evaluate how well memoization supports incrementality. In particular, for a pool of 50,275 features, we changed the set of final thresholds th_h from:

1. 10 to 25 to 15

2. 5 to 10 to 15 to 20 to 25

3. 25 to 10

Case 1 began by delivering the top 10 features, followed (in the same run) by the top 25 features, followed by the top 15 features. The top 10 features were returned in 0.85 seconds of CPU time. When the final threshold was raised from 10 to 25, only an additional 0.16 seconds (0.13 seconds for the hulls, and 0.03 seconds for the aggregate distance calculations) were required to get the additional 15 features. The number of new calculations was kept small because the results from the th_h = 10 case had been memoized. Finally, when the top 25 features were reduced to the top 15 features, 0.00 additional seconds were required, since all of the previous results were memoized and no new calculations were required.
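Cases 2 and 3, described next, behave the same way. The reason reruns are so cheap is simply that the cached aggregate statistics are checked before any recomputation; a minimal, hypothetical sketch (reusing the illustrative feature record from Section 5.5, not the thesis code) is:

    /* Aggregate statistics are computed at most once per feature and cached
     * behind a pointer, so a changed threshold only pays for features that
     * have never survived the hull filter before.                         */
    extern agg_stats_t *compute_aggregate_stats(const feature_t *f); /* costly */

    const agg_stats_t *aggregate_stats(feature_t *f)
    {
        if (f->agg == NULL)            /* first time this feature survives */
            f->agg = compute_aggregate_stats(f);
        return f->agg;                 /* later reruns: lookup only        */
    }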
Case 2 delivered the "top 5" in 0.80 seconds, followed by the "top 10" in an additional 0.05 seconds, the "top 15" in 0.07 more seconds, the "top 20" in 0.09 more seconds, 1 0 Reca l l that the default threshold value of thc is 3 times that of thh, and that the default value of thr is 2 times that of thh- These rules were obeyed throughout the incremental processing cycle. Chapter 5. Case Study and Implementation 86 and finally the "top 25" in 0.07 more seconds. Even though the step size (additional features requested) was 5 in each incremental cycle, we should not necessarily expect the incremental C P U times to be the same since the features added to the list may contain many more (or fewer) points than the earlier features. For example, a feature having 100 vertices would involve far more calculations than a feature having only 4 vertices. Case 3 delivered the "top 25" in 0.98 seconds, followed by the "top 10" in (the expected) 0.00 additional seconds. Similar results were obtained for all of our test cases, confirming that incrementality is efficiently handled. Chapter 6 Conclusions 6.1 Summary We have studied a specific problem in spatial data mining, namely that of efficiently determining aggregate proximity relationships between a cluster and a large number of features. We have shown that because of scalability and expressiveness, no single com-putational geometry technique (of which there are many) is sufficient. Instead, we have concluded that a multiple filtering approach is most appropriate for our task. Specifi-cally, we developed Algorithm C R H which performs filtering via successive approxima-tions, namely encompassing circles, isothetic rectangles, and convex hulls. In particular, in less than 1 second of C P U time, an implementation of Algorithm C R H can examine the spatial relationships between a cluster and 50,000+ features, and can compute and rank the aggregate proximity statistics for the most promising features. Among the three modes of Algorithm C R H that we have studied (i.e., linear, bisection, and memoization), the version with memoization of distances is almost always the most efficient and certainly the most consistent since it only needs to make a single pass through the features. Algorithm C R H with memoization also provides for extremely efficient incremental processing when the thresholds are changed. By performing an implementation of a spatial data mining program, we have shown that Algorithm C R H efficiently supports expressiveness, scalability, and incrementality — all of which are critical issues in spatial data mining. 87 Chapter 6. Conclusions 88 6.2 Future Work 6.2.1 Abstract Features and Generalizations While the research reported in this thesis has dealt with a specific kind of spatial relation-ship problem (i.e., aggregate proximity, in a data mining context), it would be interesting to report other types of spatial relationships. For instance, developing techniques to han-dle abstract features represented in a manner similar to contour lines or isobars may lead to correlations between a cluster and many kinds of demographic data, such as family income level, ethnic background, household size, religious affiliation, etc. While there are certainly some similarities between abstract features and the features described in this thesis, we do not claim that our techniques can effectively support contour lines (for example) in the same way that we handle features. 
Even though contour lines can be represented as polygons, there are some major differences between a typical feature and a contour line. First of all, a contour line that describes an elevation of 10 metres is likely to encompass much of Vancouver. How then should expressiveness be handled? Even a contour line of elevation 20 metres is likely to encompass a large chunk of Vancouver, although not quite as much as the 10 metre contour line. Furthermore, the 20 metre contour line must (by necessity) be nested inside the 10 metre line. It would certainly be a non-trivial task to see how steepness (i.e., a small distance between successive contour lines) can be detected and reported by the program. It is quite possible that a housing cluster located near a cliff has a very nice view of some part of the landscape. Surely this information is worth reporting to the user. Contour lines may also "run off" the edge of the map. How should these situations be handled? It can be argued that the edge of the map should be included as one of the edges in the contour's "polygon". If so, how should steepness be handled? After all, Chapter 6. Conclusions 89 these cases degenerate around the edges of the map, and it would be inappropriate to say that there is a cliff at the edge of a map. The fact that contour lines are nested suggests a hierarchy of contours. For example, while it might be true that some of the houses in a particular cluster lie above 130 metres in elevation, and that most of the houses in that same cluster lie above 110 metres in elevation, it might also be true that all of the houses in the cluster lie above 90 metres in elevation. Would a statement such as "100% of the houses lie above 90 metres in elevation" be of interest to the user? Perhaps. For example, in a very flat urban area, most of the houses might be situated near sea level, but there may be a few hills in the city. If this is the case, then the user would probably be interested in noting that all of the houses lie above 90 metres in elevation. On the other hand, in an urban area consisting of a huge variety of elevations, the same statement may be of no value to the user. If we extend our reasoning to include demographic data such as income level, it becomes clear that the size of an income level "polygon" might be quite large. For example, many hectares of the city of Vancouver might be described by large polygons within which the average family income exceeds $40,000. Should such a polygon be broken down into smaller, well-defined income levels (i.e., multiple polygons)? Consider the following family income distribution: • $0 to $19,999 • $20,000 to $39,999 • $40,000 to $59,999 • $60,000 to $79,999 • $80,000 to $99,999 Chapter 6. Conclusions 90 • $100,000 or more Note that a hierarchy exists. Any family whose income is $100,000 or more is at least $80,000, which is also at least $60,000, and so on. This is interesting because it means that these intervals can be combined to produce statements such as "94% of the houses in the cluster represent a family income of at least $80,000". Such a statement may hold more impact than individual statements like "64% of the houses in the cluster represent a family income between $80,000 and $99,999" and "30% of the houses in the cluster represent a family income of $100,000 or more". Since hierarchies are involved, this permits generalization (such as the statement just mentioned concerning 94% of the houses). 
This means that the user will be presented with fewer details and, presumably, more meaningful, concise statements. Obviously, there are trade-offs in the quality and quantity of statements generated by any data mining system. A glut of information is often undesirable since the user typically wants a generalization of the data [Han et al, 1992], with highlights being drawn to the most significant proximity relationships. If the output is clouded with dozens (if not hundreds) of pieces of information, then the user may be less likely to use the system because the statements generated may wind up being a potpourri of information that the user must analyze manually. For this reason, thresholds should be defined that allow a user to specify the number of relationships to report. The attribute-oriented approach of the DBLearn system [Han et al, 1992] [Han et al, 1993] [Lu et al, 1993] provides an excellent example of generalizations and thresholds in data mining systems, including spatial data mining systems. We conclude this section by mentioning that some data mining systems allow users to define their own hierarchies using data dictionaries [Dhar and Tuzhilin, 1993]. For the example presented in this paper, recreational areas may constitute a hierarchy. Within it, Chapter 6. Conclusions 91 we might find community centres, golf courses, and parks. Within parks, we might find beaches, playgrounds, picnic areas, hiking trails, etc. Users may even attach preferences (weights) to branches that appear on the same level of the hierarchy, so that a golf course, for example, is ranked higher than a community centre, or a beach is ranked higher than a playground. 6.2.2 Boundary Shape Matching Incorporating alpha-shapes into the data mining program may help to discover features whose boundary shapes closely match that of a cluster. Boundary shape matching can be partial or total. For example, if a housing cluster borders a shoreline and extends up a hill , it is likely that only the lower portion of the cluster's boundary follows the shoreline's boundary. Even if only 10% of the cluster's boundary follows the shape of the shoreline, this information may still be of value to the user since it suggests a spatial relationship. A housing cluster in the middle of a forest, and a cluster of houses surrounding a lake are examples of total boundary shape matching. The use of alpha-shapes in spatial data mining is a novel and interesting application. Alpha-shape implementation will require some thought since it is unclear what a-values should be used. Although the boundary of each feature can be defined fairly precisely via digitization, this is not generally the case for a cluster. This poses a question that is very difficult to answer: What is the "exact" shape of a cluster of points? For a cluster that is non-convex (from a human's perspective), we can certainly use alpha-shapes to find the best convex approximation (i.e., the convex hull), but alpha-shapes allow us to get a finer resolution (i.e., a non-convex shape). It is unclear how fine the resolution should be since very fine resolutions may be "too fine". For example, there may be severe, narrow "cavities" or fissures in the polygon representing the housing cluster. Consider Chapter 6. Conclusions 92 (a) Outline of Letter "A" (b) Fissures Distorting Letter "A" Figure 6.17: Fissures in an Alpha-Shape Figure 6.17(a). To a human, this figure may immediately be recognizable as an outline of the letter " A " . Now consider Figure 6.17(b). 
Since a machine may not be able to determine an appropriate ct-value, this is one of many possible edge constructions that the machine may come up with. The long, narrow fissures serve to distort the " A " . The distortion may become so great that it may be difficult for a human to recognize any similarity between the resulting shape and the " A " pattern. In this light, if a rectangular city block in a residential area no longer takes on a rectangular appearance, then perhaps some approximation which comes up with a more recognizable rectangular shape might be better. The reason that we are interested in avoiding "fissures" in our shape description is that we want to use the resulting shape to perform boundary matching with features that (typically) do not have these kinds of fissures. The National Center for Supercomputing Applications and the Department of Com-puter Science at the University of Illinois at Urbana-Champaign has released a software package that deals with alpha-shapes. For a cluster of points having a pre-defined shape, we used this package to observe how different a-values can affect the "intuitive" shape of the given cluster. In fact, the alpha-shape could become disconnected and "unrecog-nizable" for a slight change in the a-value. For the non-convex shapes that we tested, Chapter 6. Conclusions 93 we found that the "best" a-values occurred when the number of voids (holes) in the alpha-shape was zero, and the number of connected components was 1. Alpha-shapes containing holes and multiple components would probably not be of much value in defin-ing a "suitable" cluster shape for our data mining example. Assuming that we can get a "suitable" shape for the boundary of the cluster, how then may boundary matching take place? One way is to determine the actual separation between the boundaries of the 2 polygons, over some interval. If the separation is rela-tively constant over part of the perimeter of the boundaries (e.g., the boundaries may be x metres apart ±y metres (y : mining all.326 biged *** 326 fea t u r e s have been loaded s u c c e s s f u l l y . *** *** 1 c l u s t e r has been loaded s u c c e s s f u l l y . 
*** CLUSTER NAME: b i g ed's neighbourhood CPU time to load f e a t u r e s and c l u s t e r i s 0.42 seconds Here are the f i l t e r i n g thresholds being used: C i r c l e t h r e s h o l d = 30 Rectangle t h r e s h o l d = 20 H u l l t h r e s h o l d = 10 FILTER #1: Encompassing C i r c l e F i l t e r CPU time to perform stage 1 of f i l t e r i n g i s 0.00 seconds A f t e r stage 1 of f i l t e r i n g , there are 30 candidate f e a t u r e s 99 Alexander MacKenzie School John Oliver High School Mitchell Island MacDonald Park South Memorial Park Mountain View Cemetery Queen Elizabeth Park Cartier Park Langara Golf Course John Henderson School Sunset Park Sanford Fleming School Gray's Park Kensington Park McBride Annex School Van Horne School General Brock School Henderson Annex School Langara Community College Nat Bailey Stadium Tecumseh Annex School Richard McBride School Tecumseh Park Gordon Park Columbia Park Riley Park Tecumseh School Hillcrest Park Moberly Park Oakridge Shopping Centre FILTER #2: Isothetic Rectangle Filter CPU time to perform stage 2 of filtering is 0.00 second After stage 2 of filtering, there are 20 candidate feature Mountain View Cemetery John Oliver High School Alexander MacKenzie School South Memorial Park MacDonald Park Cartier Park Gray's Park Sunset Park John Henderson School McBride Annex School Henderson Annex School General Brock School Van Home School Kensington Park Sanford Fleming School Queen Elizabeth Park Riley Park 101 Tecumseh Annex School Langara Community College Langara Golf Course FILTER #3: Convex Hull Filter CPU time to perform stage 3 of filtering is 0.08 seconds After stage 3 of filtering, there are 10 candidate features: John Oliver High School Mountain View Cemetery Alexander MacKenzie School South Memorial Park MacDonald Park Cartier Park John Henderson School Sunset Park Gray's Park McBride Annex School STAGE #4: Perform aggregate distance calculations. The cluster is being compared to John Oliver High School... Maximum house-to-boundary distance: 180.8 metres Minimum house-to-boundary distance: 17.7 metres 100.0% of the houses are within 250 metres of John Oliver High School. 102 The cluster is being compared to Mountain View Cemetery... Maximum house-to-boundary distance: 348.6 metres Minimum house-to-boundary distance: 14.3 metres 72.9% of the houses are within 250 metres of Mountain View Cemetery. 100.0% of the houses are within 500 metres of Mountain View Cemetery. The cluster is being compared to Alexander MacKenzie School. . . Maximum house-to-boundary distance: 509.4 metres Minimum house-to-boundary distance: 48.3 metres 47.1% of the houses are within 250 metres of Alexander MacKenzie School 97.1% of the houses are within 500 metres of Alexander MacKenzie School 100.0% of the houses are within 1000 metres of Alexander MacKenzie School The cluster is being compared to South Memorial Park... Maximum house-to-boundary distance: 342.7 metres Minimum house-to-boundary distance: 117.3 metres 75.7% of the houses are within 250 metres of South Memorial Park. 100.0% of the houses are within 500 metres of South Memorial Park. The cluster is being compared to MacDonald Park... Maximum house-to-boundary distance: 738.7 metres 103 Minimum house-to-boundary distance: 384.4 metres 0.0% of the houses are within 250 metres of MacDonald Park. 22.9% of the houses are within 500 metres of MacDonald Park. 100.0% of the houses are within 1000 metres of MacDonald Park. The cluster is being compared to Cartier Park... 
Maximum house-to-boundary distance: 981.5 metres Minimum house-to-boundary distance: 585.8 metres 0.0% of the houses are within 250 metres of Cartier Park. 0.0% of the houses are within 500 metres of Cartier Park. 100.0% of the houses are within 1000 metres of Cartier Park. The cluster is being compared to John Henderson School.. . Maximum house-to-boundary distance: 1259.3 metres Minimum house-to-boundary distance: 721.8 metres 0.0% of the houses are within 250 metres of John Henderson School 0.0% of the houses are within 500 metres of John Henderson School 57.1% of the houses are within 1000 metres of John Henderson School 100.0% of the houses are within 2000 metres of John Henderson School The cluster is being compared to Sunset Park... Maximum house-to-boundary distance: 1278.4 metres Minimum house-to-boundary distance: 739.9 metres 104 0.0% of the houses are within 250 metres of Sunset Park. 0.0% of the houses are within 500 metres of Sunset Park. 52.9% of the houses are within 1000 metres of Sunset Park. 100.0% of the houses are within 2000 metres of Sunset Park. The cluster is being compared to Gray's Park... Maximum house-to-boundary distance: 1258.2 metres Minimum house-to-boundary distance: 719.8 metres 0.0% of the houses are within 250 metres of Gray's Park. 0.0% of the houses are within 500 metres of Gray's Park. 45.7% of the houses are within 1000 metres of Gray's Park. 100.0% of the houses are within 2000 metres of Gray's Park. The cluster is being compared to McBride Annex School... Maximum house-to-boundary distance: 1387.6 metres Minimum house-to-boundary distance: 850.3 metres 0.0% of the houses are within 250 metres of McBride Annex School 0.0% of the houses are within 500 metres of McBride Annex School 25.7% of the houses are within 1000 metres of McBride Annex School 100.0% of the houses are within 2000 metres of McBride Annex School CPU time to perform stage 4 is 0.02 seconds After stage 4, there are 10 candidate features ranked as follows: 105 John Oliver High School with a weighted score of 1000 South Memorial Park with a weighted score of 903 Mountain View Cemetery with a weighted score of 891 Alexander MacKenzie School with a weighted score of 780 MacDonald Park with a weighted score of 369 Cartier Park with a weighted score of 300 John Henderson School with a weighted score of 214 Sunset Park with a weighted score of 206 Gray's Park with a weighted score of 191 McBride Annex School with a weighted score of 151 Previous hull threshold was 10 Enter new threshold, or 0 to terminate. Total CPU time used by this program is 0.53 seconds Command<2> 106