UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Bayesian cluster validation Koepke, Hoyt Adam 2008

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2008_fall_koepke_hoyt.pdf [ 3.84MB ]
Metadata
JSON: 24-1.0051357.json
JSON-LD: 24-1.0051357-ld.json
RDF/XML (Pretty): 24-1.0051357-rdf.xml
RDF/JSON: 24-1.0051357-rdf.json
Turtle: 24-1.0051357-turtle.txt
N-Triples: 24-1.0051357-rdf-ntriples.txt
Original Record: 24-1.0051357-source.json
Full Text
24-1.0051357-fulltext.txt
Citation
24-1.0051357.ris

Full Text

BAYESIAN CLUSTER VALIDATION  by Hoyt Adam Koepke B.A., The University of Colorado at Boulder, 2004  a thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in  the faculty of graduate studies (Computer Science)  The University Of British Columbia (Vancouver) August 2008    Hoyt Adam Koepke, 2008  Ab stract W e propose a novel framew ork b ased on Bayesian principles for validating clustering s and present effi cient alg orithms for use w ith centroid or ex emplar b ased clustering solutions. O ur framew ork treats the data as fi x ed and introduces perturb ations into the clustering procedure. In our alg orithms, w e scale the distances b etw een points b y a random variab le w hose distrib ution is tuned ag ainst a b aseline null dataset. The random variab le is integ rated out, yielding a soft assig nment matrix that g ives the b ehavior under perturb ation of the points relative to each of the clusters. F rom this soft assig nment matrix , w e are ab le to visualiz e inter-cluster b ehavior, rank clusters, and g ive a scalar index of the the clustering stab ility. In a larg e test on synthetic data, our method matches or outperforms other leading methods at predicting the correct numb er of clusters. W e also present a theoretical analysis of our approach, w hich sug g ests that it is useful for hig h dimensional data.  ii  Tab le of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table of Contents  . . . . . . . . . . . . . . . . . . . . . . . . . .  ii iii  List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures  . . . . . . . . . . . . . . . . . . . . . . . . . . . .  1 Cluster Analysis and Validation . . . . . . . . . . . 1.1 The Clustering F unction . . . . . . . . . . . . . . . . 1.1.1 Stag es of the Clustering F unction . . . . . . . 1.1.2 Common Clustering Alg orithms . . . . . . . . 1.1.3 The Sq uared E rror Cost F unctions . . . . . . 1.2 Cluster V alidation . . . . . . . . . . . . . . . . . . . 1.2.1 Clustering Stab ility . . . . . . . . . . . . . . 1.2.2 Clustering Similarity Indices . . . . . . . . . 1.3 G ap Statistic . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . .  . . . . . . . . .  . . . . . . . . .  . . . . . . . . .  ix  . . . . . . . . .  1 2 3 4 6 9 10 13 17  2 A Bayesian Framework for Clustering Stability . . . . . . 2.1 The Ab stract F ramew ork . . . . . . . . . . . . . . . . . . . . 2.1.1 The Averag ed Assig nment M atrix . . . . . . . . . . . 2.1.2 The M atching M atrix . . . . . . . . . . . . . . . . . . 2.1.3 P erturb ations and L ab el M atching . . . . . . . . . . . 2.2 V isualiz ations and Statistical Summaries of the Averag ed Assig nment M atrix . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 H eatmap P lot of Φ . . . . . . . . . . . . . . . . . . . . 2.2.2 Scalar Stab ility Indices . . . . . . . . . . . . . . . . . . 2.2.3 E x tensions to O ther Stab ility Indices . . . . . . . . . . 2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  19 20 21 22 22 23 23 26 28 30  3 Bayesian Cluster Stability Algorithms . . . . . . . . . . . . 31 iii  3.1  3.2  3.3  3.4 3.5 3.6  3.7  3.8  4  P erturb ing the distance metric . . . . . . . . . . . . . . . . . 3.1.1 Intuitive Understanding . . . . . . . . . . . . . . . . . 3.1.2 E x ample: E x ponential P rior . . . . . . . . . . . . . . . P rior Selection Using a Baseline N ull Distrib ution . . . . . . 3.2.1 O b servations from H ig h Dimensions . . . . . . . . . . 3.2.2 Tunab le P riors . . . . . . . . . . . . . . . . . . . . . . 3.2.3 Types of Baseline Distance M atrices . . . . . . . . . . Scaled-distance P erturb ations w ith a L ocation E x ponential P rior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Analytic Calculation of φLE . . . . . . . . . . . . . . . 3.3.2 Analytic Calculation of ∂φLjE /∂θ . . . . . . . . . . . . 3.3.3 Analytic Calculation of φLE and ∂φLjE /∂θ for Tied θ . . Scaled Distance P erturb ations w ith a Shifted G amma P rior . 3.4.1 Analytic Calculation of φSG . . . . . . . . . . . . . . . G eneral M onte Carlo Alg orithm . . . . . . . . . . . . . . . . . Approx imation Alg orithms . . . . . . . . . . . . . . . . . . . 3.6.1 E rror Bounds for the L ocation E x ponential P rior . . . 3.6.2 E rror Bounds for the Shifted G amma P rior . . . . . . Additional P roperties of the P ointw ise Stab ility . . . . . . . . 3.7.1 Diff erences Betw een tw o P ointw ise Stab ility Terms . . 3.7.2 Behavior of the P ointw ise Stab ility . . . . . . . . . . . E x tensions to O ther Indices . . . . . . . . . . . . . . . . . . . 3.8.1 O ptimiz ing AR and VI over θ . . . . . . . . . . . .  Synthetic Data for Cluster Validation Tests . . . . . . . . 4.1 R elated W ork . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Defi nitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Component Separation and P rox imity . . . . . . . . . 4.3 Stag e 1: P retransformed Component Distrib utions . . . . . . 4.3.1 Choosing L ocations for the Components . . . . . . . . 4.3.2 Setting the M ean of the Initial Cluster V ariance Distrib ution . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.3 Individual M ix ture Component Setting s . . . . . . . . 4.4 Stag e 2: Shaping the Components . . . . . . . . . . . . . . . 4.4.1 F ormal N otation . . . . . . . . . . . . . . . . . . . . . 4.4.2 Acceptab le Transformation F unctions . . . . . . . . . 4.4.3 R otation . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.4 Coordinate Translation . . . . . . . . . . . . . . . . . 4.4.5 Coordinate Scaling . . . . . . . . . . . . . . . . . . . . 4.4.6 E x ample Cluster Shaping P arameters . . . . . . . . . iv  32 33 34 35 36 36 38 40 41 44 48 48 50 53 56 58 60 61 61 65 70 70 73 75 75 76 77 77 78 79 80 81 81 82 84 84 85  4.5 4.6  4.7 4.8  Stag e 3: Adjusting the P roposed Cluster Distrib ution Sampling from the Distrib ution . . . . . . . . . . . . . 4.6.1 Determining Component Sample Siz es . . . . . 4.6.2 Sampling from the Components . . . . . . . . . User Set P arameters . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . .  . . . . . .  . . . . . .  85 87 88 90 91 91  . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . . . . . . . . . . . . .  5  Testing and Verifi cation . . . . . . . . . . . . . . . . 5.1 M ethods Tested . . . . . . . . . . . . . . . . . . . . . 5.1.1 Bayesian Cluster V alidation M ethods . . . . . 5.1.2 G ap Statistic . . . . . . . . . . . . . . . . . . 5.1.3 Data P erturb ation . . . . . . . . . . . . . . . 5.1.4 P redictors . . . . . . . . . . . . . . . . . . . . 5.2 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Data Types . . . . . . . . . . . . . . . . . . . 5.2.2 Clustering . . . . . . . . . . . . . . . . . . . . 5.2.3 Q uantitative E valuation . . . . . . . . . . . . 5.3 R esults . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 AN O V A Data, M edium Sample Siz e . . . . . 5.3.2 AN O V A Data, Small Sample Siz e . . . . . . . 5.3.3 Shaped Data, M edium Sample Siz e . . . . . . 5.3.4 Shaped Data, Small Sample Siz e . . . . . . . 5.3.5 Detailed Comparison . . . . . . . . . . . . . . 5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Increasing Dimension . . . . . . . . . . . . . 5.4.2 Cluster Shape . . . . . . . . . . . . . . . . . . 5.4.3 Sample Siz e . . . . . . . . . . . . . . . . . . . 5.4.4 O ther Considerations . . . . . . . . . . . . .  . . . . . . . . . . . . . . . . . . . . .  93 93 94 94 95 96 97 97 97 98 99 99 100 101 103 104 116 117 117 119 121  6  Limits in H igh Dimensions . . . . . . . . . . . . . . . . . . . 6.1 L imits of Clustering in H ig h Dimensions . . . . . . . . . . . . 6.1.1 P artitions . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.2 P roperties of The Sq uared E rror Cost F unction . . . . 6.1.3 N oisy Data . . . . . . . . . . . . . . . . . . . . . . . . 6.1.4 L emmas Concerning the Central L imit Theorem . . . 6.1.5 Impossib ility Theorem for Clustering in H ig h Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Behavior of the Averag ed Assig nment M atrix . . . . . . . . . 6.2.1 Converg ence of Φ as a function of R andom V ariab les . 6.3 Asymptotic Behavior of Φ in H ig h Dimensions . . . . . . . .  123 123 124 125 129 129  v  . . . . . . . . . . . . . . . . . . . . .  . . . . . .  132 137 138 141  6.3.1  6.4 7  Asymptotic Behavior of Φ w ith AN O V A-type M ix ture M odels . . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.3.2 An Impossib ility Theorem for Φ in H ig h Dimensions . 146 P rior Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 147  Additional R elevant R esearch . . . . . . . . . . . . 7.1 V alidation Alg orithms involving R eclustering . . . . 7.1.1 M onte Carlo Ab straction . . . . . . . . . . . 7.1.2 E valuating the N ormaliz ing Constant . . . . 7.1.3 Computational R esults . . . . . . . . . . . . . 7.1.4 Discussion . . . . . . . . . . . . . . . . . . . .  Bibliograp hy  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  14 9 149 150 150 156 157  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 8  vi  List of Tab les 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 5.13 5.14 5.15 5.16 5.17  The possib le scores of the cluster validation methods. . . . . Score results for AN O V A type data w ith 750 points. . . . . . Score results for small sample siz e, AN O V A type data w ith several predictors. . . . . . . . . . . . . . . . . . . . . . . . . Score results for Shaped data w ith several predictors. . . . . . The score results for Shaped data w ith several predictors. . . Sampling distrib ution of AP W -R C and AP W -P erm on 2D AN O V A data w ith 750 points using StdBeforeBest. . . . . . . Sampling distrib ution of SS-Draw s-V I on 2D AN O V A data w ith 750 points. . . . . . . . . . . . . . . . . . . . . . . . . . . Sampling distrib utions of the G ap statistic using tw o prediction rules on 2D AN O V A data w ith 750 points. . . . . . . . . Sampling distrib ution of AP W -R C on 2D AN O V A data w ith 100 points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sampling distrib ution of AP W -P erm, StdBeforeBest, on 2D AN O V A data w ith 100 points. . . . . . . . . . . . . . . . . . Sampling distrib ution of the g ap statistic on 2D AN O V A data w ith 100 points using the StdBeforeBest prediction rule. . . . Sampling distrib utions of AP W -R C w ith tw o diff erent predictors on 2D AN O V A data w ith 100 points. . . . . . . . . . Sampling distrib ution of AP W -P erm, StdBeforeBest, on 2D Shaped data w ith various sample siz es. . . . . . . . . . . . . . Sampling distrib ution of SS-Draw s-V I on 2D Shaped data w ith 100 points. . . . . . . . . . . . . . . . . . . . . . . . . . . Sampling distrib ution on 2D shaped data w ith 100 points using the g ap statistic. . . . . . . . . . . . . . . . . . . . . . . Sampling distrib ution of AP W -P erm, StdBeforeBest, on 100D AN O V A data w ith various sample siz es. . . . . . . . . . . . . Sampling distrib ution of AP W -R C, StdBeforeBest, on 100D AN O V A data w ith various sample siz es. . . . . . . . . . . . . vii  98 100 101 102 103 105 106 107 108 109 110 111 112 113 114 115 116  5.18 Sampling distrib ution of SS-Draw s-V I on 100D AN O V A data w ith 100 points. . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.19 Sampling distrib ution on 100D AN O V A data w ith 100 points using the g ap statistic. . . . . . . . . . . . . . . . . . . . . . . 118 5.20 Sampling distrib ution of AP W -R C and AP W -P erm, StdBeforeBest, on 100D Shaped data w ith 750 points. . . . . . . . . . . . . . 119 5.21 Sampling distrib ution of AP W -R C and AP W -P erm, StdBeforeBest, on 100D Shaped data w ith 100 points. . . . . . . . . . . . . . 120 5.22 Sampling distrib ution of SS-Draw s-V I on 100D Shaped data w ith 100 points. . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.23 Sampling distrib ution of the g ap statistic on 100D Shaped data w ith 100 points. . . . . . . . . . . . . . . . . . . . . . . . 121 6.1 6.2 7.1 7.2 7.3  N otational conventions used for asymptotic proofs of mix ture models in increasing dimension. . . . . . . . . . . . . . . . . . 142 N otational conventions used for asymptotic proofs of AN O V A mix ture models. . . . . . . . . . . . . . . . . . . . . . . . . . . 142 The terms used in this chapter. . . . . . . . . . . . . . . . . . 152 The results for the targ et distrib ution w ith C(α) = 1. . . . . 156 The results for the targ et distrib ution w ith C(α) dynamically updated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156  viii  List of F ig u res 1.1  The tw o stag es of the clustering function. . . . . . . . . . . .  3  2.1  E x ample of a heatmap plot show ing cluster interactions. . . .  25  3.1  P lots of the pointw ise stab ility at each point of a 2d plain g iven the cluster centers. . . . . . . . . . . . . . . . . . . . . .  37  4.1 4.2 4.3  E x ample plots of 2d mix ture models w ith diff erent check s on the transformations. . . . . . . . . . . . . . . . . . . . . . . . 2d mix ture models w ith corresponding samples (2000) as a function of the transform severity parameter α . . . . . . . . 2d mix ture models w ith corresponding samples (2000) as a function of the numb er of transformations per dimension. . .  ix  83 86 88  Ch ap ter 1  Clu ster Analy sis and Valid ation Clustering , or cluster analysis, is the process of identifying g roups of similar points in data w ith no associated class lab els. It occurs commonly as a sub prob lem in statistics, machine learning , data mining , b iostatistics, pattern recog nition, imag e analysis, information retrieval and numerous other disciplines. It is usually used to discover previously unk now n structure, b ut other uses, such as increasing effi ciency in accessing data, are also common. Because the prob lem is ub iq uitous, numerous clustering alg orithms have b een proposed and studied. W hile many clustering alg orithms produce a candidate partitioning , relatively few attach a measure of confi dence to the proposed clustering . An ideal clustering alg orithm w ould do b oth; how ever, such a procedure is not alw ays practical. As a result, the fi eld of cluster validation attempts to remedy this b y proposing methods to assess how w ell a proposed clustering of a dataset refl ects its intrinsic structure. The purpose of this thesis is to introduce a novel techniq ue for validating clustering s. W e b eg in b y b roadly outlining , as much as possib le, the fi eld of cluster analysis and cluster validation. Then, in chapter 2 w e introduce our method as a g eneral framew ork for cluster validation that can apply g enerally to most clustering procedures and types of data. Then, in chapter 3, w e apply it specifi cally to distance b ased clustering s – clustering s w here the fi nal partitioning depends on the distance b etw een the point and a centroid or ex emplar point. To test our method, w e then describ e a procedure in chapter 4 for g enerating synthetic data w ith the correct numb er of clusters. Chapter 5 g ives the results of a massive simulation on this data, compared 1  ag ainst several other leading methods, show ing that our method performs favorab ly. F urthermore, our method has several desirab le properties in hig h dimensions, w hich w e outline in chapter 6. In addition, w e also present results describ ing the b reak dow n of certain types of clustering s in hig h dimensions, w hich is itself of interest. F inally, in chapter 7, w e introduce a M onte Carlo alg orithm developed for the case of reclustering a distrib ution, b ut w hich may apply more b roadly as w ell. The focus of this chapter w ill b e on clustering analysis and then validation methods. W e w ill b riefl y describ e several important clustering alg orithms in section 1.1, b ut refer the reader to other resources for more in-depth treatments of the sub ject. In section 1.2.1, w e discuss the concept of stab ility as a cluster validation tool. This motivates our discussion of cluster similarity indices in section 1.2.2, as stab ility b ased methods rely heavily on these.  1 .1  Th e Clu stering F u nction  There are numerous w ays to classify and describ e the plethora of proposed clustering alg orithms, and our treatment of the sub ject here is far from complete. Because the alg orithms w e propose in our framew ork apply to a larg e class of clustering alg orithms – those dealing w ith centroids or ex emplar points – w e summariz e several clustering alg orithms relevant to our method and refer the interested reader to one of [X W 05, J TZ 04, Ber02, J M F 99, BB99] for more detailed surveys. In this section, w e fi rst look at tw o aspects that many clustering alg orithms, or clustering functions, share in common. The fi rst common aspect is the minimiz ation of a cost function, w hich w e discuss in detail in section 1.1.3. This cost function returns a value that indicates how w ell partitioned the data is; a low er value of the cost function usually indicates a b etter clustering . The second aspect, and one directly relevant to our proposed framew ork , is that many clustering functions operate in tw o stag es; the fi rst stag e creates some model describ ing the clusters and the second partitions the data. W e visit this in section 1.1.1. Then, in section 1.1.2, w e summariz e several diff erent classes of clustering alg orithms and present details of some of the most common ones.  2  Data → CS → C lu ste r S tatistic s → CP → A ssig n m e n t F ig ure 1.1: The tw o stag es of the clustering function.  1 .1 .1  Stag es of th e Clu stering F u nction  Anticipating the method w e introduce in the nex t chapter, w e note that many clustering alg orithms can b e describ ed as a tw o stag e process, as show n in F ig ure 1.1. The fi rst stag e is g enerating a set of statistics describ ing the data, and the second stag e uses those statistics to partition the data.1 F ormally, suppose C (K, X ) is a clustering function. In most cases, w e can ex press this as a nested function, so C (K, X ) = CP (CS (K, X ), X )  (1.1)  The fi rst stag e, CS , processes the data and outputs information or statistics ab out the clustering . The second stag e, CP , uses this information to partition the data points into clusters. W e represent this g raphically in F ig ure 1.1. F or ex ample, k-means g enerates a set of centroids and points are assig ned to the closest centroid. In this case, either the centroids or the point-tocentroid distance matrix could b e treated as summary statistics produced b y CS . Additionally, many model-b ased clustering methods, such as E M w ith G aussian mix ture models, produce a mix ture model representing the density of the data, and the mix ture model parameters completely describ e the clustering . H ierarchical alg orithms often use such statistics of the data to perform the merg ing or dividing steps. This idea is used ex plicitly in the information b ottleneck approach orig inally proposed b y Tishb y [TP B00, Slo02, ST00a]. The main idea is to fi nd a limited set of code w ords that preserves the most information ab out the data. W hile this approach applies more g enerally than to just clustering , it has inspired some eff ective clustering alg orithms [TS00], particularly for document clustering [ST00b ] and imag e clustering [G G G 02]. 1  Technically, all clustering functions can be divided this way because the statistics could be a labeling of the data. While this makes the latter partitioning step trivial, we put no constraints on the form of the statistics used to generate the information to keep this description sufficiently general.  3  1 .1 .2  Com m on Clu stering Alg orith m s  There are several categ oriz ations that are useful in disting uishing the numerous ex isting clustering alg orithms. The disting uishing criteria w e use here are hierarchical versus partitional. H ierarchical alg orithms b uild a dendog ram, or tree, representing a hierarchy of nested partitions, b y either repeatedly merg ing smaller clusters or dividing larg er clusters into parts. Ag g lomerative-type alg orithms start w ith smaller clusters and successively merg e them until some stopping criteria is met. P artitional alg orithms, on the other hand, separate the data into the clusters in one pass. H ierarchical Algorithms The most b asic hierarchical clustering alg orithm is ag g lomerative; not surprising ly, it is called ag g lomerative clustering . It can apply to any type of data w here the distances b etw een points can b e defi ned and is used not only for clustering points in E uclidean space b ut for clustering other types of data such as tex t [ST00a, Z K 02]. It starts w ith each data point b eing a sing le cluster. At each stag e, the closest pair of points or clusters is merg ed, ending w ith a sing le cluster and a tree shaped representation of the structure in the data. Closest is usually defi ned according to one of three link ag e measures. Sing le link ag e is the smallest distance b etw een any point in one cluster and any point in the other. This is fast, b ut tends to b e hig hly sensitive to random eff ects in the data and can often produce clusters that look more lik e strands of spag hetti than clusters. Averag e link ag e is the averag e distance b etw een all pairs of points w here one point is in the fi rst cluster and the other is in the second cluster; this often produces b etter clusters b ut is a slow er alg orithm. F inally, complete link ag e is the farthest distance b etw een points in one cluster and points in the other; this is alg orithmically as fast as sing le link ag e, b ut often produces tig hter, more compact clusters. Another w ell-studied class of ag g lomerative clustering alg orithms uses prob ab ilistic models of the clustering and merg es b ased on how w ell the resulting posterior distrib ution w ould refl ect the data in the resulting cluster. The vanilla version [IT95] simply uses the lik elihood of the resulting model; a more developed version proposed b y H eller and G hahramani [H G 05] uses hypothesis testing . W hile these methods w ork for g eneral models and data types, they have found the most use in tex t and document clustering as tex tual data is more easily represented b y prob ab ilistic models than b y points in E uclidean space. F or a more detailed discussion of these alg orithms, and  4  other alg orithms for tex tual data, w e refer the reader to [Z K 02], [H M S02] or [SK K 00]. Divisive hierarchical clustering alg orithms recursively divide the data into tw o or more partitions, thus b uilding the dendog ram from the root node and ending w ith the leaves. There are several alg orithms desig ned specifi cally for this purpose [Bol98]. Another simple method, thoug h, is to apply a partitional clustering alg orithm w ith only a handful of clusters to divide the data, then recursively apply it to each of the partitions. Partitional Clustering Algorithms By far the most popular partitional clustering alg orithm is k-means [Ste06]. This is not surprising , as the alg orithm has several nice properties. It is q uite simple; it can b e b oth describ ed and coded in a few lines of code. The k-means alg orithm is also computationally q uite fast and is one of the few clustering alg orithms that scales w ell to millions or even b illions of points w ith parallel implementations [SB99]. The alg orithm w ork s remark ab ly w ell on a variety of data (thoug h more specializ ed alg orithms can readily b eat it). Theoretically, the alg orithm attempts to minimiz e the sq uared error cost function, discussed in more detail b elow . This formulation is conducive to theoretical analysis, so the alg orithm is q uite w ell studied. K-means w ork s b y iteratively assig ning points to the nearest centroid, then repositioning those centroids to the mean of the points. Centroids are usually initializ ed randomly, thoug h there are more sophisticated methods b ased on density estimate that yield b etter results and/ or faster converg ence times [AV 07]. Another popular class of partitional clustering alg orithms are spectral clustering alg orithms, w hich also operate on a g raph structure of points [K V V 04, V M 01, N J W 02]. The central idea is that the values of the principle eig envector of the inter-point distance matrix can b e used to g roup the points – g roups of similar points tend to have similar values in the principle eig envector. This idea has also b een ex tended to directed g raphs [M P 07]. The minfl ow -max -cut alg orithm from optimiz ation theory, can b e used to cluster the points b y partitioning the inter-point distance matrix so as to max imiz e the distances cut, thereb y minimiz ing the distances w ithin partitions [DH Z + 01, F TT04]. Additionally, statistics such as the sum of edg e leng ths w ithin clusters also refl ects the connectivity of the g raph, and the q uality of the resulting partitions, in a w ay that distance matrices do not.  5  1 .1 .3  Th e Sq u ared Error Cost F u nctions  M any clustering alg orithms, e.g . k-means, attempt to fi nd a partition that minimiz es a g iven cost function. K-means uses the sq uared error cost function, w hich w e defi ne formally b elow . Because w e use this ex tensively later on, particularly in chapter 6, w e formally show some of its g eneral properties here. W e reserve the full analysis, how ever, to chapter 6. Defi nition 1.1.1. Sq uared E rror Cost Function (p) (p) (p) L et X (p) = (x1 , x2 , ..., xn ) b e a list of n p-dimensional points. Then (p) the sq uared error cost of X is n  cost X (p) =  2 2  (1.2)  (xiq − µq )2  (1.3)  xi − m i=1 n  p  = i=1 q=1  w here µq =  1 n  n  xiq  (1.4)  i=1  is the mean of the qth component of all the points in X (p) . If p = 1, this simply b ecomes n  (xi − m)2  cost X (1) =  (1.5)  i=1  Defi nition 1.1.2. Sq uared E rror Cost Function for Partitions (p) (p) (p) L et X (p) = x1 , x2 , ..., xn b e a set of n p-dimensional points, and let P = (P1 , P2 , ..., PK ) b e a partitioning of those points into K clusters. The sq uared error cost function is then just the sum of the costs of each partition: cost X (p) , P =  cost  xi ∈ X (p) : i ∈ Pk  (1.6)  k  =  x i − µk  2 2  k i∈Pk  w here µk is the mean of the points in the kth partition. 6  (1.7)  Theorem 1.1.3. Sep arability of the Sq uared E rror Cost Function into Comp onents. T h e sq u ared e rro r co st fu n c tio n is se p arab le in to d im e n sio n co m p o n e n ts. (p) (p) (p) S p ec ifi cally , su p p o se X (p) = x1 , x2 , ..., xn is a se t o f p-d im e n sio n al p o in ts, an d le t Xq(p) = xq : x ∈ X (p) .  (1.8)  T h en p  cost X  (p)  cost Xq(p)  =  (1.9)  q=1  P ro o f. p  n  cost X (p) =  (xiq − µq )2  (1.10)  q=1 i=1 p  cost Xq(p)  =  (1.11)  q=1  Theorem 1.1.4 . Invariance of Cost Function U nder U nitary Linear Transformations (p) (p) (p) L e t A be a p × p u n itary m atrix , an d le t X (p) = x1 , x2 , ..., xn be a se t o f n p-d im e n sio n al p o in ts. T h e n cost  Ax : x ∈ X (p)  7  = cost X (p)  (1.12)  P ro o f. n  cost  Ax : x ∈ X  (p)  =  2 2  (1.13)  (A(xi − m))T (A(xi − m))  (1.14)  (xi − m)T AT A(xi − m)  (1.15)  (xi − m)T (xi − m)  (1.16)  Axi − Am i=1 n  = i=1 n  = i=1 n  = i=1  = cost X (p)  (1.17)  Another nice property of the sq uared error cost function is that it can b e ex pressed as a linear function of the total distance b etw een all the pairs of points. W e present this in the follow ing theorem. Theorem 1.1.5 . T h e sq u ared e rro r co st fu n c tio n can be e x p re ss in te rm s o f th e d istan ce s be tw ee n e v e ry p air o f p o in ts. S p ec ifi cally , cost X (p) =  xi − x ¯  2 2  =  i  1 2n  xi − xj  2 2  (1.18)  i,j  P ro o f. It is easier to start w ith the distances b etw een pairs of points and show that it eq uals the orig inal form. By defi nition, w e have that 1 2n  (xi − xj )2 = i,j  =  =  1 2n 1 2n  ((xi − x ¯) − (xj − x ¯))2  (1.19)  i,j  (xi − x ¯)2 + (xj − x ¯)2 − 2(xi − x ¯)(xj − x ¯) i,j    1  2n  2n(xi − x ¯)− i  i  8    2n(xi − x ¯)   j  (1.20)   2n(xj − x ¯).  (1.21)  H ow ever, 1 n  xi − n  (xi − x ¯) = i  i  xi = 0  (1.22)  i  so 1 2n  (xi − xj )2 = i,j  (xi − x ¯ )2 .  (1.23)  i  Thus the theorem is proved. Corollary 1.1.6 . T h e co st fu n c tio n fo r m u ltip le p artitio n s can be e x p re ssed in te rm s o f th e d iff e re n ce be tw ee n all p airs o f p o in ts w ith in eac h o f th e p artitio n s. S p ec ifi cally , cost X (p) , P =  x i − µk  2 2  k i∈Pk  =  1 2  k  1 |Pk |  xi − xj  2 2  (1.24)  i,j∈Pk  P ro o f. This follow s immediately from theorem 1.1.5 and the defi nition of the cost function for multiple partitions. Corollary 1.1.7 . Invariance of the cost function to constant shifts (p)  (p)  (p)  L e t u be an y p-d im e n sio n al p o in t. L e t X (p) = x1 , x2 , ..., xn be a se t o f n p-d im e n sio n al p o in ts, an d le t Y (p) = x1 + u, x2 + u, ..., xn + u. L e t P be a p artitio n in g o f th o se p o in ts in to K p artitio n s. T h e n cost X (p) , P = cost Y (p) , P  (1.25)  P ro o f. This follow s directly from corrolary 1.1.6 b y ob serving that the cost function is ex pressed entirely as the diff erence b etw een points. Thus w e have that cost Y (p) , P reduces trivially to cost X (p) , P .  1 .2  Clu ster Valid ation  O nce a clustering is found, one should ask how w ell it refl ects the natural structure of the data b efore trying to interpret it in the contex t of the ex periment. It is critical to detect if the clustering is spurious – i.e. if the  9  structure implied b y the clustering is a random artifact of the data and not part of the true underlying distrib ution. Ultimately, techniq ues for cluster validation need to aid in answ ering tw o overlapping q uestions reg arding q uantity and q uality. F irst, does the numb er of proposed clusters accurately refl ect the data [M B02]? Second, how representative of the modes of the underlying distrib ution is the clustering ? F or ex ample, one mig ht w ant to k now how w ell the data supports tw o clusters b eing distinct or w hether all the sig nifi cant modes in the data are represented b y a cluster. M odel b ased clustering methods fi t a mix ture model to the data, representing each “ cluster” b y one component of the mix ture model. These methods produce, for each point, a distrib ution over cluster memb erships formed b y normaliz ing the prob ab ilities that the point w as draw n from each of the components. This so called soft assig nment matrix provides a natural w ay to assess how w ell the mix ture model refl ects the data; the ideal fi t w ould result in all points having a much hig her prob ab ility of b elong ing to one component than to any others. O ther techniq ues involve assessing the clustering in terms of stab ility, or the resistance of the clusters to reasonab le perturb ations to the data. The idea is that a clustering that refl ects the true underlying distrib ution of the data w ill b e resistant to such perturb ations, b ut clusters that simply refl ect spurious and random eff ects w ould not. The g eneral process, then is to cluster the data multiple times, each time w ith diff erent perturb ations. If the resulting set of clustering s is consistent, then the orig inal clustering is presumed to b e stab le. W e discuss this more in the nex t section. O ne aspect of cluster validation is ensuring that a proposed clustering correctly identifi es the numb er of clusters. In simulations, this can b e tested b y check ing how often a cluster validation techniq ue show s that clustering the dataset into the correct numb er of clusters yields a b etter clustering than clustering it into an incorrect numb er of clusters. The prob lem of predicting the correct numb er of clusters is often phrased as a model selection prob lem [TW 05], and is g enerally accepted in the literature as a method for comparing various cluster validation techniq ues. In chapter 5, w e compare the accuracy of our approach ag ainst that of other clustering validation techniq ues on this prob lem.  1 .2 .1  Clu stering Stab ility  M any stab ility b ased techniq ues for cluster validation have b een proposed. The k ey idea of stab ility b ased cluster validation is to perturb the data 10  set, usually b y sub -sampling the data or adding noise, then clustering the perturb ed data [BH E G 02, H en04, G T04]. The primary idea b ehind cluster stab ility is that a clustering solution should b e resistant to chang es, or perturb ations, in the data that one w ould ex pect to occur in a real process. The stab ility of the clustering is determined b y analyz ing the similarity of the clustering across data perturb ation runs [BH E G 02] or b etw een the orig inal data and the perturb ed data, usually using a type of index desig ned for this purpose (see section 1.2.2). Typically, this is simply the averag e of the stab ility indices produced b y the seq uence of data perturb ation runs, b ut other summaries are possib le. There are numerous papers on the b est w ay to construct this procedure. Ben-H ur et. al. [BH E G 02] randomly draw s a certain percentag e of all the data points, clustering each, and compares each resulting clustering to all the others. They then form a histog ram of the resulting stab ilities and use that to chose the cluster. O ther variants of this idea have also b een proposed [AL A+ 03, SG 03]. As a slig htly diff erent variation, Breck enridg e proposes measuring replication and consistency using a type of cross-validation on the orig inal data [Bre89]. H e fi rst tries to identify consistently corresponding clusters b etw een multiple runs, then measure the prediction streng th on these consistently recurring clusters. This idea is ex tended b y L ang e, R oth, Braun, and Buhmann in a series of papers [L BR B02, L R BB04, R L BB02]. M oller and R adk e propose a variant b ased on using nearest neig hb ors to g uide the resampling [M R 06]. Their idea is to split the dataset into tw o diff erent parts b y splitting the nearest neig hb or pairs, then ask how w ell a clustering on one part predicts a clustering on the other. F ridlyand and Dudoit also b uild on this idea, instead using a hierarchical procedure b ased on repeatedly splitting a dataset [F D01, DF 02]. Despite all these techniq ues to perturb the data, it’s unclear how often these stab ility indices are actually used in practice [H en04]. P erhaps this is b ecause a sing le scalar index is a fairly limited assessment of the properties of the clustering . Ben-H ur et. al. [BH E G 02] uses histog rams of the stab ility indices to present more information, b ut it is unclear if this is sig nifi cantly more useful. An ideal summary w ould also include statistics indicating the stab ility of each of the clusters as w ell as the entire clustering , b ut few , if any, data perturb ation techniq ues are ab le provide this information.  11  Critiq ues of Cluster Stability Analysis Cluster stab ility analysis, w hile conceptually straig htforw ard and commonly used in practice, does have some major fl aw s. In particular, Ben-David [BDvL P 06] points out that a hig hly stab le clustering implies that clustering is g ood. F or ex ample, k-means w ith 2 means w ill usually converg e to a consistent solution w hen applied to three clusters in w hich tw o are suffi ciently w ell separated from the other or w hen applied to a sing le elliptical cluster. Thoug h the latter case could b e handled b y a parsimony or model complex ity arg ument, the former ex ample w ould counter that. Shamir and Tishb y, how ever, counter this [ST07] b y relating stab ility to measures of g eneraliz ation in model selection, arg uing that the rate of converg ence of these measures is the important criteria. This, they arg ue, justifi es the use of stab ility for cluster validation. Another prob lem, how ever, has b een pointed out b y Tib shirani and W alther in [TW 05]. In particular, if the true numb er of clusters is larg e, the method does not have suffi cient resolution to disting uish b etw een k and k+1 possib le clusters. This prob lem, it seems, is systemic in these techniq ues; how ever, to counter it, may arg ue that k now ing , for ex ample, that predicting that a data set has 15 clusters w hen it actually has 16 is less g rievous a mistak e than predicting a clustering has 2 clusters w hen it actually has 3. N evertheless, Tib shirani’s ob servation lends more w eig ht to our previous claim that summariz ing an entire clustering b y a sing le scalar stab ility index can easily hide important properties of the clustering . Predicting the N umber of Clusters To predict the numb er of clusters in a dataset, one w ould w ould calculate a scalar assessment of the clustering s for a rang e of k, then choose the one that is “ b est” according to some specifi ed criteria. In the data stab ility case, the scalar assessment is usually the stab ility of that clustering . If additional information is availab le, such as the standard deviation of the assessment, more sophisticated methods of choosing k can also b e used, e.g . the smallest one w ithin one standard deviation of the b est. In the literature, it seems that the canonical method for comparing techniq ues is in terms of the accuracy of how w ell they predict the numb er of clusters in a dataset w here the true value of k is k now n. W hile one could arg ue that g ood performance on this type of prob lem may not b e a g ood predictor of its performance in many real-life cluster validation scenarios, w hich is lik ely true, it is perhaps the b est method of comparison out there  12  g iven the discussed limitations of ex isting techniq ues. Thus it is also the method w e use in chapter 5 to compare our proposed cluster validation approach ag ainst current methods. F urthermore, choosing the correct numb er of clusters does come up often, and there are techniq ues developed specifi cally for that, e.g . the g ap statistic, describ ed in section 1.3.  1 .2 .2  Clu stering Sim ilarity Ind ices  Because cluster stab ility analysis depends on how measuring the consistency of a set of clustering s, a k ey sub prob lem that has also received w idescale attention is how to accurately compare clustering s. N umerous methods have b een proposed in the literature. W hile there are literally doz ens of indices, w e focus on tw o here, the H ub ert-Arab ie Adjusted R and Index [H A85], w hich w e denote b y AR, and the V ariation of Information, VI [M ei07, M ei03]. The AR index has a relatively long history in cluster analysis, w hile VI is more recent. [M C85] compare many diff erent indices for measuring the diff erence b etw een tw o partitions in a set of data and end up recommending the H ub ert-Arab ie adjusted R and index . H ow ever, w hile it is arg uab ly the most popular, w e actually found the V ariation of Information to perform b etter in our simulations. W e fi rst discuss the matching matrix , something used commonly b y similarity indices and these in particular. W e then discuss the details of each index , and w e include connections to the prob ab ilistic formulations w hich can b e readily incorporated into our index . The Matching Matrix O ne of the common representations of the overlap b etw een tw o clustering s is A a so-called matching matrix or confusion tab le. If C A = C1A , C2A , ..., CK A A B B and C B = C1B , C2B , ..., CK B , let njk b e Cj ∩ Ck , that is, the numb er of  points in common b etw een clusters CjA and CkB . Then this matching matrix is just MAB = [njk ]j=1,2,...,K A ,k=1,2,...,K B  (1.26)  If the tw o clustering s are represented b y assig nment matrices AA = [aA jk ] and AB = [aB ], w here a = 1 if point j is a memb er of cluster j and 0 jk jk otherw ise, than this matching matrix is just  13  MAB = ATA AB  (1.27)  B B A L et nA j = Cj and let nk = Ck . N ote that these statistics match up to the total counts in row s and columns of the matching matrix :  nA j =  njk  (1.28)  njk  (1.29)  k  nB k = j  The majority of the cluster stab ility indices rely on this matching matrix and particularly these summaries. W e g o on to present tw o such indices in the nex t tw o sections. H ubert-Arabie Adjusted R and Index The H ub ert-Arab ie Adjusted R and Index is b ased on the R and index proposed in 1971 [H A85] and measures stab ility b y look ing at pairw ise assig nments b etw een clustering s. G iven tw o clustering s C A and C B on a dataset X w ith n points, w e can separate the n(n − 1) possib le pairs of points into four categ ories and count the numb er of pairs in each: NA NB NC ND  N umb er of pairs w hose points in the same clusters (ig noring cluster lab els) in C A and C B . N umb er of pairs w hose points are in the same cluster in C A b ut diff erent clusters in C B . N umb er of pairs w hose points are in diff erent clusters in C A b ut the same cluster in C B . N umb er of pairs w hose points are in diff erent clusters in b oth C A and C B .  Using these defi nitions, the R and index is defi ned as C A, C B ) = R(C  NA + ND NA + NB + NC + ND  (1.30)  C A , C B ) is 1 if b oth clustering s are identical. it is easy to see that R(C Computing NA , NB , NC , and ND can b e done naively in O n2 time. H ow ever, b y using alternate statistics of the clusters, w e can calculate it  14  much more effi ciently. Using the notation used to defi ne the matching matrix in the previous section, w e have that njk 2  (1.31)  NA =  j,k  NB =  j  nA j 2  − NA  (1.32)  NC =  k  nB k 2  − NA  (1.33)  n 2  ND =  − NA + NB + NC  (1.34)  w hich can b e calculated easily. H ow ever, the main issue w ith the R and index is that its ex pected value depends on n and tends tow ard 1 as n increases [Ste04, H A85]. This mak es results on diff erent datasets more diffi cult to compare and limits the pow er of the method for larg e n. Thus in 1986, H ub ert and Arab ie proposed a modifi ed index that solved this prob lem: C A, C B ) = AR(C  C A , C B ) − E [R(C C A , C B )] R(C C A , C B ) − E [R(C C A , C B )] max R(C  (1.35)  To calculate the ex pectations, they assumed the partitioning s came from a g eneraliz ed hyperg eometric distrib ution. Under this assumption, it can b e proved [H A85] that   −1 A n n nB j k  C A, C B ] =  E [RC . (1.36) 2 2 2 j  k  The proof, how ever, is leng thy and tedious and w e omit it here. W ith some simple alg eb ra, AR can then b e ex pressed as  C A, C B ) = AR(C  j,k 1 2  j  nA j 2  njk 2  +  − k  nB k 2  j  nA j 2  −  k j  nA j 2  nB k 2  n 2 k  nB k 2  −1  n 2  −1  (1.37) This ex pression can b e calculated effi ciently g iven the partitioning s C A and CB. Because the H ub ert-Arab ie Adjusted R and Index depends on partition counts, w e cannot apply the index directly w hen our input is in terms of prob ab ilities or distrib utions over point assig nments. In this case, w e can 15  use an n-invariant form of the index as g iven b y [Y R 01] and [M ei07]. The idea is to represent each row of the soft assig nment matrix b y m draw s from a multinomial distrib ution w ith w eig hts g iven b y that row . The similarity index b etw een the tw o soft partitioning s of n1 , n2 points is then approx imated b y the similarity b etw een mn1 , mn2 points w ith a hard partitioning , an the approx imation b ecomes ex act as m → ∞ . L et pjk b e the prob ab ility of a point b elong ing to cluster j in C A and cluster k in C B , and let pA j b e the prob ab ility of a point b elong ing to cluster j in C A and similarly for pB k . Then the n-invariant form is g iven b y [M ei07]  j  PAR = 1 2  j  pA j  2  k  p2jk −  +  k  j  pB k  2  pA j  2 k  −  pB k  pA j  j  2  2 k  pB k  2  (1.38) Ag ain, the derivations is tedious and w e omit it here. Variation of Information Some scalar similarity indices rely on prob ab ilistic summaries of the clustering , e.g . the empirical prob ab ility of a point b eing in CjA in clustering C A and CkB in clustering C B . The V ariation of Information is one such index . The V ariation of Information measures the sum of information lost and information g ained b etw een the tw o clustering s: C A , C B ) = H(A) + H(B) − 2M I (A, B) VI(C  (1.39)  w here H(A) is the entropy of A, and M I (A, B) is the mutual information b etw een A and B. W hen using a hard clustering , w e can g et the prob ab ilities needed empirA A ically using statistics from the matching matrix . N ow pA j = nj /n denotes the prob ab ility that a point is assig ned to cluster j in C A , and lik ew ise B B pB k = nk /n denotes the prob ab ility that a point is assig ned to cluster k in C B . L ik ew ise, pjk = njk /n is the prob ab ility that a point is assig ned to cluster CjA in clustering C A and cluster CkB in C B . Then M eil˘a’s V ariation of Information is: B pB k log pk −2  A pA j log pj −  C A, C B ) = − VI(C j  pjk log j  k  16  k  pjk (1.40) B pA j pk  This can easily b e calculated.  1 .3  G ap Statistic  O ne popular w ay of choosing numb er of clusters is using the so-called g ap statistic proposed b y Tib shirani [TW H 01]. The g ap statistic uses the difference in total cluster spread, defi ned in terms of the sum of all pairw ise distances in a cluster, b etw een the true dataset and the clustering s of several reference distrib utions w ith no true clusters. F or ex ample, in E uclidean space and w ith the sq uared error distance metric, the spread is the total empirical variance of all the clusters. The reference datasets are used primarily to adjust for the dependence on k in the measure, b ut also to g uard ag ainst spurious clustering s as random structure w ould presumab ly b e present in the reference datasets. H ow ever, g enerating a null dataset is not necessarily easy, as the fi nal value can still b e hig hly dependent on the distrib ution of points. W hile much of this is not w ell understood, Tib shirani [TW H 01] proposes using the uniform distrib ution w ithin the P CA rotated b ounding b ox of the orig inal dataset, arg uing that a uniform distrib ution is most lik ely to contain spurious structure. F ormally, the w ithin-cluster spread of points is the sum of the distance b etw een each point w ithin a cluster to every other point in the same cluster divided b y the numb er of points: spread (C) =  1 2nj  dist xi , xi  (1.41)  i,i ∈Cj  The most common choice for the distance measure is the E uclidean sum of sq uares distance: dist xi , xi = xi − xi 22 ,  (1.42)  b ut other distance measures can also b e used. N ote that in this case, theorem 1.1.5 tells us that the spread is the same as the empirical variance of the cluster around the mean of the points. W e present the procedure more formally as a three step process: 1. F or each k in a g iven rang e of k’s, cluster the data into k clusters C = {C1 , C2 , ..., Ck }. 2. G enerate Nb reference datasets and cluster each to form Nb sets of clusters C1 , C2 , ..., CNb . The g ap statistic is then 17  1 G ap (k) = Nb  Nb  log (spread (Cb )) − log (spread (C))  (1.43)  b=1  The purpose of the log s is to essentially use the fractional diff erence b etw een the g iven data and the reference instead of the ab solute difference, as log (a/b) = log a − log b. Because there is more dependence on k in spread (C) − spread (Cb ) than in spread (C)/ spread (Cb ), the log s are used here. b 3. If ¯b = N1b N b=1 log (spread (Cb )), compute the unb iased standard deviation of the b aseline clustering s:  sk =  1 Nb − 1  Nb  2  log (spread (Cb )) − ¯b ,  (1.44)  b=1  ˆ using one of tw o criteria: then estimate the numb er of clusters K ˆ = smallest k such that G ap (k) ≥ G ap (k + 1) − sk+1 . A. K ˆ = smallest k such that G ap (k) ≥ G ap ( ) − s , w here B. K ˆ may eq ual . arg max G ap ( ). N ote that K  =  The intuition b ehind A is that at some point G ap (k) levels off and stops improving w ith increasing k. The intuition b ehind the second one invok es a parsimony principle – the simplest model w hich adeq uately ex plains the data, as determined b y how w ell the b est model ex plains it, is prob ab ly correct. W hile [TW H 01] proposes using A, w e actually found B to w ork b etter on some of the simpler prob lems so w e include it here. W e study w hen each holds in detail in chapter 5. In our results section, w e compare our method ag ainst several fl avors of data perturb ation methods and the g ap statistic. The g ap statistic performs surprising ly w ell, thoug h it b eg ins to b reak dow n w hen the clusters do not have G aussian shapes. In this case, it seems that our methods, describ ed in the follow ing chapters, perform the b est.  18  Ch ap ter 2  A Bay esian F ram ew ork for Clu stering Stab ility In this chapter, w e propose an ab stract framew ork that can, at a hig h level, b e applied universally to almost any clustering function. In our approach, w e modify the clustering function to tak e a hyperparameter that q uantitatively introduces some type of perturb ation, then integ rate over a prior on the hyperparameter to ob tain a prob ab ilistic soft assig nment matrix of points to clusters. This matrix g ives the prob ab ility that a point i is assig ned to a cluster j under perturb ation.1 To our k now ledg e, this approach is uniq ue. G iven this matrix , w e propose a set of summariz ing statistics. These allow us to determine the b ehavior of specifi c points and clusters, rank clusters in terms of stab ility, q uantify interactions b etw een clusters, and compare clustering s w ith an overall stab ility index . M any of these tools – particularly those g iving information ab out individual clusters w ithin the clustering – are uniq ue to our method. W e b eg in b y formally describ ing our approach and introducing the perturb ation induced soft assig nment matrix , w hich w e denote here and in sub seq uent chapters using Φ = [φij ]. W hat w e propose here is a framew ork that ab stracts aw ay many of the implementation details, namely the aspects of g enerating Φ that depend on particular clustering alg orithms. The rest of the chapter assumes that those issues are w ork ed out (w hich w e do in chapter 3). W e then introduce a set of tools and summary statistics b ased on Φ that aid in analyz ing and validating a clustering . W e end this chapter b y discussing possib le ex tensions that incorporate previously studied validation 1 This is in contrast to the soft assignment matrix that comes from model based clustering, which gives the probability that point i originated from jth mix ture component.  19  indices. W hile this chapter focuses on the ab stract framew ork , in the nex t chapter w e use this framew ork to develop specifi c alg orithms for clustering validation. The most critical aspects of this are choosing the type of perturb ation and selecting a g ood prior, and w e discuss these issues only superfi cially here. In chapter 5, w e show , in a massive test on synthetic data, that the proposed validation alg orithms proposed in this framew ork are as reliab le or b etter than several other leading methods in terms of determining the numb er of clusters, demonstrating the potential usefulness of the proposed framew ork . F inally, in chapter 6, w e discuss the asymptotic b ehavior of our method as the dimension increases, sug g esting that it is appropriate for use in hig h dimensions.  2 .1  Th e Ab stract F ram ew ork  Suppose C (K, X ) is a clustering function that partitions a set of n data X = {x1 , x2 , ..., xn } into a set {C1 , C2 , ..., CK } of K clusters b ased on some structural feature of the data. O ur approach is to create a new clustering function C (K, X , λ) b y modifying C (K, X ) to tak e a hyperparameter λ, w here the role of λ is (informally) to perturb some aspect of the clustering .2 The defi nitions in our framew ork mak e no assumptions ab out how the parameter aff ects the clustering function; the details w ill vary b ased on the clustering alg orithm and w hat type of stab ility one w ishes to assess. W e then defi ne a prior distrib ution π(λ|θ) over the perturb ation parameter index ed b y a hyperparameter θ. W hile the full prior selection issue depends on the role of λ in C , w e describ e a procedure in chapter 3 to choose an optimal θ using various aspects of the clustering distrib ution. W ithout loss of g enerality, suppose the clustering function C (K, X ) returns an n×K assig nment matrix g iving the assig nment of points to clusters. Thus A = [aij ] = C (K, X )  (2.1)  In the hard clustering case, w e assume that aij eq uals 1 if xi is in cluster C and 0 otherw ise.3 F or soft clustering alg orithms, each row of A is a 2 N otationally, we use a superscript to denote a part of the procedure that has been modifi ed to accommodate the perturbation parameter λ 3 N otationally, we consistently use i to index the data point and j, k, and to index clusters.  20  distrib ution over the clusters g iving the partial memb ership of each point. N ote that these defi nitions imply that j aij = 1. L ik ew ise, suppose that, for a g iven instance of the random variab le λ, the perturb ed clustering function also C (K, X , λ) returns a similarly defi ned assig nment matrix . E x plicitly denoting the dependence on λ, w e have A (λ) = [aij (λ)] = C (K, X , λ)  (2.2)  F or our purposes, w e req uire C (K, X , λ) to b e a deterministic function of λ. H ow ever, this does not necessarily ex clude alg orithms such as k-means w hich inherently incorporate some deg ree of randomness as w e show later. Also, w e assume X and K are constants throug hout our procedure. Ag ain, this does not ex clude clustering alg orithms that return a variab le numb er of centroids, as w e do allow empty clusters.  2 .1 .1  Th e Av erag ed Assig nm ent Matrix  R ecall that the modifi ed clustering alg orithm C (K, X , λ) is a deterministic function of λ. This allow s us to integ rate out the parameter λ from aij (λ) w ith respect to π(dλ|θ). This g ives us an n × K matrix Φ such that φij ex presses the averag e memb ership of xi in cluster Cj under perturb ation. M ore formally, suppose C (K, X , λ) returns, for each point of λ, an n×K assig nment matrix , and let π(λ|θ) b e a prior distrib ution over λ. Then the averag ed assig nment matrix Φ is g iven b y  Φ = [φij ] =  A (λ)π(λ; θ)dλ.  (2.3)  N ote that if one interprets the assig nment matrix A (λ) as a prob ab ility distrib ution over the lab els, i.e. aij (λ) = p(aij |λ),  (2.4)  then eq uation (2.3) has the form of a Bayesian posterior distrib ution. This eq uation formaliz es the core concept of our framew ork . The integ ration spreads the b inary memb ership matrix A (λ) across the clusters b ased on the b ehavior of those points under perturb ation. E ach row of Φ can b e interpreted as a prob ab ility vector such that φij indicates the prob ab ility that datum xi b elong s to cluster j w hen perturb ed. If the perturb ation type and corresponding prior are w ell chosen, this prob ab ility matrix provides sig nifi cant information ab out the b ehavior and stab ility of the clustering . 21  2 .1 .2  Th e Match ing Matrix  W e can defi ne the averag ed matching matrix M = [mjj ] in terms of Φ: M = AT Φ ⇔ mjk =  i  aij φik =  i:aij =1  φik  (2.5)  In this matrix , mjk represents the total point-mass (each point having a mass of 1) in the unperturb ed cluster Cj that moves to cluster Ck under perturb ation. In an analog ous w ay to the matching matrix for comparing tw o clustering s (see section 1.2.2), mjk / |Cj | (normaliz ing M across row s) is the prob ab ility of a point in the unperturb ed cluster Cj b elong ing to cluster Ck under perturb ation. L ik ew ise, mjk /n is the prob ab ility that a randomlyselected point b elong s to cluster Cj in the unperturb ed clustering and to cluster Ck under perturb ation.  2 .1 .3  P ertu rb ations and Lab el Match ing  As w e mentioned ab ove, one assumption req uired in our framew ork is that C (K, X , λ) is a deterministic function of the parameter λ. H ow ever, this is not necessarily enoug h to g uarantee that the integ ral in eq uation (2.3) g ives a sensib le answ er. W e discuss here one reason that is often referred to as the lab el matching prob lem. The lab el matching prob lem refers to the fact that many alg orithms may g ive similar answ ers across diff erent runs b ut only modulo a permutation of the lab el indices. As a simple ex ample, tw o runs of random-start k-means may g ive identical fi nal locations for the centroids, b ut the centroid g iven the index 1 in the alg orithm on the fi rst one may have the index 5 on the second. This means that the points associated w ith one centroid in the fi rst run may b e associated w ith a diff erently lab eled centroid in the second. E ven w ithout randomness in the clustering alg orithm, many w ays of introducing perturb ations into the clustering function w ill cause the cluster lab els to sw itch g iven diff erent values of λ. The lab el matching prob lem is inherent in the data perturb ation methods and has received some attention. There are g enerally tw o w ays of dealing w ith it. The fi rst is to rely on scalar stab ility indices such as those describ ed in section 1.2.2; these are invariant to permutations of the lab els and thus provide a w ay to compare clustering s w hen the lab el matching prob lem is an issue. The other w ay is to permute the lab els to max imiz e their similarity [Bre00, L R BB04]. This corresponds to solving a b ipartite g raph matching prob lem b etw een the tw o sets of clusters lab els, w here the edg e w eig hts are proportional to the numb er of shared points. F inding the optimal matching 22  can b e done in O(K 3 ) time using the H ung arian method [K uh55] or netw ork fl ow simplex [Chv83]. Avoiding Label Matching Comp letely To handle the lab el matching prob lem in our framew ork , w e have a third option in addition to these tw o. Introducing the perturb ation and hyperparameter into the CP step, so C (K, X , λ) = CP (CS (K, X ), λ), provides an alternate w ay around this prob lem. Because the cluster statistics are already calculated, perturb ations w ill not mix up the lab els. This avoids the lab el matching prob lem completely, something not possib le w ith dataperturb ation methods. Should w e w ant to perturb the clustering function in a w ay that req uires matching up the lab els, w e can introduce a K ×K permutation matrix P(λ) to formally ex press this. This permutation matrix matches up the lab els at each point of λ. E q uation (2.5) then b ecomes: Φ = [φij ] =  A (λ)P(λ)dΠ (λ)  (2.6)  Λ  w here P(λ) = arg max trace AT A (λ)Q  (2.7)  Q∈P  The defi nition for eq uation (2.5) w hich follow s remains unchang ed. W hile this is tractab le w hen testing a small set of λ, how ever, it can still pose a sig nifi cant prob lem.  2 .2  Visu aliz ations and Statistical Su m m aries of th e Av erag ed Assig nm ent Matrix  In this section, w e discuss w ays of ex tracting useful information ab out the clustering from the averag ed assig nment matrix . The fi rst techniq ue w e present is a picture of how the clusters g ive and tak e points under perturb ation. W e then ex pand this to include various statistical summaries of these properties.  2 .2 .1  H eatm ap P lot of Φ  W e present here a simple w ay to intuitively visualiz e the b ehavior of a clustering , b y plotting a rearrang ed form of Φ as a heat map. This heat map 23  req uires the cluster lab els to b e matched up; thus it is computationally diffi cult using data perturb ation methods and, w hile it is a simple procedure to display the clustering s, to our k now ledg e it has not b een previously introduced. G iven the averag ed assig nment matrix and the unperturb ed assig nment, w e construct the heat map that separates out the row s b y their assig nment in the unperturb ed clustering and the prob ab ility they are assig ned to that same cluster under perturb ation. An ex ample of how w e do this is show n in F ig ure 2.1. As can b e seen, it g ives us a visually appealing and informative picture of the b ehavior of the clustering under perturb ation. To b e precise, w e b uild an index mapping that rearrang es the row s of A so that the nonz ero elements are all in contig uous b lock s along the columns and these b lock s are arrang ed in descending order along the row s. W ithin the b lock s, w e permute the row s so the elements in the b lock Φ in descending order. W e then apply this index mapping to Φ and further refi ne it so that the row s w ithin each b lock are sorted b y the entry corresponding to the cluster to w hich the point is assig ned in the b aseline clustering . Alg orithmically, this can all b e ex pressed in tw o lines: hi = arg max aik  (2.8)  ψ = arg sort (hi + φihi , i = 1, 2, ..., n)  (2.9)  k  w here ψ is the resulting mapping index . To illustrate the properties of the averag ed assig nment matrix and the the corresponding heatmap, w e show a 2d toy ex ample in F ig ure 2.1. The specifi c perturb ations (scaled distance) and prior (location ex ponential) used to g enerate the plots are introduced in chapter 3; w e focus here on the intuitive understanding . In the heatmap plot, the b lock s along the “ diag onal” denote the stab ility of points relative to their assig ned clusters under perturb ation, and the “ off diag onal” b lock s represent memb ership in other clusters under perturb ation. E ach row show s how the corresponding point b ehaves w hen the cluster is perturb ed; the point mass in unstab le points w ill show up as red and orang e colors to the side of such b lock s. Dark red or b lack indicates separation b etw een clusters. This allow s us to visualiz e clearly how clusters ex chang e points w hen perturb ed.4 4  N ote that all comparisons must be done keeping in mind that only the rows are normaliz ed; comparisons within columns are not on the same scale and, though suggestive, are not formally justifi ed.  24  2  1  0  −1  −2  −3  −4 −4  −3  −2  0  −1  1  2  3  4  (a) C lustered S ample 1.0  1400  B  2  0.9  A  D  1200 0.8  1  1000  0.7  0  0.6 800  C  0.5  −1  600 0.4  E −2  400  −3  200  0.3  0.2  0.1  0 −4 −4  −3  −2  −1  0  1  2  3  (b) P ointwise S tability matrix  4  0.0 A  B  C  D  E  (c) H eatmap P lot  F ig ure 2.1: A 2d toy ex ample illustrating the use of the heatmap plot for analyz ing how clusters ex chang e point mass under perturb ation. (a) show s the clustered 2d sample. (b ) show s the sample w ith the pointw ise stab ility (eq uation (2.11)) of each (x ,y) coordinate plotted in the b ack g round, w ith w hite representing a pointw ise stab ility of 1 and b lack a pointw ise stab ility of 0. (c) show s the corresponding heatmap plot of the clustering .  25  As can b e seen, cluster separation or closeness can b e easily noted in the heatmap. F or ex ample, the 2d plot indicates that cluster E is w ell separated from cluster A; the heatmap plot show s this as A shares no point mass w ith E and vis-versa. Clusters A and B how ever, are not w ell separated, and ex chang e sig nifi cant point mass as illustrated b y the red and orang e (g ray) in b lock s (A,B) and (B,A) in the heatmap.  2 .2 .2  Scalar Stab ility Ind ices  W hile the heatmap presents the clearest picture of the properties of the clustering , statistics summariz ing the b ehavior of the clustering are often useful. W e here present a set of summary statistics that condenses the relevant information in Φ. W e here provide indices that serve several purposes. The pointw ise stab ility of a point, PW i , is a per-point stab ility index that indicates how w ell point i fi ts into the clustering . The cluster-w ise stab ility, CW k , summariz es the stab ility of each clustering . An inter-cluster index IC jk index es the separation of tw o clusters j and k. F inally, the averag e pointw ise APW summariz es the stab ility of the entire clustering . The primary motivating idea b ehind the w ay w e chose to condense the information is competitive loss. E ssentially, w e measure the stab ility of a point according to the “ closeness” of the competition. If the prob ab ility that a point b elong s to one cluster under perturb ation is nearly eq ual to the prob ab ility that it b elong s to a diff erent cluster, it lends evidence that the clustering is unstab le. L ik ew ise, if the prob ab ility that a point b elong s to a g iven cluster is close to one, implying that it is sig nifi cantly g reater than the prob ab ility of it b elong ing to any other cluster, it lends evidence that the clustering is g ood and fi ts the data w ell. Pointwise Stability W e defi ne the pointw ise stab ility of point i in terms of the diff erence in assig nment prob ab ility b etw een the cluster it is assig ned to in the unperturb ed clustering and the max imum prob ab ility of all the others. This, ultimately, g ives us a scalar measure of confi dence in the point’s assig nment. F ormally, hi = arg max aik  (2.10)  k  PW  i  = φihi − max φi =hi  (2.11)  F or a dataset w ith n points, this produces a vector of leng th n. W hile other 26  summary statistics mig ht b e possib le – e.g . the entropy of φi· (see b elow ) – w e fi nd this defi nition w ork s q uite w ell in practice. As can b e seen in F ig ure 2.1, the pointw ise stab ility imposes a type of partitioning on the space around the cluster centers. In this plot, w e fi x the location of the cluster centers and plot the pointw ise stab ility of each (x ,y) coordinate g iven the location of the cluster centers, i.e. w hat the pointw ise stab ility of a point at (x ,y) w ould b e. N ote that near the b oundary reg ions, the pointw ise stab ility is near z ero, w hereas near the cluster centers the pointw ise stab ility is near one. As mentioned, note that other summary statistics are also possib le. In particular, one could use the entropy of each row of Φ instead of the competitive loss as the resulting measure of stab ility. H ow ever, the pointw ise stab ility has the advantag e that it detects w hen the g iven clustering is w rong , i.e. the prob ab ility of a point b elong ing to the cluster it is assig ned to in the unperturb ed clustering is less than the prob ab ility of it b elong ing to a diff erent cluster. To see this, suppose that φi = 1 for = j (recall that j is the cluster it is assig ned to in the unperturb ed clustering ). In such a case, PW i w ould b e -1 b ut the entropy w ould b e the same as the correct case w here φij = 1. Stability Measures for Clusters W e defi ne the cluster-w ise stab ility as the averag e pointw ise stab ility of the points assig ned to it in the unperturb ed clustering . Specifi cally, CW  k  = mean PW i:xi ∈Ck  (2.12)  i  This statistic allow s for useful operations such as rank ing clusters b y their relevance to the structure of the data. This type of information is very diffi cult to attain using other methods. W e can also measure the separation of clusters b ased on the prob ab ility that the points assig ned to each of the clusters in the unperturb ed clustering stay there under perturb ation minus the prob ab ility that they move to the other cluster. Specifi cally,  IC jk =    1  |Cj | + |Ck |  φij − φik + i:xi ∈Ck  i:xi ∈Cj    φik − φij .  (2.13)  A value of IC jk near 1 indicates that clusters j and k are w ell separated as they ex chang e almost no point mass under perturb ation. A value of 0 27  (or less) means that they are practically indisting uishab le (assuming the method used to g enerate Φ is sensib le). N ote that it is also possib le to construct an asymmetric measure of intercluster stab ility b y measuring the averag e point mass that moves from cluster j to cluster k under perturb ation. F ormally, AssymmetricIC jk = mean φij − φik . i:xi ∈Cj  (2.14)  This could b e useful if the clusters had sig nifi cantly diff erent properties that, for instance, caused some clusters to b e unstab le and g ive point mass to their much more stab le neig hb ors. W hile practically such a b ehavior can b e seen in the heatmap plot, an index that allow s for the detection of such b ehavior could b e useful. Stability Indices for the Clustering F inally, the averag e of the pointw ise stab ility vector g ives us an overall measure of the stab ility of the clustering : APW  = mean PW i  i.  (2.15)  W e refer to this as the averag e pointw ise stab ility of the clustering . In chapter 5, w here w e present our results, w e use the diff erence b etw een the APW of a clustering of the g iven data and the APW on a b aseline null clustering , defi ned in the nex t chapter. W e show that this method can b e as accurate, in terms of predicting the numb er of clusters, as the other leading methods.  2 .2 .3  Ex tensions to Oth er Stab ility Ind ices  Because many of the scalar scalar indices w ork b y comparing tw o partitioning s, it is natural to compare the unperturb ed assig nment matrix and Φ using these indices as an alternative method of ob taining a statistical summary of the clustering . In this section, w e discuss using tw o such indices, the H ub ert-Arab ie Adjusted R and Index and the V ariation of Information. W e introduced b oth of these in the contex t of comparing tw o clustering s in section 1.2.2. R ecall from section 1.2.2 that the n-invariant form of the H ub ert-Arab ie Adjusted R and Index is g iven b y  28  j  AR = 1 2  j  pA j  2  k  p2jk −  +  k  j  pB k  pA j  2  2 k  −  j  pB k  pA j  2  2 k  pB k  2  (2.16) and the V ariation of Information (section 1.2.2) is g iven b y A pA j log pj −  VI = − j  B pB k log pk − 2  pjk log j  k  k  pjk B pA j pk  (2.17)  W e have the prob ab ilities needed to calculate these from A and Φ as follow s: pA The prob ab ility of a point b elong ing to cluster j in the unj perturb ed clustering is simply the total numb er of points in that cluster divided b y the total numb er of points, so pA j = pB k  1 n  aij .  (2.18)  i  L ik ew ise, The prob ab ility of a point b elong ing to cluster k in the averag ed assig nment matrix is simply the total point mass in that cluster divided b y the total numb er of points, so 1 φik . (2.19) pB k = n i  pjk  The prob ab ility that a point b elong s to cluster j in the unperturb ed clustering and to cluster k under perturb ation. R ecall that the matching matrix M g ives the total point mass shared b etw een clusters the unperturb ed clustering and clusters in the averag ed assig nment, so pjk =  mjk n  (2.20)  W e now have everything needed to calculate VI and AR . H ow ever, w e ob served that these indices do not w ork as w ell as the averag e pointw ise 29  stab ility, so w e present them here as possib le ex tensions and instead recommend using the averag ed pointw ise stab ility for g eneral use. This hints that in b ypassing the lab el matching prob lem – as these indices do – they discard important information ab out the dataset and the clustering .  2 .3  Conclu sion  In this chapter, w e have presented a g eneral framew ork for clustering validity b ased on introducing perturb ations into the clustering function. This method may b e seen as a natural ex pansion of clustering stab ility analysis to include more g eneral perturb ations of the clustering function. This ex pansion is consistent w ith the Bayesian tenet of conditioning on the data b ut ex pressing uncertainty in the modeling procedure. N ow that w e have presented the framew ork at a hig h, ab stract level, w e g o throug h the details of how w e introduce the perturb ation into the clustering function and how w e select a prior.  30  Ch ap ter 3  Bay esian Clu ster Stab ility Alg orith m s W hile the previous chapter proposed a g eneral framew ork to use in developing cluster validation techniq ues, this chapter uses this framew ork to develop alg orithms for cluster validation. R ecall that the k ey idea in this framew ork is to alter the clustering function to tak e a parameter λ w hich index es some sort of perturb ation, then integ rate out λ over a prior π(λ|θ) to g et an averag e assig nment matrix Φ. F or reference, recall here the principle eq uation as  Φ = [φij ] =  A (λ)π(λ; θ)dλ  (3.1)  w here A (λ) is the perturb ed assig nment matrix returned b y the modifi ed clustering function, A (λ) = [aij (λ)] = C (K, X , λ).  (3.2)  At this point, there are tw o outstanding issues to address. F irst, w hat part of the clustering function should b e perturb ed? Certainly, there are numerous possib ilities. Show ing that a particular perturb ation techniq ue is “ optimal” w ould b e ex tremely diffi cult, and the proof w ould lik ely b e hig hly dependent on a particular application, clustering alg orithm, or type of data. H ow ever, it is far less diffi cult to demonstrate that a particular techniq ue can w ork w ell on several types of data and a variety of clustering alg orithms. In this chapter and later, w e advocate scaling the distance metric used to partition the data points b y the random variab le 31  index ing the perturb ation. W e demonstrate in section 3.1 that this method has some desirab le g eometric q ualities. In chapter 5, w e show that it can perform q uite w ell, matching or outperforming other leading methods. O n the theoretical side, w e show in chapter 6 w e show that it has some nice properties as the dimension increases. Second, g iven the perturb ation method, w hich prior should b e used? Ag ain, there are numerous possib ilities to consider, and proving that a prior is somehow optimal is q uite diffi cult and w ould lik ely depend sig nifi cantly on the type of data and other factors; thus it is b eyond our scope. O ur aim here is to fi nd a prior that w ork s w ell on a w ide variety of data and can b e calculated q uick ly enoug h to b e used in a w ide variety of applications. Ultimately, w e leave the full prior selection q uestion for future research, and instead focus on choosing the b est prior out of a class of priors. W e choose the hyperparameters of the prior to max imiz e the diff erence in overall stab ility ag ainst a b aseline null clustering and the overall stab ility of the clustered data. F or a class of perturb ations b ased on distance scaling , this techniq ue turns out to w ork q uite w ell, as w e demonstrate in chapter 5. W e propose several classes of priors for the distance scaling perturb ations, a location ex ponential and shifted G amma distrib ution. In section 3.3.1, w e derive an effi cient alg orithm to calculate the averag ed assig nment matrix Φ w ith a location ex ponential prior in O(nK log K) time. In section 3.4.1, w e do the same for the shifted G amma prior, thoug h this alg orithm req uires O nK 3 time to calculate. H ow ever, w e show in section 3.6 that these prior classes have several nice properties, the most sig nifi cant b eing provab le b ounds on the error if some of the centroids are ex cluded from the calculation; this can vastly improve the running time. F inally, in section 3.5, w e propose a g eneral and reasonab ly fast M onte Carlo alg orithm, w hich allow s the partial assig nment matrix to b e calculated provided the prior can b e sampled from.  3 .1  P ertu rb ing th e d istance m etric  In the alg orithms w e discuss here, w e restrict ourselves to clustering functions w here the fi nal partitioning depends on the distances to a set of centroids or ex emplar points, and data points are assig ned to the nearest centroid or ex emplar. Thus w e assume the fi rst stag e of the clustering alg orithm returns an n × K matrix D = [dij ], w here dij is the distance b etw een point i and centroid (or ex emplar) j. Then the assig nment matrix A in the unperturb ed clustering w ould b e  32  aij =  1 dij ≤ di ∀ = j . 0 otherw ise  (3.3)  W e propose that scaling the distance metrics used to partition the points is a reasonab le w ay to introduce the perturb ation. Specifi cally, g iven D = [dij ], the perturb ed assig nment matrix A (λ) w ould b e aij (λ) =  1 dij λj ≤ di λ ∀ = j , 0 otherw ise  (3.4)  and the elements of Φ = [φij ] can b e ex pressed as φij =  I[dij λj ≤ di λ ∀ = j]π(λ; θ)dλ.  (3.5)  F or convenience, w e here introduce an alternative functional formulation of eq uation (3.5) that ex plicitly denotes the dependence on the distances. L et di = (di1 , di2 , ..., diK ) b e the vector of points to centroids. Then, defi ne the function ψ as ψj (di , θ) = φij =  I[dij λj ≤ di λ ∀ = j]π(λ; θ)dλ.  (3.6)  W e use this formulation often in sub seq uent proofs. F or several reasons, w e b elieve this is a reasonab le w ay to introduce the perturb ations. F irst, b ecause the distance measures are the statistics used to partition the data points, chang ing them suffi ciently w ill result in an altered confi g uration. Second, it g ives us an ex cellent g eometrical interpretation w hich w e elab orate on nex t. Third, w e are ab le to derive computationally effi cient and accurate alg orithms to calculate φij – for a location ex ponential prior, the most useful prior w e’ve found, it can b e done in amortiz ed O(log K) time. F inally, w e demonstrate in chapter 5 that it can yield ex cellent results that match or ex ceed all other methods w e’ve compared ag ainst.  3 .1 .1  Intu itiv e Und erstand ing  R estating eq uation (3.5) in terms of prob ab ilities g ives us φij = P(dij λj ≤ di λ ∀ )  (3.7)  φij is, intuitively, a measure of how “ competitive” the distances to the clusters are. If cluster j is a clear w inner – meaning it is sig nifi cantly closer than the others – then P(dij λj ≤ di λ ∀ ) w ould b e hig her as the condition 33  w ould b e true for a larg er set of λ. Conversely, if the distances to another cluster is similar to the distance to cluster j, φij w ould b e less as the event {dij λj ≤ di λ ∀ } w ould occur for a smaller set of λ. E ssentially, then the values in Φ provide information ab out the relative density of points close to the b oundary reg ions versus the density around the cluster centers.  3 .1 .2  Ex am p le: Ex p onential P rior  At this point, it is helpful to consider an ex ample prior for scaled distance b ased perturb ations that has an intuitive analytic evaluation for φ. The priors that w e ultimately propose for use in actual cluster validation w ork yield a very unintuitive eq uation for φij (see section 3.3.1, b ut w ork much b etter than this one in practice. W e fi rst present the result as a proof and then discuss the fi nal eq uation. Theorem 3.1.1. φij with an exp onential p rior. L e t φij be d e fi n ed as in eq u atio n (3 .5 ), an d le t π(λ) =  E xp(λ |θ) =  θe −θ λ .  (3.8)  T h en d−1 ij  φij =  d−1 i  (3.9)  P ro o f. F or convenience, w e drop the i’s as they are constant throug hout the proof. By defi nition, w e have that  34  φj =  E xp(λ |θ)dλ  I[dj λj ≤ d λ ∀ = j] θe −θλ  =  I[dj λj ≤ d λ ]θe−θλ  (3.10)  dλ  (3.11)  =j ∞  = 0 ∞  = 0 ∞  = 0    θe−θλ    θe−θλ   dj λj /d  exp − =j  θ exp−1 + θ  dj =j d  θ 1+  =  =j     = =  ∞  1  0  =j ∞    θe−θλ dλ  dλj  (3.12)    θdj λj  dλj d   dj   θλj dλj d    Exp λj |θ1 +  (3.13)  (3.14 )  =j   dj  dλj d  dj =j d  1+ d−1 j  (3.15 ) (3.16 ) (3.17 )  d−1  T h is resu lt – th a t th e sta b ility is in dic a ted b y th e n o rm a liz ed in v erse dista n c e m etric s – is a sen sib le. If o n e o f th e dista n c es is q u ite sm a ll rela tiv e to th e o th ers, th e c o rrespo n din g en try in φij w ill b e c lo se to 1. S im ila rly , if a ll o f th em a re c lo se to th e sa m e v a lu e, φij b ec o m es c lo se to 1/K.  3.2  Prior Selection Using a Baseline Null Distribution  T h e fu ll q u estio n o f prio r selec tio n is h isto ric a lly a n extrem ely diffi c u lt pro b lem , a n d th e presen t c a se is n o diff eren t. A s a resu lt, fi n din g a g o o d prio r, o r c la ss o f prio rs, is a n eff o rt/ rew a rd tra de-o ff – th e g o a l is to fi n d so m e g u idin g prin c iples th a t a re rela tiv ely ea sy to w o rk w ith b u t g rea tly im pro v e th e  35  tech n iq u e. W h ile th ere a re su rely a dditio n a l u n disc o v ered prin c iples b ey o n d th e o n e w e o u tlin e h ere, th e fo llo w in g prin c iple h a s pro v ed v ery u sefu l.  3.2.1  O bserv ations from H igh Dim ensions  In h ig h er dim en sio n s, th e dista n c es b etw een po in ts b ec o m e sim ila r a n d m a n y m eth o ds b rea k do w n . M u ch litera tu re h a s b een pu b lish ed o n th is eff ec t. In fa c t, if ea ch dim en sio n c o m po n en t xi in a set o f ra n do m da ta is i.i.d. fro m a c o n tin u o u s distrib u tio n , m a xi=j xi − xj m in i=j xi − xj  2  −→P  1  (3.18 )  2  a s th e dim en sio n g o es to in fi n ity . T h is is pro v ed b y H in n eb u rg et. a l. [H A K 00] (th eir v ersio n o f th e pro o f h a s m o re g en era l c o n ditio n s th a n i.i.d., b u t i.i.d. a n d c o n tin u o u s is su ffi c ien t.) T h e im plic a tio n o f th is fo r u s is th a t c lu sterin g a lg o rith m s in h ig h dim en sio n s o ften pro du c e a c lu sterin g b a sed o n v ery sm a ll rela tiv e diff eren c es in dim en sio n . T h is do es n o t n ec essa rily im ply th e c lu sterin g is u n sta b le; ra th er, it refl ec ts th e fa c t su ch sm a ll diff eren c es c a rry sig n ifi c a n t in fl u en c e in determ in in g th e o u tc o m e. T h is o b serv a tio n is so m eth in g th a t c lu ster v a lida tio n sch em es m u st ta k e in to c o n sidera tio n s.  3.2.2  T unable Priors  In o u r c a se – sc a led dista n c e pertu rb a tio n s, w h ere th e sc a lin g is determ in ed b y a prio r – th is m ea n s th a t w e m u st b e a b le to a dju st th e sev erity o f th e sc a led dista n c e pertu rb a tio n s to a c c o m m o da te th is eff ec t. In o th er w o rds, w e m u st b e a b le to a dju st π(λ|θ) so th a t P(dij λj ≤ di λ ∀ ) h a s th e pro per sen sitiv ity to th e m a g n itu de o f th e rela tiv e diff eren c es b etw een th e dij ’s. W e su spec t th a t b ein g a b le to rea dily tu n e o u r m eth o d is o n e th in g th a t g iv es it a n edg e o v er th e o th er m eth o ds fo r testin g sta b ility in h ig h er dim en sio n s, a s n o o th er m eth o ds th a t w e k n o w o f explic itly a c c o u n t fo r it. B o th o f th e prio r c la sses w e in tro du c e in sec tio n s 3.3 a n d 3.4 c a n b e tu n ed to b e a rb itra rily sen sitiv e, a resu lt w e pro v e fo rm a lly in ch a pter 6 . F ig u re F ig u re 3.1 illu stra tes th is b y fi xin g th e c lu ster c en ters a n d plo ttin g th e po in tw ise sta b ility o f th e c o o rdin a tes (x, y ) a ro u n d th ese c en ters. T h is is do n e fo r sev era l v a lu es o f th e prio r h y perpa ra m eter θ. A s c a n b e o b serv ed, a s th e v a lu e o f θ in c rea ses, po in ts m u st b e m u ch c lo ser to th e b o u n da ry reg io n s b efo re th ey a re u n sta b le. A s disc u ssed a b o v e, th is is h ig h ly a dv a n ta g eo u s in h ig h er dim en sio n s. 36  Pointwise Stability for β = 0.100000  Pointwise Stability for β = 0.750000  2.0  2.0  1.5  1.5  1.0  1.0  0.5  0.5  0.0  0.0  −0.5  −0.5  −1.0  −1.0  −1.5  −1.5  −2.0 −1.5  −1.0  −0.5  0.0  0.5  1.0  1.5  −2.0 −1.5  −1.0  (a) β = 0.1  −0.5  0.0  0.5  1.0  1.5  (b ) β = 0.7 5  Pointwise Stability for β = 2.000000  Pointwise Stability for β = 10.000000  2.0  2.0  1.5  1.5  1.0  1.0  0.5  0.5  0.0  0.0  −0.5  −0.5  −1.0  −1.0  −1.5  −1.5  −2.0 −1.5  −1.0  −0.5  0.0  0.5  (c ) β = 2  1.0  1.5  −2.0 −1.5  −1.0  −0.5  0.0  0.5  1.0  1.5  (d ) β = 10  F ig u re 3.1: P lo ts o f th e po in tw ise sta b ility a t ea ch po in t o f a 2d pla in g iv en th e c lu ster c en ters. In o u r c a se, w e u se a b a selin e n u ll c lu sterin g – a c lu stered distrib u tio n o f po in ts in w h ich a n y c lu sters fo u n d a re spu rio u s, ra n do m eff ec ts ra th er th a n tru e c lu sters – to tu n e th e prio r to a c c o u n t fo r th e eff ec ts o f h ig h dim en sio n s. T h e b a selin e da ta n eeded fo r o u r pro c edu re is a po in t-to -c en tro id dista n c e m a trix th a t represen ts a c lu stered da ta set th a t h a s n o a c tu a l c lu sters. T o tu n e th e prio r, π(λ|θ), w e ch o o se th e θ th a t m a xim iz es th e diff eren c e b etw een th e a v era g e po in tw ise sta b ility o f th e c lu sterin g w e a re exa m in in g a n d th e b a selin e c lu sterin g . U sin g th is prio r in o u r m eth o d, in prin c iple, diff eren tia tes b etw een a b a d c lu sterin g , w h ich h a s m o re po in ts in th e u n sta b le b o u n da ry reg io n s, a n d a g o o d c lu sterin g , w h ich do es n o t. A n o th er w a y o f sa y in g th is is th a t w e ch o o se th e pa ra m eters fo r th e 37  prio r a s th o se w h ich m a xim iz e th e diff eren c e b etw een th e sta b ility o f th e tru e da ta a n d th e sta b ility o f th e b a selin e. T h is c a u ses th e dista n c e m etric to “ fi t” th e reg io n s o f sta b ility to th e tru e c lu sters. In th e b a selin e, th e den sity o f th e da ta set a ro u n d th e b o u n da ry reg io n s is ro u g h ly th e sa m e a s th a t c lo ser to th e c lu ster c en ters. T h u s m a xim iz in g th e diff eren c e pen a liz es prio rs th a t fo rc e po in ts to b e v ery c lo se to th e b o u n da ry reg io n s b efo re th ey c a n b e c o n sidered sta b le. B ec a u se th e b a selin e c a n b e m a de a rb itra rily sta b le (a ssu m in g th e prio rs w e pro po se la ter in th is ch a pter) b y ch o o sin g th e prio r pa ra m eters so th a t o n ly po in ts v ery c lo se to th e b o u n da ry reg io n s a re u n sta b le, th is pen a liz es prio rs th a t c a u se to o m u ch o f th e spa c e to b e c o n sidered sta b le. C o n v ersely , if th e den sity a ro u n d th e c lu ster c en ters o f th e tru e da ta rea lly is h ig h er th a n th a t c lo se to th e b o u n da ries, th en m a k in g to o little o f th e spa c e w ill dec rea se th e diff eren c e. T h e o ptim u m b a la n c es th e tw o , a n d in pra c tic e th is w o rk s v ery w ell. T h is c a n b e fo rm u la ted a s a n o ptim iz a tio n pro b lem . L et A P W D (θ) den o te th e diff eren c e in a v era g e po in tw ise sta b ility , i.e. A P W  D (θ ) = A P W  (D, θ ) −  ⇒ θ = a rg m a x A P W  1 NB  = a rg m a x A P W  (DB i ,θ )  1 NB  A P W  (3.19 )  i  (D, θ ) −  θ  A P W  (DB i ,θ )  (3.20)  i  D (θ )  (3.21)  θ  w h ere w e express th e a v era g ed po in tw ise sta b ility , A P W , a s a fu n c tio n o f th e po in t to c en tro id dista n c e m a trix D a n d th e prio r h y perpa ra m eter θ. W e u se th e a v era g e o v er NB b a selin e dista n c e m a tric es a s th e b a selin e A P W . B ec a u se it is a n o ptim iz a tio n pro b lem , th e a lg o rith m s a re m u ch m o re effi c ien t if th ey c a n m a k e u se o f th e fi rst deriv a tiv e o f A P W D (θ) w ith ∂φ respec t to θ; th u s w e a lso deriv e ∂θij .  3.2.3  T y p es of Baseline Distance M atrices  M ea su rin g th e sta b ility o f a b a selin e n u ll c lu sterin g is o ften u sed to a c c o u n t fo r depen den c e o f th e m ea su re o n n o n -c lu sterin g eff ec ts o f th e da ta . T h e g a p sta tistic u ses a c lu sterin g o n a u n ifo rm distrib u tio n to a dju st fo r th e depen den c e in th e m ea su re o n th e n u m b er o f c lu sters. H o w ev er, th e g a p sta tistic is q u ite sen sitiv e to th e distrib u tio n o f th e n u ll da ta sin c e it depen ds o n o n ly th e c lo sest c lu ster. In c o m pa riso n , c a lc u la tin g φij depen ds o n th e distrib u tio n o f dista n c es, so it is fa r less sen sitiv e to th e n u ll da ta . T h is 38  a llo w s u s sig n ifi c a n tly m o re fl exib ility in h o w w e g en era te th e dij ’s fo r th e b a selin e. U ltim a tely , w e c o n sider th ree ty pes o f b a selin e dista n c e m a tric es to u se to tra in th e prio r. T h e fi rst c o m es fro m th e b a selin e distrib u tio n pro po sed prev io u sly a s pa rt o f th e g a p sta tistic [T W H 01], n a m ely rec lu sterin g a set o f po in ts distrib u ted u n ifo rm ly in th e b o u n din g b o x o f th e da ta set a lo n g its P C A c o m po n en ts. T h e sec o n d is a lso u ses a u n ifo rm da ta set in th e sa m e b o u n din g b o x, b u t rec y c les th e sa m e c lu ster c en ters. T h e th ird u ses a n idea fro m n o n pa ra m etric s a n d sim ple u ses a ra n do m perm u ta tio n o f th e elem en ts in th e dista n c e m a trix. In ch a pter 5 w e sh o w th a t in h ig h er dim en sio n s a n d o n m o re diffi c u lt da ta sets, a ll th ree b a selin e distrib u tio n s pro du c e c o m pa ra b le resu lts, b u t th e ch ea pest c o m pu ta tio n a lly , b y fa r, is th e th ird. Re c lu ste re d, U n ifo rm ly Distrib u te d Da ta T ib sh ira n i [T W H 01] pro po ses ru n n in g th e c lu sterin g a lg o rith m o n a da ta set u n ifo rm ly distrib u ted in th e b o u n din g b o x o f th e o rig in a l da ta a lo n g its P C A c o m po n en ts. T h ey ju stify th is em piric a lly a n d b y n o tin g th a t in o n e dim en sio n , a u n ifo rm distrib u tio n is pro v a b ly th e m o st lik ely to c o n ta in spu rio u s c lu sterin g s a n d pro v ide a n in tu itiv e a rg u m en t th a t th is g en era liz es in to h ig h er dim en sio n s. O n e do w n side to th is is th a t it do es req u ire ru n n in g th e c lu sterin g a lg o rith m m u ltiple tim es. In u sin g th is b a selin e fo r o u r m eth o d, w e fi n d th a t, in lo w er dim en sio n s, it w o rk s b est w ith da ta less ea sily desc rib ed b y sim ple, sy m m etric disstrib u tio n s su ch a s 2 dim en sio n a l A N O V A da ta . H o w ev er, in h ig h er dim en sio n s a n d w ith less stru c tu red da ta , th e n ext m eth o ds th a t w e pro po se y ield c o m pa ra b le o r ev en su perio r resu lts w ith o u t req u irin g m u ltiple ru n s o f th e c lu sterin g fu n c tio n . U n ifo rm ly Distrib u te d Da ta In th e g a p sta tistic , th e pu rpo se o f rec lu sterin g th e da ta is to c a ptu re spu rio u s c lu sterin g s in th e da ta set a n d th u s a c c o u n t fo r th o se in th e c lu sterin g o n rea l da ta . H o w ev er, sin c e o u r pu rpo se is to tu n e th e prio r, w e c a re m o re a b o u t th e estim a ted den sity o f po in ts b etw een th e c lu ster reg io n s th a n a b o u t spu rio u s c lu sterin g s. T h is pro v ides so m e ju stifi c a tio n fo r sim ply rec y c lin g th e c en tro ids fro m th e o rig in a l da ta set (w e pro v ide so m e em piric a l ju stifi c a tio n in ch a pter 5 ). T h is b o th sa v es th e c o m pu ta tio n a l expen se o f reru n n in g th e c lu sterin g a lg o rith m a n d y ields th e n a tu ra l g eo m etric in terpreta tio n th a t tu n in g th e prio r is a dju stin g th e sh a rpn ess o f th e g eo m etric pa rtitio n in g o f th e spa c e. 39  P e rm u ta tio n o f th e Dista n c e M a trix T h is m eth o d, in spired b y perm u ta tio n tests fro m n o n pa ra m etric s, sim ply u ses a ra n do m perm u ta tio n o f a ll th e elem en ts in th e dista n c e m a trix. P erm u ta tio n tests [R W 7 9 ] a re u sed in n o n -pa ra m etric sta tistic s a s a w a y o f o pera tio n a liz in g n u ll h y po th esis testin g w ith o u t h a v in g to spec ify a n u ll distrib u tio n prec isely . H ere, w e tu n e th e prio r to m a xim a lly distin g u ish b etw een th e n u ll distrib u tio n a n d th e tru e distrib u tio n . F u rth erm o re, it h a s th e a dv a n ta g e o f b ein g distrib u tio n free in th e sen se th a t o n ly th e o rig in a l dij ’s a re n eeded; th is c a n b e a h u g e a dv a n ta g e fo r da ta ty pes w h ere it is n o t c lea r h o w to g en era te a n u ll distrib u tio n .  3.3  Scaled -d istance Perturbations w ith a L ocation E x p onential Prior  T h e m o st u sefu l prio r w e’v e fo u n d th u s fa r o n sc a led dista n c e pertu rb a tio n s is th e expo n en tia l distrib u tio n w ith th e lo c a tio n sh ifted b y 1: π(λ|θ) = LocExp(λ|θ) = θe−θ(λ−1) 1λ≥1 .  (3.22)  N o te th a t sin c e fa c to rs th a t sc a le th e en tire distrib u tio n dro p o u t (sec tio n 3.1.2), it w o u ld b e redu n da n t to h a v e pa ra m eters c o n tro lin g b o th th e sh ift a n d th e slo pe; th u s w e set th e slo pe to 1. W e den o te th e c o rrespo n din g a v era g ed a ssig n m en t m a trix w ith a su persc ript LE:  ΦLE = [φLijE ] =  I[dij λj ≤ di λ ∀ ]  LocExp(λ |θ ) dλ  (3.23)  A lth o u g h w e deriv e a n a n a ly tic a l w a y to c a lc u la te th is fo r th e c a se w h en th e pa ra m eter θ c a n v a ry b etw een prio rs, in pra c tic e a n d in o u r tests w e tied th em a ll to th e sa m e v a lu e, so θ = θ. T h e eq u a tio n s fo r th is c a se a re a stra ig h tfo rw a rd sim plifi c a tio n o f th e m o re g en era l resu lt. B y sh iftin g th e expo n en tia l, φij = P(dij λj ≤ di λ ∀ ) is n o lo n g er in depen den t o f th e sc a lin g pa ra m eter θ. T h is g iv es u s a tu n in g pa ra m eter to a dju st th e deg ree o f pertu rb a tio n – n o te th a t a s θ in c rea ses, th e prio r pu ts its m a ss c lo ser 1, c a u sin g φij to resem b le th e in dic a to r fu n c tio n I[dij ≤ di ∀ ] u n less th e diff eren c es in di· ’s a re su ffi c ien tly sm a ll. A s w e disc u ss in sec tio n 3.2.2, th is a llo w s u s to o ptim a lly fi t th e pertu rb a tio n to th e c lu sterin g . W h ile it do es m a k e th e deriv a tio n su b sta n tia lly m o re diffi c u lt th a n th e m o re in tu itiv e n o n -sh ifted expo n en tia l desc rib ed in sec tio n 3.1.2, th e c a lc u la tio n is still q u ite effi c ien t. W e deriv e a n a lg o rith m in sec tio n 3.3.1 th a t 40  c a lc u la tes ΦLE in O(nK lo g K). A dditio n a lly , in sec tio n 3.6 , w e u se th e la ck o f su ppo rt in th e prio r distrib u tio n fo r λ < 1 to pro v e a rig o ro u s a n d u sefu l lo w er b o u n d o n th e a c c u ra c y w h en w e ig n o re sets o f dista n c es la rg er th a n a n ea sily c o m pu ted th resh o ld.  3.3.1  A naly tic C alculation of φLE  In th is sec tio n , w e deriv e a c o m pu ta tio n a lly sim ple a lg o rith m to c o m pu te φLE , sc a led dista n c e pertu rb a tio n s w ith a lo c a tio n expo n en tia l prio r. W e do th e c a lc u la tio n in depen den tly fo r ea ch ro w o f φ, so fo r n o ta tio n a l c o n v en ien c e w e dro p th e ro w in dex i. P ro p o sitio n 3 .3 .1. Ca lc u la tio n o f φLE Suppose d is a list of K distances and θ is a list of K slope parameters for a location exponential prior. Let ψ be a bijective mapping of the indices, {1, ..., K} → {1, ..., K}, that puts d in sorted order, i.e. dψ(1) ≤ dψ(2) ≤ · · · ≤ dψ(K) .  (3.24 )  Let j  Bj = k=1  Cj =  θψ(k) dψ(k)  (3.25 )  1  j=1  exp −  dψ(j) dψ(k)  j−1 k=1 θψ(k)  K  Ck Bk−1 Bk−1  Dj = k=j+ 1  −1  dψ(k) +1 θψ(k)  j ∈ {2, ..., K}  (3.26 )  −1  (3.27 )  T hen φLjE =  θj dj  Cψ−1 (j) − Dψ−1 (j) Bψ−1 (j)  P roof. F ro m eq u a tio n (3.23), w e h a v e th a t  41  (3.28 )  φLjE =  πj (λj , θj )  I[dj λj ≤ d λ ]π (λ , θ )dλ dλj  (3.29 )  =j ∞  ∞  ∞  θj e−θj (λj −1)  = 1  θ e−θ =j  (λ −1)  δ((d λ − dj λj ) − t )dt dλ dλj 0  1  (3.30) b  w h ere δ(·) is th e D ira c delta fu n c tio n , defi n ed su ch th a t a δ(x − t)dt eq u a ls ∞ o n e if a ≤ x ≤ b a n d z ero o th erw ise, so 1a≤b = 0 δ(b−a−t)dt, a n d f (x) = f (t)δ(x − t)dt [A W R 9 6 ]. W e c a n th en sw itch th e o rder o f in teg ra tio n a n d in teg ra te o v er λ fi rst: ∞  φLjE =  ∞  θj e−θj (λj −1) 1  =j  0  θ −θ e d  “t  ∞  θj e−θj (λj −1)  = 1  exp −θ =j  +dj λj d  −1  ”  1h t  +dj λj d  θ dj λj −1 − t d d  ≥1  i  dt dλj  (3.31) dλj  t =m a x (d −dj λj ,0)  (3.32) ∞  θj e−θj (λj −1)  = 1  =j  dj d λj −1  exp −θ 1  λj ≥ ddj o th erw ise  dλj  (3.33)  A t th is po in t, w e c a n u se th e fa c t th a t m a n y o f th e term s in th e pro du c t a re o n e fo r c erta in reg io n s o f th e in teg ra tio n to b rea k th e reg io n o f in teg ra tio n in to sec tio n s. If w e so rt th e term s b y in c rea sin g d , w e c a n h a n dle th e c o n ditio n a l term s in th e pro du c t b y b rea k in g th e in teg ra tio n u p in to K + 1 po ssib ly em pty in terv a ls, do in g th e in teg ra tio n sepa ra tely o n ea ch in terv a l. R ec a ll th a t ψ is a b ijec tiv e m a ppin g su ch th a t dψ(1) ≤ dψ(2) ≤ · · · ≤ dψ(K) . T h en th e b o u n da ries o f th ese reg io n s a re g iv en b y  Am =  dψ(m) /dj ∞  m ∈ {1, ..., K} m=K +1  (3.34 )  T h e in terv a ls to in teg ra te o v er a re th en (0, A1 ] ∩ [1, ∞) , (A1 , A2 ] ∩ [1, ∞) , ..., (AK , AK+  1  = ∞) ∩ [1, ∞) .  A s Am = dψ(m) /dj ≤ 1 fo r m < ψ −1 (j), th e fi rst ψ −1 (j) o f th ese in terv a ls a re em pty . T h u s th e in teg ra tio n b ec o m es 42  K  Ak+  φj =  k  1  LE  θj e k=ψ −1 (j) K  −1  −θj (λj −1)  e−θψ(m) (Am  Ak Ak+  dλj  (3.35 )  m=1 ψ(m)=j k  1  θψ(m) (A−1 m λj − 1) dλj  θj exp −  = k=ψ −1 (j)  λj −1)  Ak  (3.36 )  m=1  w h ere w e’v e u sed th e fa c t th a t Aψ−1 (j) = 1 to sim plify th e expressio n . T h e in teg ra l c a n n o w b e ea sily ev a lu a ted: K  φLjE =  k  Ak+  θψ(m)  θj exp m=1  k=ψ −1 (j)  k  1  θψ(m) A−1 λj dλj m  exp − Ak  m=1  (3.37 ) K  = k=ψ −1 (j)     θj  exp  k m=1 θψ(m)  k −1 m=1 θψ(m) Am  λj =Ak  k  θψ(m) A−1 m λj  exp  (3.38 )  m=1  λj =Ak+  1  If w e c o llec t sim ila r term s a n d u se th e g iv en defi n itio n s o f C a n d D, th is c a n b e expressed a s: φLjE =  θj dj  K k=ψ −1 (j)  1 (Ck − Ck+ 1 ) Bk  (3.39 )  N o w th e C term s m a y b e q u ite c lo se, so fo r b etter n u m eric a l sta b ility w e c a n re-express eq u a tio n (3.39 ) to a v o id w o rk in g w ith th e diff eren c e o f tw o sim ila r n u m b ers: θj dj  Cψ−1 (j) Cψ−1 (j)+ 1 Cψ−1 (j)+ − + Bψ−1 (j) Bψ−1 (j) Bψ−1 (j)+  Cψ−1 (j)+ Bψ−1 (j)+  CK B K 1 1 (3.4 0) w h ere th e a rg u m en ts to φLjE , B, a n d C a re im plied. N o w fo r b etter n u m eric a l sta b ility – n eeded h ere a s Bk−1 a n d Bk a re lik ely to b e c lo se to g eth er – −1 Bk−1 − Bk−1 c a n b e expressed a s φLjE =  43  1  −  2  + ··· +  1 Bk−1  −  θψ(k) 1 1 = − Bk−1 + Bk Bk−1 dψ(k) = Bk−1 Bk−1  dψ(k) +1 θψ(k)  −1  (3.4 1) −1  (3.4 2)  P u ttin g th is b a ck in to eq u a tio n (3.4 0) g iv es:  C −1 θ j  ψ (j) − φLjE = dj Bψ−1 (j) =  θj dj  K  Ck Bk−1 k=ψ −1 (j)+ 1  dψ(k) Bk−1 +1 θψ(k)  Cψ−1 (j) − Dψ−1 (j) Bψ−1 (j)  −1     (3.4 3)  (3.4 4 )  w h ich c o m pletes th e th eo rem . W e c a n prec o m pu te c o m m o n term s to c rea te a n effi c ien t a n d a c c u ra te a lg o rith m , w h ich w e presen t in a lg o rith m (1). T h e ru n n in g tim e o f th is a lg o rith m is lin ea r in K sa v e fo r th e c a ll to a rg so rt (d) to so rt th e dista n c es, w h ich ru n s in O(K lo g K) tim e. T h u s th e o v era ll ru n n in g tim e is O(K lo g K).  3.3.2  A naly tic C alculation of ∂φLjE /∂θ  A s w e disc u ssed prev io u sly , w e n eed to fi n d th e o ptim u m v a lu e fo r θ b y m a xim iz in g th e diff eren c e b etw een th e a v era g ed po in tw ise sta b ility o f th e c lu sterin g a n d th a t o f th e n u ll dista n c e m a trix. W h ile th ere a re a lg o rith m s th a t do n o t req u ire th e g ra dien t o r fi rst deriv a tiv e o f th e fu n c tio n , th ese c a n ta k e a v ery lo n g tim e, espec ia lly in h ig h er dim en sio n s. T h is rela tes direc tly to th e A P W , a s ∂ A P W ∂θ ∂ P W ∂θ  = m ea n i  i  =  E ∂φLih i  ∂θ  ∂ P W i ∂θ ∂φLigEi − ∂θ  (3.4 5 ) (3.4 6 ) (3.4 7 )  w h ere  44  A lg o rith m 1: C a lc u la tio n o f sc a led dista n c e pertu rb a tio n s w ith a lo c a tio n expo n en tia l prio r. In p u t: A v ec to r d o f dista n c es a n d a v ec to r θ o f prio r pa ra m eters. O u tp u t: A pro b a b ility v ec to r φLE o f pa rtia l m em b ersh ips. K ← len g th d, ψ ← a rg so rt (d) In itia liz e ds , θs , B, C, D, r a s v ec to rs o f len g th K. fo r j = 1 to K do  dsj ← dψ(j) , θjs ← θψ(j) , rj ← θjs /dsj  B1 ← r1 , t ← 0, C1 ← 1 fo r j = 2 to K do Bj ← Bj−1 + rj s t ← t + θi−1 Cj ← exp t − Bj−1 dsj en d DK ← 0 fo r j = K − 1 to 1 ste p −1 do Dj ← Dj+ 1 + Cj+ 1 / Bj Bj rj+ en d  1  +1  E fo r j = 1 to K do φLψ(j) ← rj Cj /Bj − Dj  re tu rn φLE  hi = a rg m a x aik  (3.4 8 )  gi = a rg m a x φik  (3.4 9 )  k  =hi  T o m a k e th e o ptim iz a tio n pro c ess m o re effi c ien t, w e presen t h ere th e deriv a tio n fo r ∂φLjE /∂θ . A s ∇θ A R a n d ∇θ V I depen d o n th e pa rtia l deriv a tiv es o f φLE w ith respec t to θ, w e h ere deriv e a n a lg o rith m to c a lc u la te ∂φLjE /∂θ in a m o rtiz ed c o n sta n t tim e. F o r sim plic ity , w e a ssu m e in o u r deriv a tio n s th a t th e d’s a re a lrea dy in so rted o rder; w h en w e g iv e th e fi n a l a lg o rith m w e retu rn to u sin g th e so rted in dex m a p. T h en , u sin g φLE fro m eq u a tio n (3.28 ), w e h ave  45  A lg o rith m 2 : C a lc u la tio n o f ∂φ∂ j θ fo r sc a led dista n c e pertu rb a tio n s w ith a lo c a tio n expo n en tia l prio r a n d diff eren t prio r pa ra m eters fo r ea ch K. In p u t: A v ec to r d o f dista n c es a n d a v ec to r θ o f prio r pa ra m eters. O u tp u t: A K × K m a trix R g iv in g  ∂φ LjE ∂θ .  K ← len g th (d), ψ ← m a k eS o rtin g In dexM a p(d) In itia liz e ds , θs , B, C, D, E, F, G, H, L, r a s v ec to rs o f len g th K. In itia liz e R a s a 2d a rra y o f siz e K × K. fo r j = 1 to K do  dsj ← dψ(j) , θjs ← θψ(j) , rj ← θjs /dsj  // Some common terms to simplify the computation. B1 ← r1 , t ← 0, C1 ← 1, H1 ← 1/r1 fo r j = 2 to K do Bj ← Bj−1 + rj s t ← t + θj−1 Cj ← exp t − Bj−1 dsj Ej ← Bj−1 (Bj−1 /rj + 1) Lj ← Cj /Ej 2 /E r θ s Gj ← Lj Bj−1 j j j Hj ← Cj /Bj en d // Now the terms that build on each other. DK ← 0, FK ← 0 fo r j = K − 1 to 1 ste p −1 do Dj ← Dj+ 1 + Lj+ 1 Fj ← Fj+ 1 + Lj+ 1 dsj+ 1 + (2Bj /rj+ 1 + 1)/Ej+ 1 en d // Finally, calculate R. fo r m = 1 to K do Rψ(m),ψ(m) ← rm [Fm /dsm − Dm − Hm /Bm dsm ] + (Hm − Dm )/dsm fo r n = m + 1 to K do Rψ(m),ψ(n) ← rm (Fn /dsn − Dn − Gn ) Rψ(n),ψ(m) ← rn Fn /dsm − Dn + Hn /dsm dsm − dsn − 1/Bn en d en d re tu rn R  46  ∂φLjE  ∂ θj Cj − Dj ∂θ ∂θ dj Bj θj ∂ Cj ∂Dj = − dj ∂θ Bj ∂θ =  (3.5 0) +  Cj − Dj 1 Bj  1 dj  =j  (3.5 1)  N o w a s ev ery th in g b u ilds o n prev io u s deriv a tiv es, let’s sta rt w ith th e deriv a tiv es o f Bj , Cj : ∂Bj ∂ = ∂θ ∂θ  j m=1  θψ(m) 1 = 1 dψ(m) d  ∂Cj ∂ = exp − Bj−1 dj + ∂θ ∂θ  (3.5 2)  ≤j j−1  θm 1j≥2  (3.5 3)  m=1  dj 1<j d dj Cj 1− 1<j − 2 1 d Bj d  (3.5 4 )  = Cj 1 − ⇒  Cj ∂ Cj = ∂θ Bj Bj ∂ ∂ Dj = ∂θ ∂θ  K k=j+ 1  K  Ck = Ek  k=j+ 1  (3.5 5 )  ≤j  1 ∂Ck Ck ∂Ek − 2 Ek ∂θ Ek ∂θ  (3.5 6 )  2 w h ere Ek = Bk−1 dk /θk + Bk−1 1k≥2 . C o n tin u in g :  ∂ ∂Ek = ∂θk ∂θ = ⇒  ∂Dj = ∂θ  Bk−1 Bk−1  dk +1 θk  2Bk−1 dk 1 +1 12≤ θk d K k=j+ 1  Ck 1 1− Ek d  dk +  (3.5 7 )  < k  −  2 d Bk−1 k 12≤ θk2  1 2Bk−1 dk + Ek θk Ek  1  (3.5 8 )  =k  < k  +  2 C d Bk−1 k k 1 θk2 Ek2  =k  (3.5 9 ) K  =        k=j+ 1    Ck dk Ek 1 − d 2 Ck Bk−1 dk θk2 Ek2  −  1 Ek d  0  2Bk−1 dk θk  +1   < k    =k    > k  (3.6 0)  G ro u pin g term s depen den t o n ly o n k to g eth er a llo w s u s to express th is in a fo rm th a t c a n b e c a lc u la ted in c o n sta n t tim e per j, : 47  w h ere    Dj − Fj ∂ d Dj = C B 2−1 d  ∂θ +D − θ2 K  Fj = k=j+ 1  Ck Ek  1+  ≤j F d  > j  2Bk−1 1 dk + Ek θk Ek  (3.6 1)  (3.6 2)  a n d c a n b e c o m pu ted b efo reh a n d. T h e fi n a l a lg o rith m is th en g iv en in a lg o rith m (2). C a lc u la tin g a ll th e in term edia te a rra y s ta k es o n ly O(K) tim e, a n d a n y so rtin g is o n ly O(K lo g K), so th e a lg o rith m ta k es O K 2 tim e o v era ll. O b v io u sly , m u ch o f th e c a lc u la tio n o v erla ps w ith a lg o rith m (1) a n d in pra c tic e th e tw o sh o u ld b e c a lc u la ted to g eth er.  3.3.3  A naly tic C alculation of φLE and ∂φLjE /∂θ for T ied θ  In m a n y c a ses w e w a n t to tie th e prio r h y perpa ra m eters to g eth er, so θ = θ a n d a ll K pertu rb a tio n pa ra m eters h a v e th e sa m e prio r. In th is c a se, w e c a n sim plify so m e a spec ts o f th e c a lc u la tio n . T h e deriv a tio n s a re v ery sim ila r to th e o n es fo r m u ltiple θ, th u s w e lea v e o u t th e tedio u s deta in s. In stea d presen tin g th e fi n a l a lg o rith m to c a lc u la te b o th v ersio n s in a lg o rith m (3). A s c a n b e seen w ith a q u ick c o m pa riso n – espec ia lly th e pa rt w h ich c a lc u la tes th e fi rst deriv a tiv e – is q u ite sim ple c o m pa red to th e pa rt fo r m u ltiple θ’s. A lg o rith m ic a lly , th o u g h , th e ru n n in g tim e is th e sa m e fo r c a lc u la tin g φLE , O(K lo g K). T h is is a g a in b ec a u se o f th e c a ll to a rg so rt; ev ery th in g else h a s lin ea r ru n n in g tim e. C a lc u la tin g ∂φLjE /∂θ , h o w ev er, is n o w lin ea r in K in stea d o f q u a dra tic , so th a t g iv es u s a sig n ifi c a n t speedu p. In th e en d, ΦLE w ith tied prio r h y perpa ra m eters is o u r preferred m eth o d, a n d w e illu stra te in ch a pter 5 th a t its perfo rm a n c e is in deed c o m pa ra b le w ith m a n y o th er lea din g m eth o ds.  3.4  Scaled Distance Perturbations w ith a Sh ifted G am m a Prior  In th is sec tio n , w e slig h tly exten d th e resu lts fo r th e lo c a tio n expo n en tia l distrib u tio n desc rib ed in sec tio n 3.3 b y u sin g a sh ifted v ersio n o f th e G am m a(α = 2, β ) prio r: 48  A lg o rith m 3 : C a lc u la tio n o f sc a led dista n c e pertu rb a tio n s w ith a tied lo c a tio n expo n en tia l prio r. In p u t: A v ec to r d o f dista n c es a n d a sc a la r pa ra m eter θ g iv in g th e prio r. O u tp u t: A pro b a b ility v ec to r φLE o f pa rtia l m em b ersh ips a n d a v ec to r R g iv in g ∂φLjE /∂θ . K ← len g th d, ψ ← a rg so rt (d) In itia liz e ds , B, C, D, E, F, G, H, a n d R a s v ec to rs o f len g th K. fo r j = 1 to K do dsi ← dψ(i) , B1 ← 1/ds0 , D1 ← 1, E1 ← ds1 fo r j = 2 to K do Bj ← Bj−1 + 1/dsj Cj ← j − dsj Bj−1 Dj ← exp[θCj ] Ej ← Dj /Bj Fj ← Dj / Bj−1 (dsj Bj−1 + 1) en d GK ← 0 fo r j = K to 2 ste p −1 do fo r j = 1 to K do  Gj−1 ← Gj + Fj  E ← (Ej − Gj )/dsj , φLψ(j)  E // The rest of the code calculates ∂φLψ(j) /∂θ . HK ← 0, fo r j = K to 1 ste p −1 do Hj−1 ← Hj + Cj Fj  fo r j = 1 to K do  Rj ← (Cj Ej − Hj )/dsj  π(λ|α, β ) = Z(α, β )(λ − 1)α−1 e−β(λ−1) 1λ≥1 = ShiftedG am m a(λ|α, β )  (3.6 3) (3.6 4 )  w h ere Z(α, β ) is a n o rm a liz in g c o n sta n t. T h is sim ple a dditio n sig n ifi c a n tly c o m plic a tes th e fi n a l eq u a tio n s, ra isin g th e o v era ll tim e c o m plexity to O nK 3 . A dditio n a lly , b ec a u se o f n u m eric a l in sta b ility issu es, w e a re fo rc ed to to u se a rb itra ry prec isio n v a ria b les in ev a lu a tin g th e fi n a l a lg o rith m . O n e w a y a ro u n d th is, w h ich a lso h a s its do w n sides, is to u se th e M o n te C a rlo in teg ra tio n m eth o d o u tlin ed in sec tio n 3.5 . T h is m eth o d is still rea so n a b ly 49  effi c ien t, h o w ev er, fo r sm a ll K, so w e presen t it h ere. T h e deriv a tio n s a re stra ig h tfo rw a rd, a lb eit m essy .  A naly tic C alculation of φSG  3.4 .1  As in the derivation with the location exponential prior, we drop all the i’s for notational convenience; this calculation happens once per data point i. We begin the derivation by noting that φSjG =  I[dj λj ≤ d λ ∀ = j]  ShiftedGamma(λ ; α , β )dλ  =  I[dj λj + dj ≤ d λ + d ∀ = j]  =  I[βdj λj + dj ≤ βd λ + d ∀ = j]  Gamma(λ ; α , β )dλ  (3.65) (3.66)  Gamma(λ ; α , 1)dλ. (3.67)  We eventually derive the case for α = α = 2 (recall that the α = 1 case is the location exponential prior); other values would be better handled by using Monte Carlo integration. Now assume that the d’s are in sorted order, so d1 ≤ d2 ≤ · · · ≤ dK .  (3.68)  This can be accomplished by mapping the indices before and remapping them afterwords. For conciseness, we here omit the explicit index mapping used in the derivations of the location exponential prior, but the principle is the same. Recall from the derivation of φLE that we can rewrite an indicator function b using a Dirac delta function. This function is defined such that a δ(x − t)dt ∞ equals one if a ≤ x ≤ b and zero otherwise, so 1a≤b = 0 δ(b − a − t)dt, and f (x) = f (t)δ(x − t)dt [AWR96]. Thus equation (3.65) becomes: φSjG =  π(λ |α, β)δ((λ β d + d ) − (λj βj dj + dj ) − t )dt dλ.  (3.69)  Using the Dirac delta function again,  φSjG =  π  π(λj ) =j  1 1 (t + βj dj λj + dj − d ) dt dλj βd βd 50  (3.70)  We set the shape parameter α to 2 to mak e the derivation tractable while still preserving the desired shape. At this point, we set α = 2, so π(λ ) is π(λ ) = λ e−λ .  (3.71)  Defining Bj (λ) =  β j d j λj + d j − d βd  (3.72)  for convenience, we have  φSjG =  λj e−λj  [t + Bj (λ)] e−t  −Bj (λ)  I[t + Bj (λ) ≥ 0] dt dλj (3.73)  =j  =  (1 + Bj (λ))e−Bj  λj e−λj  (λ)  I[Bj (λ)≥0]  dλj  (3.74 )  =j  To evaluate the integral, we can use the division points at Bj (λ) = 0 ⇔ λj =  d − dj β j dj  (3.75)  to break it up into at most K + 1 discrete regions and integrate each separately. L et d − dj β j dj  Aj =  (3.76)  denote the division points (recall that d· is in sorted order), and for con1 venience, let AK+ = ∞. Note that Ajj = 0, so we only need to integrate j  1 K+ between the division points Ajj , Aj+ j , ..., Aj  K  φSjG = m=j  Am+ j Am j  1  to calculate φSjG :  m=K  1  λj e−λj  (1 + Bj (λ))e−Bj  (λ)  dλj  (3.77)  =1 =j  We use C j = β j d j λj + d j  (3.78)  to transform the variable of integration, yielding, after some straightforward but tedious algebra: 51  K  dm+  1  m=K  1−  φSjG = m=j  dm  =1 =j  Cj − dj − Pm= 1 Cβj −d 1 d e dCj β j dj β j dj  Cj 1 + β βd  (3.79) 1 β j dj  =  K  dm+  1  Pm(Cj ) − Pm(Cj ) m=j  dm  P C −d β j dj − m= 1 βj d e dCj β j dj − dj + Cj  (3.80) where Pm(y) is a polynomial, equal to m  1−  Pm(y) = =1 m  1 y + β βd  (3.81)  k pm k y .  =  (3.82)  k=0  We can evaluate the integral by first calculating the coeffi cients pm k using recursion. We then need to integrate the diff erence of two polynomials times an exponential, so the result is in the same form: m  pm k − k=0  y β j d j pm k y k e− Em dy = β j dj − dj + y  m  y  qkm − rkjm y k e− Em (3.83) k=0  where m  1 βd  Em = =1  −1  .  (3.84 )  We can calculate qkm and integrating successively lower powers of Cj , resulting in another recurrence. L ik ewise, the division and integration on rkjm can be expressed as a recurrence. Finally, we have  φSjG where   1  = Gj, β j dj  K =j (dj )e  −Fj  (Gj (d ) − Gj,  + =j+ 1  52  −1 (d    ))e−F   (3.85)  m  qkm − rkjm y k  (3.86)  dm − d βd  (3.87)  Gjm (y) = k=0  F = m=1  and  pm k   m−1 m−1  βm d−1 m pk−1 + dm (βm−1)pk = 1  0  qkm =  rkjm  Em (k + 0  m 1)qk+ 1  +  pm k  0 ≤ k ≤ m, 1 ≤ m ≤ K m=k=0 otherwise (3.88) 0 ≤ k ≤ m, 1 ≤ m ≤ K otherwise  (3.89)  jm jm   [Em (k+1) − dj (βj −1)] rk+ 1 +Em dj (βj −1)(k+2)rk+ 2 + βj pm k+ 1 = 0 ≤ k ≤ m − 1, 1 ≤ m ≤ K   0 otherwise (3.90)  For the general case, evaluating Gjim (·) tak es O(K) time per m for O K 2 time per i, j pair and O nK 3 overall. Note that calculating pm k once per i is suffi cient. In practice, this recursion is subject to several points of numerical failure due to tak ing diff erences between similar terms. For our calculations, we wrapped the result in a routine that verifies that the sum of the entries was suffi ciently close to 1, and that all of the individual entries were between 0 and 1. All the numerical errors we detected would cause these tests to fail. If they did, we reran the algorithm again, doubling the precision. This approach was quick and accurate enough for all the experiments we ran using it.  3.5  General Monte Carlo Algorithm  In this section, we develop a more general Monte Carlo algorithm to calculate φij when the priors π (λ ), = 1, 2, ..., K, can be sampled from, but 53  A lg o rith m 4 : Monte Carlo Calculation of φ. In p u t: A vector d of distances and a vector θ of prior parameters and sample size N . O u tp u t: A probability vector φ of partial memberships and matrix R giving ∂φj / ∂θ . K ← length(d), Np ← length(θ) φ ← 0K , R ← 0K×Np fo r n = 1 to N do dra w λ fro m π(θ) fo r j = 1 to K do φj ← φj + fj (λ, d) fo r = 1 to Np do Rj ← Rj + fj (λ, d)  ∂ log π λ , θ ∂θ  λ =λ  en d en d en d φ ← φ/ N , R ← R/ N re tu rn φ, R  when analytically evaluating equation (2.3) is impossible or computationally prohibitive. The simple Monte Carlo algorithm we propose can be applied to any equations for φ of the form: φj =  fj (λ, d)π(λ|θ)dλ.  (3.91)  In the case where we perturb the distance metric as described in section 3.1, we would have fj (λ, d) = I[λj dj ≤ λ d ∀ = j]  (3.92)  Fortunately, evaluating equation (3.91) using Monte Carlo is a simple application of importance sampling. B ecause j φj = 1, we sidestep the issue that importance sampling can only yield unnormalized answers simply by normalizing φ at the end. Thus, to approximate φ, we repeatedly: ˜ from the prior. 1. S ample a vector of K parameters λ 54  A lg o rith m 5 : Monte Carlo Calculation of φ for S caled Distance P erturbations. In p u t: A vector d of distances and a vector θ of prior parameters and sample size N . O u tp u t: A probability vector φ of partial memberships and matrix R giving ∂φj / ∂θ . K ← length(d), Np ← length(θ) φ ← 0K , R ← 0K×Np fo r n = 1 to N do dra w λ fro m π(θ) k ← argminj dj λj φk ← φk + 1 fo r = 1 to Np do Rk ← Rk +  ∂ log π λ , θ ∂θ  λ =λ  en d en d φ ← φ/ N , R ← R/ N re tu rn φ, R ˜ ˜ , ..., fK λ ˜ , f2 λ 2. E valuate f1 λ  .  3. Add it to the vector φ. 4 . Repeat steps 1 to 3 N times. 5. Normalize φ. O nce we have repeated this N times, we can normalize φ to get the final answer. Alternately, we can use the assumption that j fj (λ, d) = 1 and simply divide the entries of φ by N . For the optimization step, we can calculate the derivative of φ with respect to one of the hyperparameters of the prior – something advantageous when optimizing for the best hyperparameters – using the same technique on a similar expression:  55  ∂φj (d, θ) = ∂θ =  ∂ π(λ, θ)dλ ∂θ ∂ log π(λ, θ) f (λ, d) π(λ, θ)dλ ∂θ  f (λ, d)  (3.93) (3.94 )  π(λ,θ) , We can thus sample from π(λ) and weight the sample by f (λ, d) ∂ lo g∂θ again tak ing the average of N samples to get our estimate of the derivative. This approach motivates the general Monte Carlo algorithm for calculating the distance metrics outlined in algorithm (4 ). In algorithm (5), we apply this using equation (3.92) for f , giving a slightly more effi cient algorithm.  3.6  Ap p rox imation Algorithms  In many practical clustering validation problems, especially in dealing with a large number of clusters, a number of terms in the partial membership vector will probably be negligibly small. B eing able to exclude these irrelevant clusters from consideration beforehand may vastly improve computation speed and numerical accuracy, especially when using the shifted G amma prior because of its O nK 3 running time. We present here an accurate and easily computed bound that can be used exclude distances whose resulting value in φ will negligibly small. The central idea, presented formally in lemma 3.6.1, is to use the fact that the priors discussed above have support only for λ ≥ 1 to lower bound the value of φj with the integral over the region where dj λj has nonzero probability but d λ does not. After presenting the formal idea in the following lemma, we apply it to the location exponential prior and the shifted G amma prior. L e m m a 3 .6 .1. S u p p o se , fo r = 1, 2, ..., K, th a t π (λ ) = 0 ∀ λ < 1. L e t S ⊂ {1, ..., K} a n d j ∈ / S, a n d le t κ = argmin dk .  (3.95)  k∈S  T h en  ∞  φLkE ≤ k∈S  πj (λj )dλj dκ / d j  P ro o f. Now S c = {1, ..., K} − S. Then, for each k ∈ S,  56  (3.96)  φLkE =  π (λ )1{dk λk <  πk (λk )  d λ } dλ  dλk  (3.97)  =k  =  πk (λk )πj (λj )1{dk λk <  π (λ )1{dk λk <  dj λj } dλj  d λ } dλ  ∈S\k  ×  πm (λm )1{dk λk <  dm λm } dλm dλk  (3.98)  m∈S c \j  ≤  πk (λk )πj (λj )1{dk λk <  dj λj } dλj  π (λ )1{dk λk <  d λ } dλ  ∈S\k  ×  πm (λm )dλm dλk  (3.99)  m∈S c \j  =  πk (λk )πj (λj )1{dk λk <  π (λ )1{dk λk <  dj λj } dλj  d λ } dλ  dλk  ∈S\k  (3.100) where the inequality holds as integration over the removed indicator functions can only increase the area of integration and thus increase the area of integration. Now none of the priors has support on (−∞, 1), so mink ∈S dk ≤ mink ∈S dk λk ≤ dk λk ∀ k ∈ S, so  φLkE ≤  πk (λk )πj (λj )1{dk λk <  dj λj } dλj  k∈S  k∈S  ×  1{dk λk <  d λ }π  (λ )dλ dλk  (3.101)  ∈S\k  ≤  πk (λk )πj (λj )I[dκ < dj λj ]dλj k∈S  ×  1{dk λk < ∈S\k  We can now separate the integral:  57  d λ }π  (λ )dλ dλk .  (3.102)  φLkE ≤  πj (λj )I[dκ < dj λj ]dλj  k∈S  ×  π (λ )dλ ∩  k∈S  ∈S\k {λS  (3.103)  : dk λk < d λ } ∈S  where λS = (λk1 , λk2 , ..., λk|S| ) for S = k1 , k2 , ..., k|S| . The derivation is complete if the second term is ≤ 1, which we now show. We do by showing that the area of integration of each of the integrals under sum is disjoint from all the other integrals, giving us an upper bound of one for the sum. Formally ∀ k1 , k2 ∈ S, k1 = k2 , ∩  ∈S\k1  {λS : dk1 λk1 < d λ } ∩ ∩  ∈S\k2  {λS : dk2 λk2 < d λ }  (3.104 )  ⊆ {λS : dk1 λk1 < dk2 λk2 } ∩ {λS : dk2 λk2 < dk1 λk1 }  (3.105)  =∅  (3.106)  B ecause each area of integration is mutually disjoint, we can incorporate the sum over k directly into the integral as a union of the sets. Thus  φLkE (d, θ) ≤  πj (λj )I[dκ < dj λj ]dλj  k∈S  ×  π (λ )dλ ∪k∈S ∩  ≤  ∈S\k {λS  (3.107)  : dk λk < d λ } ∈S  πj (λj )I[dκ < dj λj ]dλj  π (λ )dλ  (3.108)  ∈S  =  πj (λj )I[dκ < dj λj ]dλj  (3.109)  ∞  =  πj (λj )dλj ,  (3.110)  dκ / d j  which completes the proof.  3.6 .1  E rror B ou nd s for the L oc ation E x p onential P rior  L emma 3.6.1 allows us to easily upper bound the probability mass in a subset of the elements of φ when the corresponding distances are suffi ciently 58  large. In essence, we are arguing that the error from setting a suffi ciently large distance to infinity, and thus ignoring it, is negligible. T h e o re m 3 .6 .2 . L e t π (λ ) = LocE xp(1, β ), ε > 0. L e t j ∈ {1, 2, ..., K}. If k : dk ≥ dj 1 −  S=  = 1, 2, ..., K, a n d su p p o se  log ε θj  ,  (3.111)  th e n φLkE ≤ ε  (3.112)  k∈S  P ro o f. L et κ = argmin dk .  (3.113)  k∈S  From lemma 3.6.1 we have that ∞  φLkE ≤ k∈S  πj (λj )dλj  (3.114 )  dκ / d j  Now  dκ ≥ dj 1 − ⇒ θj ⇒ exp −θj  dκ −1 dj  log ε θj  (3.115)  ≥ − log ε  (3.116)  ≤ε  (3.117)  πj (λj )dλj ≤ ε  (3.118)  φLkE ≤ ε  (3.119)  dκ −1 dj  ∞  ⇒ dκ / d j  ⇒ k∈S  59  3.6 .2  E rror B ou nd s for the S hifted Gamma P rior  T h e o re m 3 .6 .3 . L e t π (λ ) = LocE xp(1, β ), ε > 0. L e t j ∈ {1, 2, ..., K}. If  = 1, 2, ..., K, a n d su p p o se  −1 S = k : dk ≥ dj FGa (1 − ε) + 1 ,  (3.120)  −1 w h e re FGa is th e in v e rse cd f o f th e G a m m a d istrib u tio n , th e n  φSkG ≤ ε  (3.121)  k∈S  P ro o f. L et κ = argmin dk .  (3.122)  k∈S  L et FS−1 G be the inverse cdf of the shifted G amma distribution, defined in equation (3.64 ). Then  ⇒ 1 − FS G  dκ −1 ≥ FG−1 a (1 − ε) + 1 = FS G (1 − ε) dj dκ ≤ε dj  (3.123) (3.124 )  From lemma 3.6.1 we have that ∞  φSkG ≤ k∈S  πj (λj )dλj  (3.125)  dκ / d j  = 1 − FS−1 G  dκ , dj  (3.126)  so the theorem is proved. These theorems allow us to maintain a guaranteed level of accuracy while, in some cases, asymptotically improving the running time. Many fl avors of k-means for large scale clustering use data structures lik e kd-trees to avoid computing distances from points to irrelevant centroids. In such cases, the above bound allows us to worry only about the centroids within a specified radius – often called a range query – which kd-trees are easily able to handle.  60  3.7  Ad d itional P rop erties of the P ointw ise S tab ility  S ometimes it may be easier to directly calculate the pointwise stability instead of the entire partial membership. This can be particularly true if, for instance, the integration has to be done numerically using more expensive approximation methods such as those described in chapter 7. Using the theorems we present here, a good estimate of the pointwise stability can be obtained quite easily. Additionally, and perhaps more importantly, this theorem can be used to prove other desirable properties of the pointwise stability measure. We thus present an additional theorem using this one that shows the pointwise stability behaves as one would expect if one of the distance measures changes.  3.7 .1  D iff erenc es B etw een tw o P ointw ise S tab ility T erms  We derive results for the slightly more general case of the diff erence between any two points; the pointwise stability is then just a special case. Note that in practice it will probably easier to just evaluate this directly from the averaged assignement matrix; however, this version is useful as a tool to prove several theorems later on. T h e o re m 3 .7 .1. S u p p o se dj < dk . T h e n ψj (d, θ) − ψk (d, θ) =  π(λ, θ)dλ  (3.127)  dj d λk ≤ λj ≤ λ ∀ = j dk dj  (3.128)  Rjk (d)  w h e re Rjk (d) =  λ:  P ro o f. We begin by combining the common terms in the integral:  ψj (d, θ) − ψk (d, θ) =  π(λ) I{dj λj ≤ d λ ∀ = j}dλ −  =  π(λ) I{dk λk ≤ d λ ∀ = k}dλ (3.129)  π(λ) I{dj λj ≤ d λ ∀ = j} − I{dk λk ≤ d λ ∀ = k} dλ 61  (3.130)  To translate the indicator functions into sets, we define Sjk = {λ : dj λj ≤ d λ ∀ = j, k}  (3.131)  ⇒ Skj = {λ : dk λk ≤ d λ ∀ = j, k}  (3.132)  Tjk = {λ : dj λj ≤ dk λk }  (3.133)  ⇒ Tkj = {λ : dk λk ≤ dj λj }  (3.134 )  We can express the indicator functions in equation (3.130) using these sets:  ψj (d, θ) − ψk (d, θ) =  π(λ) 1Sjk ∩Tjk − 1Skj ∩Tkj dλ  (3.135)  O ur intent now is to isolate the regions of Sjk ∩ Tjk and Skj ∩ Tkj where the integral will return the same answer. We begin by break ing Sjk ∩ Tjk apart as follows. Recall that dk ≥ dj , so we can split T , which upper bounds λj at d / dj , along λj by break ing it at d / dk :  Sjk ∩ Tjk = =  =  λ : λj ≤  d λ ∀ = j, k dj  dk λk (3.136) dj d d λ ≤ λj ≤ λ ∀ = j, k dk dj  ∩ λ : λj ≤  d λ ∀ = j, k ∪ dk dk ∩ λj ≤ λk (3.137) dj d dk λj ≤ λ ∀ = j, k ∩ λj ≤ λk dk dj d dk d λ ≤ λj ≤ λ ∀ = j, k ∩ λj ≤ λk . ∪ dk dj dj (3.138) λj ≤  We split the second set using the same technique:  62  Sjk ∩ Tjk =  d λ ∀ = j, k dk dj dj dk λk ≤ λj ≤ λk ∩ λj ≤ λk ∪ dk dk dj d dk d λ ≤ λj ≤ λ ∀ = j, k ∩ λj ≤ λk dk dj dj  λj ≤  ∪  (3.139) dj d λ ∀ = j, k ∩ λj ≤ λk dk dk dj d dk λj ≤ λ ∀ = j, k ∩ λk ≤ λj ≤ λk dk dk dj dk d d λ ≤ λj ≤ λ ∀ = j, k ∩ λj ≤ λk dk dj dj  λj ≤  =  ∪ ∪  . (3.14 0)  We can simplify and combine the second and third line:  Sjk ∩ Tjk =  ∪  =  λj ∪  =  dj d λ ∀ = j, k ∩ λj ≤ λk dk dk dj d λk ≤ λj ≤ λ ∀ = j, k dk dk d d λ ≤ λj ≤ λ ∀ = j, k ∪ dk dj dk ∩ λj ≤ λk dj dj d ≤ λ ∀ = j, k ∩ λj ≤ λk dk dk dj d dk λk ≤ λj ≤ min λ , λk ∀ = j, k dk dj dj dj d ≤ λ ∀ = j, k ∩ λj ≤ λk dk dk dj d λk ≤ λj ≤ λ ∀ = j . dk dj  λj ≤  λj ∪  The lemma follows directly from observing that 63  (3.14 1)  (3.14 2)  (3.14 3)  n  o n o π(λ)dλ d d λj ≤ d λ ∀ =j,k ∩ λj ≤ d j λk k  k  =  n  o π(λ)dλ o n d d λk ≤ d λ ∀ =j,k ∩ λk ≤ d j λj k  (3.14 4 )  k  as the variables of integration λj and λk can be switched without aff ecting the value of the integral. Thus we have that π(λ) 1Sjk ∩Tjk − 1Skj ∩Tkj dλ  (3.14 5)  π(λ)dλ  = R(d)  +  {(· · · , λj , · · · , λk , · · · ) : (· · · , λk , · · · , λj , · · · ) ∈ Skj ∩ Tkj }  −  π(λ)dλ  π(λ)dλ (3.14 6)  Skj ∩Tkj  π(λ)dλ,  =  (3.14 7)  R(d)  which completes the proof. The result of theorem 3.7.1 is most useful if it can apply to arbitrary φj and φk , without the restriction that dj ≤ dk . To cover this case, we present the following corollary. C o ro lla ry 3 .7 .2 . S u p p o se dj ≥ dk . T h e n ψj (d, θ) − ψk (d, θ) = −  π(λ, θ)dλ  (3.14 8)  dk d λj ≤ λk ≤ λ ∀ = k dj dk  (3.14 9)  Rkj (d)  w h e re Rkj (d) =  λ:  P ro o f. Trivially, note that ψj (d, θ) − ψk (d, θ) = − [ψk (d, θ) − ψj (d, θ)] .  (3.150)  Thus we can apply the result of theorem 3.7.1, tak ing the negative of the result to get the corollary. 64  3.7 .2  B ehav ior of the P ointw ise S tab ility  In this section, we investigate how the pointwise stability changes given changes in the distances between points and centroids. These proofs, in essence, are a sanity check that what it measures is sensible. In particular, we prove that if the distance to centroid j decreases while all the other distances remain fixed, then the diff erence between φj and any of the other φ· ’s increases or remains constant. This is what we would expect, given that such a move should only increase the stability of a point. S imilarly, if the distance to another centroid k, k = j, is decreased, than φj − φ decreases or stays the same for = j. We present these formally in the following theorems. T h e o re m 3 .7 .3 . L e t d = (d1 , d2 , ..., dK ) be a v ec to r o f ce n tro id -to -p o in t d ista n ce s, a n d le t ψ be a n y sta b ility m ea su re ba sed o n sca led d ista n ce p e rtu rba tio n s. L e t j, k be a n y in d e x in {1, 2, ..., K} su c h th a t dj < dk . S u p p o se 0 < ε < dj , a n d le t d = (d1 , ..., dj − ε, ..., dK ).  (3.151)  ψj (d , θ) − ψk (d , θ) ≥ ψj (d, θ) − ψk (d, θ)  (3.152)  T h en  P ro o f. The proof follows directly from theorem 3.7.1. B y theorem 3.7.1, we have that ψj (d , θ) − ψk (d , θ) − [ψj (d, θ) − ψk (d, θ)] π(λ|θ)dλ −  = Rjk (d )  π(λ|θ)dλ (3.153) Rjk (d)  where  Rjk (d) = Rjk (d ) =  dj d λk ≤ λj ≤ λ ∀ = j dk dj dj − ε d λ: λk ≤ λj ≤ λ ∀ =j dk dj − ε λ:  (3.154 ) (3.155)  Now dk > 0, so dj dj − ε < . dk dk 65  (3.156)  S imilarly, d > 0 so d d > dj − ε dj  (3.157)  Rjk (d) ⊂ Rjk (d )  (3.158)  Thus  This allows us to combine the two integrals in equation (3.153):  ψj (d , θ) − ψk (d , θ) − [ψj (d, θ) − ψk (d, θ)] =  π(λ|θ)dλ. Rjk (d )−Rjk (d)  (3.159) H owever, π(λ|θ) ≥ 0 ∀ λ, so the value of the integral is nonnegative. Thus ψj (d , θ) − ψk (d , θ) − [ψj (d, θ) − ψk (d, θ)] ≥ 0,  (3.160)  proving the theorem. This method of proof can be repeated to get additional inequalities that cover all the possible cases where a single distance metric changes. Together, this gives a fairly solid sanity check that the pointwise stability is not going to do anything too surprising. Note also that the converse of all of these theorems apply as well for suffi ciently small ε; while we discuss specifically the cases for shrink ing a distance metric by a value ε, the reverse inequality holds for the case of increasing the distance metric as well. For the above theorem, we present this as another corollary. C o ro lla ry 3 .7 .4 . L e t d = (d1 , d2 , ..., dK ) be a v ec to r o f ce n tro id -to -p o in t d ista n ce s, a n d le t ψ be a n y sta b ility m ea su re ba sed o n sca led d ista n ce p e rtu rba tio n s. L e t j, k be a n y in d e x in {1, 2, ..., K} su c h th a t dj < dk . S u p p o se dj < ε < dk , a n d le t d = (d1 , ..., dj + ε, ..., dK ).  (3.161)  ψj (d , θ) − ψk (d , θ) ≤ ψj (d, θ) − ψk (d, θ)  (3.162)  T h en  66  P ro o f. This theorem is simply a restatement of theorem 3.7.3. If we swap the prime notation on the distance vectors and assume we were given dj + ε instead of dj , we have theorem 3.7.3 exactly. The proof of that theorem is suffi cient. We can now continue with the other 2 cases, changing k and changing = j, k. T h e o re m 3 .7 .5 . L e t d = (d1 , d2 , ..., dK ) be a v ec to r o f ce n tro id -to -p o in t d ista n ce s, a n d le t ψ be a n y sta b ility m ea su re ba sed o n sca led d ista n ce p e rtu rba tio n s. L e t j, k be a n y in d e x in {1, 2, ..., K} su c h th a t dj < dk . S u p p o se dj < ε < dk , a n d le t d = (d1 , ..., dk − ε, ..., dK ).  (3.163)  ψj (d , θ) − ψk (d , θ) ≤ ψj (d, θ) − ψk (d, θ)  (3.164 )  T h en  P ro o f. Again, the proof follows directly from theorem 3.7.1. B y theorem 3.7.1, we have that  [ψj (d, θ) − ψk (d, θ)] − ψj (d , θ) − ψk (d , θ) π(λ|θ)dλ −  = Rjk (d )  π(λ|θ)dλ (3.165) Rjk (d)  where  Rjk (d) = Rjk (d ) =  dj d λk ≤ λj ≤ λ ∀ = j dk dj dj d λk ≤ λj ≤ λ ∀ = j λ: dk − ε dj λ:  (3.166) (3.167)  Now dk > 0, so dj dj < . dk − ε dk  (3.168)  Rjk (d ) ⊂ Rjk (d)  (3.169)  Thus  67  This allows us to combine the two integrals in equation (3.165):  [ψj (d, θ) − ψk (d, θ)] − ψj (d , θ) − ψk (d , θ) =  π(λ|θ)dλ. Rjk (d)−Rjk (d )  (3.170) H owever, π(λ|θ) ≥ 0 ∀ λ, so the value of the integral is nonnegative. Thus [ψj (d, θ) − ψk (d, θ)] − ψj (d , θ) − ψk (d , θ) ≥ 0,  (3.171)  proving the theorem. C o ro lla ry 3 .7 .6 . L e t d = (d1 , d2 , ..., dK ) be a v ec to r o f ce n tro id -to -p o in t d ista n ce s, a n d le t ψ be a n y sta b ility m ea su re ba sed o n sca led d ista n ce p e rtu rba tio n s. L e t j, k be a n y in d e x in {1, 2, ..., K} su c h th a t dj < dk . S u p p o se 0 < ε, a n d le t d = (d1 , ..., dk + ε, ..., dK ).  (3.172)  ψj (d , θ) − ψk (d , θ) ≥ ψj (d, θ) − ψk (d, θ)  (3.173)  T h en  P ro o f. Again, this theorem is simply a restatement of theorem 3.7.5. If we swap the prime notation on the distance vectors and assume we were given dk + ε instead of dk , we have theorem 3.7.5 exactly. The proof of that theorem is suffi cient. T h e o re m 3 .7 .7 . L e t d = (d1 , d2 , ..., dK ) be a v ec to r o f ce n tro id -to -p o in t d ista n ce s, a n d le t ψ be a n y sta b ility m ea su re ba sed o n sca led d ista n ce p e rtu rba tio n s. L e t j, k be a n y in d e x in {1, 2, ..., K} su c h th a t dj < dk , a n d le t m be a n y in d e x o th e r th a n j o r k. S u p p o se 0 < ε < dm , a n d le t d = (d1 , ..., dm − ε, ..., dK ).  (3.174 )  ψj (d , θ) − ψk (d , θ) ≤ ψj (d, θ) − ψk (d, θ)  (3.175)  T h en  P ro o f. Again, the proof follows directly from theorem 3.7.1. B y theorem 3.7.1, we have that  68  [ψj (d, θ) − ψk (d, θ)] − ψj (d , θ) − ψk (d , θ) π(λ|θ)dλ −  = Rjk (d )  π(λ|θ)dλ (3.176) Rjk (d)  where  Rjk (d) = Rjk (d ) =  dj dm d λm λk ≤ λj ≤ λ ∀ = j, m and λj ≤ (3.177) dk dj dj dj d dm − ε λk ≤ λj ≤ λ ∀ = j, m and λj ≤ λm λ: dk dj dj (3.178) λ:  Note that we have explicitly highlighted the terms with m. Now dj , dm > 0, so dm − ε dm < . dj dj  (3.179)  Rjk (d ) ⊆ Rjk (d)  (3.180)  Thus  This allows us to combine the two integrals in equation (3.176):  [ψj (d, θ) − ψk (d, θ)] − ψj (d , θ) − ψk (d , θ) =  π(λ|θ)dλ. Rjk (d)−Rjk (d )  (3.181) H owever, π(λ|θ) ≥ 0 ∀ λ, so the value of the integral is nonnegative. Thus [ψj (d, θ) − ψk (d, θ)] − ψj (d , θ) − ψk (d , θ) ≥ 0,  (3.182)  proving the theorem. C o ro lla ry 3 .7 .8 . L e t d = (d1 , d2 , ..., dK ) be a v ec to r o f ce n tro id -to -p o in t d ista n ce s, a n d le t ψ be a n y sta b ility m ea su re ba sed o n sca led d ista n ce p e rtu rba tio n s. L e t j, k be a n y in d e x in {1, 2, ..., K} su c h th a t dj < dk , a n d le t m be a n y in d e x o th e r th a n j o r k. S u p p o se 0 < ε, a n d le t d = (d1 , ..., dm + ε, ..., dK ). 69  (3.183)  T h en ψj (d , θ) − ψk (d , θ) ≥ ψj (d, θ) − ψk (d, θ)  (3.184 )  P ro o f. Again, this theorem is simply a restatement of theorem 3.7.7. If we swap the prime notation on the distance vectors and assume we were given dm + ε instead of dm , we have theorem 3.7.7 exactly. The proof of that theorem is suffi cient.  3.8  E x tensions to O ther Ind ic es  We wrap up this chapter by tying up some loose connections to chapter 2. As mentioned in section 2.2.3, we can naturally incorporate other stability indices into our method. We continue this extension in this section by deriving the gradient of each of these indices with respect to the prior parameters. This allows us to effi ciently tune the priors using these methods as well should we desire to use them.  3.8 .1  O p timiz ing A R  and V I  ov er θ  In finding the θ that maximizes A R (or V I ), or a linear combination thereof, it speeds the computation up immensely if we k now the gradient of A R (V I ) with respect to θ. K nowing this allows us to use much more sophisticated and powerful optimizers. In this section, we calculate the gradient of A R and V I . Ultimately, both of these depend on first calculating ∂φLjE / ∂θ for each row of φ; we derive an algorithm in the next section θ =θ  to do this effi ciently in amortized constant time per j, pair. For the H ubert-Arabie Adjusted Rand Index,  ∇θ A R  = ∇θ = ∇θ  j 1 2 1 2 (B  j  p2j −  p2j +  p  A − BB + B ) − BB  j 2  p2j  −  p j  pj 2  2  p  2  (3.185) (3.186)  where, to mak e the calculation easier, we’ve defined the following terms:  70  A=  p2j  j  B=  j  B =  (3.187)  p2j p  (3.188)  2  (3.189)  Continuing:  ∇θ A R  =  1 2 (B  1 + B ) − BB  1 −B A R 2  ∇θ A − B +  ∇θ B (3.190)  Now ∇θ p2j =  ∇θ A = j  ∇θ B =  ∇θ p  2  =  1 n  1 n  2pj ∇θ pj  (3.191)  j  2p ∇θ p  (3.192)  where ∇θ φi  ∇θ pj =  (3.193)  i:ai j =1  ∇θ φi  ∇θ p =  (3.194 )  i  Thus we can calculate ∇θ A R given the partial derivatives of φj with respect to the prior parameters θ . The V ariation of Information is similar:  ∇θ V I    = ∇θ − =−  p log p − 2  pj log pj −  j  j  log p + 1 +  K ∇θ p − 2 p  j   pj  pj log pj p  (3.195) pj log + 1 ∇θ pj pj p (3.196)  71  Again, using equations (3.193) and (3.194 ), this can be calculated easily given ∂φLjE / ∂θ .  72  Chap ter 4  S y nthetic D ata for Clu ster V alid ation T ests Recall from section 1.2.1 that the canonical way to compare cluster validation procedures is to test how well they predict the true number of clusters in a k nown dataset. S uch an analysis, however, has limits. O ne is that it depends on the type of data. Another is that it depends on the quality of data. Issues such as missing data, truncated features, and contaminations points, and the methods used to control them, can have a significant infl uence on the results. We thus propose a method to control these issues and abstract them, as much as possible, from the process of testing clustering validation methods. Many of the papers proposing or describing cluster validation methods only compare them on a handful of datasets. This is largely due to two factors. First, there are few robust methods for generating “ diffi cult” synthetic data with a k nown number of clusters, and many of these do not work in high dimensions. S econd, most real datasets are for classification purposes; while sometimes people ignore the labels and compare methods based on how well they recreate the k nown partitions, most of these datasets do not have classes that separate well into clusters. We thus felt a method to generate non-G aussian, “ diffi cult” datasets with a k nown number of clusters would be a valuable and useful tool for the clustering community. A reasonable collection of synthetic datasets used to compare the accuracy of clustering validation techniques should meet several requirements. First, each dataset should have Ktru e distinct clusters. S econd, it should contain datasets with a variety of types – i.e. cluster shapes1 that deviate 1  By shape, we mean a level set or contour curve of the joint distribution. A Gaussian,  73  non-trivially from a “ nice” G aussian distribution. We thus propose a procedure to create a multimodal distribution from which such synthetic datasets can be drawn. The procedure tak es as input the dimension of the space, the desired number of modes Ktru e , and a handful of interpretable tuning parameters that control the separation and shape of the modes. O ur procedure also includes a verification step that ensures, with reasonably high probability, that there are exactly Ktru e distinct modes in the mixture model. At a high level, our procedure consists of five steps, each of which we will outline in more detail in the following sections. 1. Choose input parameters. These include a closeness index β governing the required separation between components and parameters governing the shape of the component distributions. 2. S et the locations and average variance of the mixture components based on the specified dimension, number of components, and required separation between components. 3. Draw the individual weights and variances of each component based on the average variance and input parameters governing the spread of the weights and variances. 4 . S hape each component using a sequence of randomly drawn, invertible transforms. The distribution of each component is based on a symmetric G aussian distribution; this set of transforms scales, rotates, and translates the coordinate system the distribution is based on to reshape the component distribution. This allows significant diversity in component shapes while preserving the modal structure of the mixture model. 5. Adjust the components to ensure that all Ktru e (Ktru e − 1) pairs of component distributions are suffi ciently well separated. In the following sections, we begin by discussing several other procedures for generating synthetic data and how our method is distinctive. We then define some of the basic notation and equations used throughout the procedure. S ection 4 .3 describes creating the base component distributions (steps 2-3), and section 4 .4 describes shaping these components (step 4 ). In section 4 .5 we describe the verification process (step 5). S ection 4 .6 describes for ex ample, would have a hyperspherical or hyperelliptical shape.  74  sampling from the mixture model produced by these steps. We end with a summary of the input parameters to the procedure and a brief description of each.  4 .1  R elated W ork  While numerous papers mention generating synthetic data from mixture models, the vast majority sample data from a mixture model of standard distributions and with visual inspection or no check ing at all to ensure the components are not overlapping. The one exception we found was a synthetic data generator implemented by P ei and Z aiane [P Z ]. To generate synthetic data with a specified number of components, they start with complex shapes – e.g. letters, sqaures, etc. – and then locate them in the space, ensuring they do not overlap. These shapes are then filled with a uniform sampling of points. Additionally, clusters can come from G aussians, and noise points can be added. They provide five levels of diffi culty determined by geometric configurations and cluster shapes. O ur method has several advantages over this approach. The first is that in using a random transformation approach coupled with a verification and adjustment process, the complex shapes of our method are more random and arguably more representative of real data than those in [P Z ]. Furthermore, the pdf of our method can easily be realized, while this is more diffi cult in their method. Finally, the cluster shaping and diffi culty are continuously tunable parameters, so the user has more control over the resulting mixture model.  4 .2  D efi nitions  The general procedure of our algorithm is to start with a base distribution – here a standard Normal – and reshape it using a sequence of non-linear transformations. The base distribution, which we denote here as g, is common between all the components. The transformation sequence – which we compose into a single function denoted by Tj – varies between components. L ik ewise, the component variance σj , component location µj , and component weight wj also vary between components. P utting all these together, we can define the final pdf for mixture component j as hj (x) = Dx (Tj )gj Tj 75  x − µj σj  (4 .1)  Combining all these together into the pdf of the full mixture model gives us Ktru e  h(x) =  wj hj (x).  (4 .2)  j=1  where wj are the weights assigned to each component.  4 .2 .1  Comp onent S ep aration and P rox imity  As mentioned, one of the distinctive features of our procedures is that the user specifies the diffi culty of the problem in part by specifying the minimum separation between components. We quantify this in terms of a proximity index Sjk , where a proximity of 0 indicates maximal separation and a proximity of 1 indicates that the modes of the two components are indistinguishable. The user sets the separation parameter β, and our procedure guarantees that all component pairs have a proximity index of at most β. We define the proximity Sjk between components j and k in terms of the probability density on the line between the two components. Formally, 1  Sjk =  min 1,  hjk uµj + (1 − u)µk γjk  0  du  (4 .3)  where the mixture pdf hjk of the jth and kth components is hjk (x) =  wj hj (x) + wk hk (x) . wj + wk  (4 .4 )  γjk is the minimum between the values of the pdf at the two centroids, given by γjk = min hjk µj , hjk (µk )  (4 .5)  Note that Sjk = 0 requires the pdf between the the j and k mixture components to be identically 0. We assume that µj is the mode of component j – our procedure is designed to ensure this – so Sjk = 1 indicates that there are not two distinct modes in hjk (x). Formally, the condition we guarantee our cluster distribution to have in the end is min Sjk ≤ β j,k  (4 .6)  We do this by first setting the variances of each of the components so this condition is met. Then, after we shape the components – a procedure which 76  may result in component pairs violating equation (4 .6) – we iteratively shrink the variances of subsets of the components until the resulting mixture model satisfies it.  4 .3  S tage 1 : P retransformed Comp onent D istrib u tions  The base distribution of each component is a normal2 with a randomly drawn variance and a location determined by a random process described below. The primary challenge is to choose the component locations and average cluster variance in a way that pack s the components together in a geometrically reasonable way. B y geometrically reasonable, we mean avoiding a component configuration in which two components are so close relative to the rest that it is conceptually ambiguous whether data drawn from them should be regarded as one cluster or two. In this ambiguous case, while the variance of the two close components can be adjusted to ensure they are well separated, it mak es the separation parameter less meaningful as a summary of the overall structure of the mixture model since it refers to an isolated case. As an analogy, one could argue that the minimum value out of a set of samples carries more information about the distribution if the sampling distribution is k nown not to generate outliers. Ideally, then, we want a configuration in which the many of the cluster pairs have a separation close the required separation parameter.  4 .3.1  Choosing L oc ations for the Comp onents  P ractically, there are several options. O ne viable option is to generate a random dataset from a single normal distribution, then cluster it using kmeans with the desired number of centroids. The resulting cluster locations would then be the new component centers. This would ensure that the clusters are positioned with reasonably good separation. H owever, this may also tend to position the components too evenly, a possibly undesirable eff ect (though one could severely limit the number of iterations, or number of points in the test dataset, to prevent it). The current version of our software chooses the cluster centers by iteratively refining a set of candidate component centers. It starts with a 2  W hile we use the normal as our base distribution, in theory any symmetric, unimodal distribution would work .  77  list M1 = M11 , M12 , ..., M1Nb of Nb candidate sets M1i , where M1i = {µ1i 1 , µ1i 2 , ..., µ1i K } with each element consisting of K centers drawn from a spherical normal distribution. The algorithm then iteratively refines this list a specified number of times using a type of max-min criteria. At each iteration, it chooses the set of centers from the list having the largest minimum distance between centers, then forming a new list of sets of centers based on that set. L et Mt = Mt1 , Mt2 , ..., MtNb be the list of sets at iteration t. We then choose a set Mt from this as Mt = argmax M∈Mt  min  µ1 ,µ2 ∈Mt  µ1 − µ2  (4.7)  2  Mt+1 = Mt ∪ newCentersSetList (Mt )  (4.8)  where newCentersSetList generates a new list of sets of centers by randomly replacing one of the centers in the pair of two closest points – breaking that pair – with a new point randomly drawn. By choosing the next Mt+1 from the union of a new, randomly generated list and Mt , we ensure that at each iteration our locations of centers do not get any worse. Furthermore, if the smallest inter-center distance is a significant outlier, the probability that an iteration increases this statistic is quite high as it is easy to find an alternate location. We repeat this iterative step a specified number of times.  4.3.2  Setting the Mean of the Initial Cluster Variance Distribution  Recall that the mean cluster variance is not one of the user specified parameters; rather, this must be determined by the user-set minimum separation parameter β. We choose the variance of the clusters by initially setting all base components – at this stage, they are symmetric Normals – to a common variance determined by distance between the two closest points and the minimum separation parameter given by the user. From equation (4.6), this means we set σavg such that Sjk = β where j, k = argmin µj − µk  2  (4.9)  j ,k  where Sjk is the separation parameter of symmetric Normals with variances equal to σavg and locations given by µj and µk . Formally, this translates into determining σavg by numerically solving the following equation for σavg : 78  1 − 2F − σadv g d σa v g  f (0) + f − σadv g  =β  (4.10)  where d is the distance between the two components, f is the pdf and F is the cdf of the standard normal distribution. P ractically, this is done numerically using an iterative bisection algorithm. O nce the mean of the cluster variance distribution is set, we sample the variance of individual components as described in the next section. Note that these variances may be further tuned to ensure that the condition given in equation (4.6) holds for the final mixture model.  4.3.3  Ind iv id ual Mix ture Com p onent Settings  T here are two input parameters that control the spread of the individual component weights and variances, sw and sσ . A fter the component locations 2 and σavg are set, we draw the individual cluster variances from a distribu2 tion indexed by σavg and an input parameter, sσ , that governs the spread 2 around σavg of the individual component variances. Likewise, the component weights are drawn from a distribution indexed by an input parameter sw that governs the standard deviation of the component weights around 1/Ktru e . T he input parameter sσ governs the spread of the component variances. Formally, we use the model σj = σavg σj , where σj is a random variable from a G amma distribution with E σj = 1 and V ar σj = s2σ . Formally, 1 2 ,s s2σ σ [σ1 , σ2 , ..., σK ] = σavg [σ1 , σ2 , ..., σK ]  ∼ Gamma (σj 1 , σj 2 , ..., σj K ) iid  (4.11) (4.12)  Similarly, the input parameter sw governs the spread of the component weights around the mean weight, 1/Ktru e . P ractically, we draw the weights from a G amma distribution with mean 1 and variance s2w and then normaliz e the resulting distribution. H owever, this is the same as a Dirichlet theorem, as it can be proved that if Yi ∼ Gamma(αi , 1) and 79  (4.13 )  n  Yi  V =  (4.14)  i=1  then Y1 Y2 Yn , ,··· , V V V  ∼ D irichlet(α1 , α2 , ..., αn )  (4.15)  H ere, then, drawing the weights is formally the same as drawing them from a Dirichlet distribution with parameters [Ktru e /s2w , ..., Ktru e /s2w ]. T hus (w1 , w2 , ..., wKtru e ) ∼ D ir  Ktru e Ktru e ,··· , 2 2 sw sw  (4.16)  U nlike the component variances, which may be adjusted later, the parameters drawn here are the final weights for the mixture components.  4.4  Stage 2: Shap ing the Com p onents  While other methods use complex base shapes to get complex structure in the resulting distribution, our approach is to generate a sequence of invertible transformation functions that each operate on one or two dimensions of the input vector. T here are three classes of transformations, rotation, scaling, and translation. Rotations rotate the two components a random amount; scaling transformations scale one component, and translations add a value determined by one component to the other. T hese pair of dimensions that each of these transformations is applied to is chosen at random, and the output of one transformation is fed to the next. Shaping the components using these methods is a distinctive of our method, and we describe each of them in detail below. T he four user parameters governing this process are τro tatio n , τsc alin g , τtran slatio n , and α. T hese are, respectively, the average number of rotations, scalings, and translations per dimension component. Recall that p denotes the dimension. Specifically, there are τro tatio n p/2 rotations, τsc alin g p scaling transformations, and τtran slatio n p translations in total. T he order and the components that each operate on are chosen randomly. E ach of these transformation types (except rotations) take a severity parameter, α, that ranges between 0 and 1. When α = 0, the transformations have no eff ect on the component shapes, but higher values of α cause increasing changes in the shape.  80  4.4.1  F orm al N otation  Formally, the entire operation can be seen as a sequence of nested functions. For notational conciseness, we omit the index of the mixture model component. T he composite transformation function T (x) is given by T (x) = Tr (Tr−1 (· · · T2 (T1 (x)) · · · )) = (Tr ◦ Tr−1 ◦ · · · ◦ T2 ◦ T1 )(x)  (4.17) (4.18)  where Tt is the tth transformation function. T he distribution function after this set of transformations is then h(x) = Dx T (x)g(T (x))  (4.19)  = Dyr−1 (Tr ) · Dyr−2 (Tr−1 ) · · · Dy1 (T2 ) · Dx (T1 )g(T (x))  (4.20)  = Dx (T )g(T (x))  (4.21)  where Dz (f ) is the determinant of the J acobian of f evaluated at z, yt is the location of the current value after the first t evaluations in T , i.e. yt = (Tt ◦ Tt−1 ◦ · · · ◦ T2 ◦ T1 )(x)  (4.22)  and Dx (T ) collects the sequence of J acobians into a single term.  4.4.2  A ccep table T ransform ation F unctions  O bviously, choosing classes of transformation functions must be done carefully. P erhaps the most important criteria is that our aggregate transformation function needs to preserve the unimodal structure of the distribution; otherwise, the final mixture model will not have the correct number of modes. Second, the reshaping operation must not be undesirably severe. O ur method ensures that the first criteria is met by drawing each individual transformation from classes that will preserve the unimodal structure of the distribution and, with reasonably high probability, cannot yield a sequence or ordering that destroys the unimodal structure of the component.3 We meet the second criteria by admitting only transformations that pass several tests. Denote the composite transformation sequence at step t as 3  We believe that it is impossible to sequence the possible set of transformations in such a way, but have not proved this rigorously.  81  Tt = Tt ◦ Tt−1  (4.23 )  = Tt ◦ Tt−1 ◦ · · · ◦ T2 ◦ T1 .  (4.24)  A proposed transformation function T passes the test if it keeps a set of 2p points located at {±ˆe 1 , ±ˆe 2 , ..., ±ˆe p } within rm ax of the origin, where ˆe q is the unit vector of the qth dimension. T he second test is similar, except that T passes the test if T ◦ Tt−1 keeps the same set of points within rm ax of the origin. T he main idea of these transformations is to prevent the transformations from overly spreading apart the central mode of the components. O ptionally, if more cluster consistency is required, T must pass a second pair of tests similar to the first pair, except these ensure that the set of points {±2ˆe 1 , ±2ˆe 2 , ..., ±2ˆe p } stays within 2rm ax of the origin. T his second round of tests place more severe restrictions on some of the translation functions, as many of these apply a much larger translations to points farther from the center. When constructing the sequence of transformations, we fix initially how the specified number of each type is ordered. H owever, when setting the parameters for one particular transformation, we redraw the component(s) and transformation parameters until it passes the required tests. Figure 4.1 show three plots of the pdf of a 5 component mixture model. T he input parameters are identical except for the use of these tests. T he left one shows the mixture model with no tests, the middle with transformations filtered using the first pair of tests, and the right plot with transformations filtered using both pairs of tests.  4.4.3  R otation  A rotation function Tro tatio n rotates two components m1 and m2 of the the imput vector yt by θ. In choosing a function from this class, we choose m1 , m2 ∼ U n {1,...,p} , m1 = m2 and θ ∼ U n [0 ,2π) , then the two entries in the point vector would be yt+1,m1 yt+1,m2  =  cos θ sin θ − sin θ cos θ  T he full transformed point would then be:  82  yt,m1 yt,m2  (4.25)  Mixture Model pdf, No Transform Spread Tests  Mixture Model pdf, Both Transform Spread Tests  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  −4 −4  4  (a) the pdf of a mix ture model with no transformation check s.  −3  −2  −1  0  1  2  3  4  (b) the pdf of a mix ture model with only fi rst order transformation check s.  Mixture Model pdf, No Secondary Transform Spread Tests 4  3  2  1  0  −1  −2  −3  −4 −4  −3  −2  −1  0  1  2  3  4  (c) the pdf of a mix ture model with both transformation check s.  Figure 4.1: E xample plots of three 2d mixture models when there are no checks on the transforms, first order checks, and both first and second order checks. Note that with no transformation checks, exceedingly long and thin cluster shapes are much more likely.  Tro tatio n (y) = [y1 , y2 , ..., ym1 −1 , ym1 cos θ + ym2 sin θ, ..., − ym1 cos θ + ym2 sin θ, ym2 +1 , ..., yp ]T (4.26) 83  Note that rotation does not aff ect the normaliz ing constant of the distribution, so Dyt (Tro tatio n ) = 1. T he inverse, needed for the sampling stage, is also easily calculated by inverting the rotation.  4.4.4  Coord inate T ranslation  A translation function Ttran slatio n takes one dimension component and uses it to determine the amount of translation for another component. It is determined by choosing a scalar function f randomly from the set listed below, a base component b of y, a separate mapping component m, and a term A drawn from a G amma distribution indexed by user input parameters. It then maps ym to ym + f (yb ). T hus Tb (y) = [y1 , y2 , ..., ym−1 , ym + f (yb , A), ym+1 , ..., yp ]T  (4.27)  We chose the list below as a non-exhaustive set of function classes that will, with a reasonably high probability, fulfill the criteria given in section 4.4.2. A dditionally, the input parameter, α dictates the expected severity of the transformation; α = 0 has no expected eff ect and α = 1 has a severe expected eff ect (recall that all the distributions before the transformation stage have a variance of 1). 1. f1 (z , A) = Az 2. f2 (z , A) = Az 2 , A ∼ Gamma(α, 1) 3 . f3 (z , A) = Az 3 , A ∼ Gamma(α, 1) 4. f4 (z , A) = eAz − 1, A ∼ Gamma(α, 1) where A ∼ Gamma(α, 1) is the severity of the eff ect and α is a user input parameter. Note that the G amma function is parameteriz ed so E A = V ar A = α. T he J acobian in this case is the detriment of the identity matrix with one non-z ero off -diagonal and is thus 1. T he inverse is easy to compute: Tb−1 (y) = [y1 , y2 , ..., ym−1 , ym − f (yb ), ym+1 , ..., yp ]T  4.4.5  (4.28)  Coord inate Scaling  A function Tsc alin g chooses a random component m of y and scales it by an amount A. T hus 84  Tt (y) = [y1 , y2 , ..., ym−1 , Aym , ym+1 , ..., yd ]T  (4.29)  T he scaling factor A is drawn from a G amma distribution parameteriz ed so that E A = 1 and V ar A = α. T hus α = 0 again denotes no expected eff ect and α = 1 denotes relatively severe expected eff ect. We are only modifying one component, so the J acobian for this function is just Dy Tsc alin g = A  (4.3 0)  T he inverse is also trivial: Tsc−1alin g (y) = y1 , y2 , ..., ym−1 ,  4.4.6  ym , ym+1 , ..., yd A  T  (4.3 1)  E x am p le Cluster Shap ing P aram eters  We show in Figure 4.2 mixture models, with corresponding samples, having values of the transformation severity parameter α set at 0.1, 0.2, 0.4, and 0.75. T he number of components is 5, and both sσ and sw were set to 1. A s can be seen, this parameter has a significant eff ect on the type of mixture model produced. In Figure 4.3 , we show the pdf and corresponding samples for a 5 component mixture models as a function of the number of transformations applied per dimension component. T his parameter, along with α, has the most infl uence over the component shapes. In general, the more transformations, the less the components resemble G aussian distributions.  4.5  Stage 3: A d justing the P rop osed Cluster Distribution  O ne postcondition of the model generated by our procedure is that all components are suffi ciently well separated, i.e. Sjk ≤ β ∀ j, k  (4.3 2)  T he first step in ensuring this condition is met was to set the mean of the cluster distribution. H owever, when drawing the component variances from a distribution with nonz ero variance and shaping the components, there is no guarantee that this is enough. T hus to meet the condition, we selectively 85  Mixture Model pdf, α = 0.100000  Sample of 1000 points, α = 0.100000  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (a) A 2d mix ture model with corresponding samples and α = 0 .1 Mixture Model pdf, α = 0.200000  Sample of 1000 points, α = 0.200000  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (b) A 2d mix ture model with corresponding samples and α = 0 .2  Figure 4.2: 2d mixture models with corresponding samples (2000) as a function of the transform severity parameter α shrink the variances of the components until equation (4.3 2) is satisfied. Recall that the shaping procedure is independent of the variance, so the rest of our process is unaff ected. T o shrink this, we give every component a score based on how severely it contributes to violating equation (4.3 2). Specifically, the score Sj for component j is max {0, Sjk − β},  Sj =  (4.3 3 )  k=j  We then shrink the component with the highest score until its score is 0. T his causes all the component pairs that component j is a member of to 86  Mixture Model pdf, α = 0.400000  Sample of 1000 points, α = 0.400000  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (c) A 2d mix ture model with corresponding samples and α = 0 .4 Mixture Model pdf, α = 0.750000  Sample of 1000 points, α = 0.750000  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (d) A 2d mix ture model with corresponding samples and α = 0 .7 5  Figure 4.2: 2d mixture models with corresponding samples (2000). T he severity of the transformations is far greater than the previous two in Figure 4.2. satisfy equation (4.3 2). T his, of course, will also reduce the score of other components. We then recalculate the scores as needed and repeat the procedure, stopping when there are no violations.  4.6  Sam p ling from the Distribution  T he only remaining process to describe is how to sample a set of N points (X1 , X2 , ..., Xn from h(x). T here are two aspects to this process; the first is choosing the number of points to draw from each component, and the second is sampling from a given component. 87  Mixture Model pdf, Transforms-per-dim = 0  Sample of 1000 points, Transforms-per-dim = 0  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (a) A 2d mix ture model with corresponding samples and no transforms per dimension. Mixture Model pdf, Transforms-per-dim = 2  Sample of 1000 points, Transforms-per-dim = 2  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (b) A 2d mix ture model with corresponding samples and 2 transforms per dimension.  Figure 4.3 : 2d mixture models with corresponding samples (2000) as a function of the number of transformations per dimension. T he severity of the transformations was set to 0.4, and all the types had equal representation.  4.6 .1  Determ ining Com p onent Sam p le Siz es  We draw the number of points for the clusters from a type of truncated multinomial distribution where an input parameter, nm in , specifies the minimum number of points drawn from each component. A llowing this as an input parameter helps ensure that all the modes of the distribution are represented, in some way, in the final dataset. In this section, we present the algorithm we use to generate the cluster siz ing information. 88  Mixture Model pdf, Transforms-per-dim = 5  Sample of 1000 points, Transforms-per-dim = 5  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (c) A 2d mix ture model with corresponding samples and 5 transforms per dimension. Mixture Model pdf, Transforms-per-dim = 12  Sample of 1000 points, Transforms-per-dim = 12  4  4  3  3  2  2  1  1  0  0  −1  −1  −2  −2  −3  −3  −4 −4  −3  −2  −1  0  1  2  3  4  −4 −4  −3  −2  −1  0  1  2  3  4  (d) A 2d mix ture model with corresponding samples and 12 transforms per dimension.  Figure 4.3 : 2d mixture models with corresponding samples (2000) as a function of the number of transformations per dimension. We ensure that at least nm in points are drawn from each component using a simple recursive algorithm that is guaranteed to produce the number of points to be drawn from each cluster provided that nm in Ktru e ≤ N . T he main idea is simple. We start with a vector n = (n1 , n2 , ..., nK ) drawn from a multinomial distribution with weights set as describe in section 4.3 .3 . If any of the cluster siz ings have less than nm in points, we set these to nm in and then randomly decrease the siz es of clusters that have more than nm in points so that the total number of points is still n. We then repeat this until all the clusters have siz e nm in or greater. We present the algorithm more  89  A lg o rith m 6 : Determining the number of points to assign to draw from each mixture model component. Inp u t: T he total number of points N , the minimum number of points nm in , and a vector w = (w1 , w2 , ..., wK ) giving the weighting of each component. O u tp u t: A vector n = (n1 , n2 , ..., nK ) giving the number of points to be drawn from each component. a sse rt N ≥ nm in ∗ K (n1 , n2 , ..., nK ) ← M ultinomial(N, w) wh ile ∃ i s.t. ni < nm in d o A ← {i : ni < nm in } B ← {i : ni ≥ nm in + 1} m ← i∈A nm in − ni fo r i ∈ A d o ni = nm in d ← M ultinomial(m, [max {0, ni − nm fo r i ∈ B d o ni ← ni − di e nd  in }/  i max {0, ni  − nm  in }])  re tu rn n formally in algorithm (6).  4.6 .2  Sam p ling from the Com p onents  Now all that is left is to sample nj points from hj (x) for each cluster j. We can do this easily by drawing samples from the base distribution gj and then transforming them by inverting the transformation functions. Formally, we have that X ∼ g(x) hj (y) ∝ g Tj  (4.3 4) y − µj σj  (4.3 5)  so if X ∼ gj (x), Y = σj Tj−1 (X) + µj ∼ hj (y)  (4.3 6)  Because each of the transforms described in section 4.4 is invertible, we are able to effi ciently sample from the distribution and thus easily create the dataset. 90  4.7  U ser Set P aram eters  O ur procedure has a collection of tunable parameters that control the shape of the mixture model components, how well separated they are, and, in general, the diffi culty of a sampled dataset for both clustering algorithms and validation procedures. M ost of the parameters indexing the diffi culty take values between 0 and 1, with larger values indicating a more diffi cult problem. We summariz e the parameters in the tables below: M ixture M odel P roperties Ktru e T he number of mixture components. p T he dimension of the space. β T he minimum separation between modes, as defined in equation (4.3 ). If β = 0, the components are maximumly separated (reducing the components to delta functions), and β = 1 denotes no separation required. sσ Controls the spread of the cluster variances as outlined in section 4.3 .3 . sw Controls the spread of the cluster weights around 1/Ktru e as described in section 4.3 .3 . nm in T he minimum number of points in a cluster; if any cluster has fewer points than this, the weights and number of points in the clusters are redrawn. M ixture Component P roperties τro tatio n T he average number of rotation operations aff ecting each entry in the point vector. τsc alin g T he average number of scaling operations aff ecting each entry in the point vector. τtran slatio n T he average number of translation operations aff ecting each entry in the point vector. α  4.8  T he expected severity of each transformation. If α = 0 there is no change; α = 1 indicates a severe expected transformation.  Conclusion  Now that we have outlined a reasonably reliable way of generating synthetic data for testing clustering and cluster validation procures. In particular, 91  while our method can produce data with regular G aussian components, its strength is in producing data with clusters that can not easily be modeled by any of the standard distributions. We are now ready to compare our method with all the others using this data.  92  Chap ter 5  T esting and Verifi cation We show that our approach compares quite favorably to other leading methods. We tested our method extensively on two classes of data, A NO V A data with symmetric G aussian mixture components, and shaped data, where the data was non-G aussian and shaped by the procedure presented in chapter 4. We found that our method outperformed all the other methods when the data was shaped, and outperformed the data perturbation methods and was nearly comparable to the gap statistic when the points were drawn from spherical G aussians. T his was the expected behavior, as the squared-error cost function in the gap statistic would work far better when the clusters were symmetric than when they were shaped. O ur method makes only weak modeling assumptions and thus performs significantly better on data that does not fit the modeling assumptions made by other methods. We also tested three varieties of data perturbation methods, and found them to all work fairly well across both A NO V A and shaped data, consistently proving to be decent but slightly less accurate than our method. We first present in more detail each of the methods tested in section 5.1. We describe the setup of the tests in the next section. T hen, in section 5.3 , we present a detailed analysis of the results, and in section 5.4 we conclude with a summary discussion.  5 .1  Method s T ested  We compare three classes of cluster validity estimation techniques. T he first is our method, which has several variants depending on the type of baseline distance matrix, summary statistic, and prediction method. T he second is  93  the gap statistic, described in detail in section 1.3 . Finally, we compare three diff erent types of data perturbation using both the H ubert-A rabie adjusted Rand index and the V ariation of Information. T hese methods are variants of one of the more popular data perturbation methods, subsampling. While our tests against data perturbation methods are far from exhaustive – such a comparison would be quite diffi cult – we believe the results on these methods are indicative of many other data stability approaches. A ll of these methods are outlined in more detail in the following sections. E ach of the techniques above is a method to get a stability estimate for a single clustering. T o use these clusterings to predict the number of clusters in a dataset, we generate a set of candidate clusterings of the data, corresponding to a range of possible numbers of clusters, and calculate the specified validity indices of each. We then use this set of validity estimates to predict the number of clusters; this stage is discussed in section 5.1.4.  5 .1 .1  B ay esian Cluster Valid ation Method s  From our method, we estimate the validity of a clustering using the averaged pointwise stability as described in section 2.2.2. We use this in combination with scaled distance perturbations (section 3 .1) with a location exponential prior (section 3 .3 ). T he stability of the clustering is the diff erence in average pointwise stability between the given data and a null baseline clustering. We tested the three diff erent baseline types described in section 3 .2.3 : a reclustered uniform sampling in the bounding box of the data, which we label as A P W-RC, a similar uniform sampling but recycling the cluster centers of the clustered data (A P W-U ), and a baseline distance matrix formed by permuting the original (A P W-P erm). Between these three, we found that in the lower dimensions and data with simpler models, the A P W-RC tended to work best followed by the A P W-U and A P W-P erm, but, as the dimension or model complexity increased, A P W-P erm tended to dominate.  5 .1 .2  G ap Statistic  T he second method we tested was gap statistic, described in detail in section 1.3 . It uses a reclustered uniform distribution in the bounding box of the data as its baseline.1 G iven the gaps for each k, we test the method using two predictors. First is the one the authors propose, and the second picks 1 We did attempt to use the gap statistic with the other two baseline types described in in reference to our method, but doing this causes it to perform horribly so we omit those results.  94  the smallest k within a standard deviation of the best. T hese will be be described more in section 5.1.4.  5 .1 .3  Data P erturbation  We chose to compare our method against three variants of the subsampling techniques described in section 1.2.1. Recall that subsampling, closely related to bootstrapping, draws n separate datasets from the original data, each containing a subset of the original points. Running each method consists of cluster a set of subsampled datasets, comparing each to the unperturbed clustering using the index, and taking the average as the final stability estimate. A ll of the methods below use one of the similarity indices described in section 1.2.2 to compare the perturbed clustering with the original clustering. T he first is the H ubert-A rabie adjusted Rand index and the second is the V ariation of Information. In general, we found the second to perform better. In the first method, we randomly draw a subset of the data, recluster it, and compare the resulting labeling against the labeling of the same points in the unperturbed data set. We compared this using 10% , 20% , 3 0% , 40% , 50% , 60% , 70% , 80% , and 90% of the original points, for 9 stability indices in total. T he final stability index vk is the average of these, and sk is the population standard deviation. We label this method as SS-Siz eRange. T he second method is more careful about which points it chooses. For it, we break the data into 10 non-overlapping folds and cluster the data from 9 of these folds at each run, leaving out a diff erent fold each time for 10 total runs. E ach of these runs is then evaluated based on how well they predict the labels on the original data. A gain, the final stability index vk is the average of these, and sk is the population standard deviation. T his method we call SS-Folds. T he third method is similar to SS-Siz eRange, but diff ers in that there are 10 independent draws of n/2 points each rather than a variety of subset siz es. T he rational here, like bootstrapping, is that we’re simulating multiple samples of the same data set. T he final stability index vk and standard deviation vk are set the same way. We’ll call this method SS-Draws later on, and, since we present more detailed analysis that involves the two diff erent similarity indices, we refer to the one using the H ubert-A rabie adjusted Rand index as SS-Draws-A R and the one using the V ariation of Information as SS-Draws-V I.  95  5 .1 .4  P red ictors  E ach of the above techniques yields a list of validity estimates,2 where each k tested has a validity index vk . A dditionally, recall that each validity estimate has an associated standard deviation sk . G iven the list of validity indices and standard deviations, we test several predictors for each method, where ˆ for the number here a predictor is simply a way to extract an estimate K of clusters from the lists vk and sk . For each validation method, we present results from all the predictors that work reasonably well on at least some of the datasets. E stimating the number of clusters is done using one of three methods, ˆ estimator and validity estimation work. though not all combinations of K T he first method is to simply choose the k that has the best validity estimate (we refer to this estimation method as BestK ). T he first predictor, which we call BestK , is an obvious first choice. ˆ as the k that has the highest validity index vk . T his apIt chooses K proach tended to work best for the data perturbation approaches, but lost to StdBeforeBest when used with our method. T he second predictor, which we label StdBeforeBest, is a standard method in machine learning. For it, we estimate k as the smallest k that is within one standard deviation of the best, i.e. ˆ = min {k : vk ≥ v − s } , where K  = argmax v  (5.1)  T his method usually outperformed BestK with our method, but lost on the data perturbation methods. It is the recommended method for our cluster validation approach. T he third method, proposed as part of the original gap statistic, looks for the place in the curve where the validity index levels off and stops improving with increasing k. Specifically, ˆ = min {k : vk ≥ vk+1 − sk+1 } K  (5.2)  T his prediction method only worked in combination with the gap statistic, so we omit results using it in combination with the other methods. 2  While the perturbation based methods produce stability indices, we here use validity indices to include the gap statistic, which is not stability based.  96  5 .2  T est Setup  A s we discuss in chapter 1, the canonical test for cluster validation methods is how well they predict the true number of clusters in a given dataset. While this method is somewhat artificial – often, in real data, it is diffi cult to even say what the true number of clusters is – it does allow for some helpful comparisons. H owever, in much of the literature, the tests on synthetic data (and on real data) have been on only a handful of datasets and thus the comparisons have revealed limited information about the methods. O ur simulation is massive by comparison; in total we compare the performance of each of the methods on 22,500 datasets of diff ering type, dimension, number of true clusters, sample siz e, and cluster shape. T hese help reveal the strengths and weaknesses of each method.  5 .2.1  Data T y p es  A s mentioned, we test the methods using two types of data, the first being A NO V A style data with spherical cluster centers, and the second being nonG aussian data shaped according to the method described in chapter 4. For each of these types, we generate datasets having 2, 10, 20, 50, and 100 dimensions and having 2 through 16 true clusters Ktru e . For each type, dimension, and Ktru e , we do 100 runs with 750 points and 50 runs with 100 points to test both medium and small siz ed data sets. T hus there are 2 × 5 × 15 × 150 = 22500 datasets total. P a ra m e te rs fo r S h a p e d D a ta Recall from chapter 4 that a handful of parameters govern the shape and “ diffi culty” of the generated mixture model as a clustering problem and a cluster validation problem. We found that the relative performance of our methods were not very sensitive to these input parameters; we thus adjusted them so that the high score on each data type indicated good but not perfect performance (in the 7-10 range according to the scoring system described below).  5 .2.2  Clustering  We used a slight variation on vanilla k-means as our primary clustering algorithm. O ur results could potentially be limited by the accuracy of the clustering algorithm, as an incorrectly clustering a data set Ktru e clusters with Ktru e centroids should not be any more stable than clustering it into the 97  incorrect number of clusters. T hus, to minimiz e this potential problem and ensure that each run gave excellent results, our clustering function returns the best clustering out of 20 runs of k-means as measured by the k-means cost function. T his seemed to give excellent results.  5 .2.3  Q uantitativ e E v aluation  We present the results for each method in several ways. T he first scores each method according to its cumulative accuracy on each type and dimension of data, summariz ing how closely it predicts Ktru e on each run and each Ktru e in a given set, normally 2-16. T his presentation provides a good summary of how well each method works. T he second way we present the results is as ˆ a given method a so-called confusion table that shows how many clusters K predicts as a function of the true number of clusters Ktru e . ˆ for We observed that when a method incorrectly predicts a value K Ktru e , often the predicted value is way off and becomes essentially random ˆ T o keep such estimates from skewing the across all the possible values of K. results, we constructed a scoring function that is a discrete variant of the Smoothly Clipped A bsolute Deviation (SCA D) loss function [FL01, Fan97], ˆ a score as follows: where we give each estimate K ˆ − Ktru e | |K 0 1 2 >2  Score 10 5 1 0  T able 5.1: T he possible scores of the cluster validation methods.  T he final score of a method is the average score across all 100 runs and the possible values of Ktru e . T o present a more detailed picture of how a certain method behaves, we will also present a table giving how often the method predicted the a given number of versus the true number of clusters. In such a table, a perfect ˆ = Ktru e , and z eros method would have 1’s along the diagonal where K elsewhere. T his type of table reveals significantly more about the behavior of each of the methods. For example, some methods perform well when Ktru e 98  is small (< 8) but break down when Ktru e is larger (> 10). We present and describe such results in more specific detail later in this chapter.  5 .3  R esults  T he results of our simulation, were, overall, quite favorable toward our method. U sing the diff erence in average pointwise stability between a baseline distance matrix and data performed comparable to the gap statistic and the data perturbation methods on most cases, and in many cases performed substantially better. In the next four sections, we present the results for two classes of data, A NO V A and shaped, with two diff erent sample siz es, 750 points and 100 points. T he A NO V A data, which allows one to make much stronger modeling assumptions, is perhaps easier. O ur shaped data does not readily fit any simple models and thus makes things more much more diffi cult for model based methods such as the gap statistic. In general, our methods beat the gap statistic and data perturbation approaches while generally matching the gap statistic on A NO V A data. E ach entry in the tables described in sections 5.3 .1 - 5.3 .4 represents the average score of that method across 100 runs at each of 2-16 true mixture components, for an average over 1500 datasets in total. T he test k ranges from 2 through 20 clusters on each dataset. In these tables, boldface entries denote the best performing method, and darker shades of blue (gray) denote higher scores. In section 5.3 .5, we present a more comprehensive and detailed analysis of how each method performs by showing histogram-like tables of how many ˆ when the true number of times each of the methods predicted a given K clusters was Ktru e . T his reveals more insight into the various methods, e.g. whether they tend to predict too high or too low or whether they are better at higher or lower k. Finally, in section 5.4, we summariz e the important points of the results.  5 .3.1  A N O VA Data, Med ium Sam p le Siz e  We begin with the easiest data type, given in T able (5.2). T his table shores how diff erent methods scored on A NO V A type data of varying dimension and 750 points in each dataset. T he separation index, as given by equation (4.3 ), was 0.6. In this particular case, our methods do quite well, with one fl avor of A P W beating or matching all other methods at every dimension  99  U niform G ap Draws SS  Folds Siz eRange  100D A nova  RC U niform  50D A nova  A PW  20D A nova  P erm  10D A nova  Subtype  2D A nova  T ype  9.23 9.42 9 .6 3 Best 9.62 9.22 Best 9.59 4.3 9 Best 9.61 7.10 6.84 5.05 4.88 6.3 6 6.22  9 .8 8 9 .8 8 9 .8 8 9 .8 8 9.87 9 .8 8 9 .8 8 9 .8 8 9.64 9.65 8.81 8.76 9.66 9.62  9 .4 4 9 .4 4 9 .4 4 9.43 9 .4 4 9 .4 4 9 .4 4 9.40 9.13 9.3 5 6.83 7.81 8.94 9.25  9.3 8 9 .4 0 9.3 8 9.3 7 9 .4 0 9.3 9 9.3 8 9.3 4 8.95 9.20 7.59 8.44 8.91 9.20  8.86 8.86 8.87 8.86 8 .9 0 8.87 8.89 8.76 8.3 2 8.71 6.84 7.92 8.12 8.59  P redictor Best Std Before Best Std Before Best Std Before Regular Std Before Best A R Best V I Best A R Best V I Best A R Best V I  Best  T able 5.2: Score results for A NO V A type data with 750 points. Boldface numbers indicate the best performing method(s) one each data type, and darker shades of blue (gray) indicate higher scores. tested.3  5 .3.2  A N O VA Data, Sm all Sam p le Siz e  T he results change slightly when the sample siz e is suffi ciently small. In particular, one fl avor of the gap statistic has the best score in 3 out of 5 of the dimensions. Several fl avors of our method, though, were strong contenders; A P W-RC-Best in particular, was consistently close or better. We suspect that the good performance of the gap statistic here is indicative of its strong modeling assumptions. T hese assumptions hold perfectly 3  T he astute reader will observe that in 10 and 2o dimensions many of the methods received the same score. T his may be due to wrong clusterings with the correct k, which can defi nitely occur. If they do, then it’s possible that all the methods get 10 0 % on the runs with a clustering for the correct k but predictably fail on the clusterings in which it is wrong. F uture analysis will deal more directly with this issue.  100  U niform G ap Draws SS  Folds Siz eRange  100D A nova  RC U niform  50D A nova  A PW  20D A nova  P erm  10D A nova  Subtype  2D A nova  T ype  2.74 6.24 8.10 Best 6.92 2.3 0 Best 6.26 3 .05 Best 8 .2 4 5.02 5.72 3 .3 9 3 .45 3 .94 4.42  4.93 7.65 8.00 6.64 3 .29 6.58 6.43 8 .2 7 6.66 6.69 5.75 5.54 6.41 6.53  7.62 8.3 4 8 .4 9 8.11 6.91 8.06 8.04 8.10 7.53 7.88 6.53 6.97 7.24 7.61  5.79 6.91 7 .3 9 6.80 4.3 9 6.48 6.78 6.91 7.14 6.41 6.21 6.40 6.63 6.47  7.68 7.87 8.18 8.04 7.71 8.12 8 .2 6 7.60 7.73 8.19 6.61 7.3 2 7.3 0 7.82  P redictor Best Std Before Best Std Before Best Std Before Regular Std Before Best A R Best V I Best A R Best V I Best A R Best V I  Best  T able 5.3 : Score results for small sample siz e, A NO V A type data with several predictors. in this case and play more of a role because of the small sample siz e. O n the shaped data where these assumptions do not hold, however, the gap statistic performs significantly worse. It’s also noteworthy that with regards to the A P W-RC method, simply predicting the most stable k wins over picking the one that is within one standard deviation of the best. T his is the one case we observed where the latter is better than or matches the former.  5 .3.3  Shap ed Data, Med ium Sam p le Siz e  When the data is shaped and the clusters are no longer G aussian, the story changes. A s shown in T able (5.4), the best method is always within our proposed class of validation methods. M ost notably, in 2 and 100 dimensions, every variant of our approach wins over all the data perturbation methods and the standard version of the gap statistic. If we restrict ourselves to using 101  U niform G ap Draws SS  Folds Siz eRange  100D Shaped  RC U niform  6.06 7.84 8.56 Best 8 .8 0 6.64 Best 8.04 5.48 Best 7.3 8 5.49 5.40 3 .40 3 .45 4.69 4.66 Best  50D Shaped  A PW  Best Std Before Best Std Before Best Std Before Regular Std Before Best A R Best V I Best A R Best V I Best A R Best V I  20D Shaped  P erm  P redictor  10D Shaped  Subtype  2D Shaped  T ype  7.3 1 8 .5 5 7.54 8.29 7.90 8.26 6.65 1.3 4 7.3 3 7.46 5.45 5.63 7.08 7.20  7.64 8 .5 7 7.79 8.3 3 8.04 8.40 6.23 0.77 7.3 0 7.87 5.02 6.04 7.22 7.72  6.79 7 .8 0 7.12 7.54 7.12 7.52 5.47 0.46 6.59 7.28 4.70 6.18 6.3 0 7.02  7.21 7.26 7 .4 6 7 .4 6 7.45 7.45 6.91 1.46 6.3 0 7.10 4.47 5.93 6.17 6.94  T able 5.4: Score results for Shaped data with several predictors. the StdBeforeBest predictor – note that it outperforms BestK in every case – the same is true at every dimension. Within our method, A P W-P erm, the fastest one computationally, is the best in 10, 20, and 50 dimensions, and A P W-RC is the best on the other two sets.4 T he data perturbations methods, in this case, consistently outperformed the gap statistic but did not match our method. E xcept in 2 dimensions, results using the V ariation of Information were better than results using the H ubert-A rabie adjusted Rand index. A lso, note that the best method tended to be SS-Draws; we examine this in more detail later. T he proposed version of the gap statistic performed fairly well, outper4 We suspect that A P W-P erm would also be the best at 10 0 dimensions as well if that generated data was more shaped. H owever, when we ran these tests, some optimiz ations in the clustering generation code were not in place and thus we weren’t able to apply as many transformations to the mix ture components as would have been ideal. T hus that data could reasonably be viewed as half way between the shaped data and the A N O V A data.  102  forming some of the data perturbations methods on some counts but never winning over any of our methods. T he one case where the G ap statistic was able to beat all the data perturbation methods and the worst of our methods was in 2 dimensions when using the StdBeforeBest predictor. Note, however, that this predictor performs horribly on the other dimensions.  RC U niform U niform  G ap Draws SS  Folds Siz eRange  1.19 4.59 7.15 Best 7 .9 7 1.16 Best 4.91 3 .84 Best 5.27 5.25 5.94 2.72 3 .01 3 .90 4.46 Best  100D Shaped  A PW  Best Std Before Best Std Before Best Std Before Regular Std Before Best A R Best V I Best A R Best V I Best A R Best V I  50D Shaped  P erm  P redictor  20D Shaped  Subtype  10D Shaped  T ype  Shap ed Data, Sm all Sam p le Siz e  2D Shaped  5 .3.4  2.89 6.23 6.95 7 .0 4 2.09 5.18 5.96 0.99 6.62 6.3 1 4.65 4.62 5.87 5.85  5.81 7.56 7.61 7 .6 8 4.91 6.85 6.55 0.85 7.17 7.3 3 5.51 5.92 6.46 6.87  6.74 7.3 6 7.54 7.74 6.75 7.27 6.3 8 0.72 7.40 7 .8 7 5.87 6.57 6.90 7.3 5  6.86 7.00 7.56 7 .6 2 7.13 7.3 0 7.19 1.68 7.24 7.58 6.01 6.90 6.95 7.54  T able 5.5: T he score results for Shaped data with several predictors. H aving a smaller sample siz e (T able (5.5)) shifts the results slightly, though the relative performance between the classes is largely the same. In this case, A P W-RC, instead of A P W-P erm, is usually the best (with the exception of the runs in 50 dimensions, where SS-Draws-V I wins). T his is somewhat expected – A P W-P erm, in building the baseline on permutations of the original distance matrix, is the least parametric of the methods and thus requires more data to achieve accuracy comparable to methods that make more assumptions. 103  T he gap statistic and data perturbation methods tended to have similar relative scores to the results in the previous section. We don’t have a good explanation for why SS-Draws-V I was best on the 50 dimensional data; this remains an open question.  5 .3.5  Detailed Com p arison  While the tables in the previous sections off er an excellent summary of the performance of each approach. K nowing whether a method tends to overestimate or underestimate the number of clusters, or whether it performs well only when Ktru e is small, is valuable information. T herefore, in this section, we present histogram-like tables that give this information. We refer to these results as sampling distributions. Due to space constraints, we present tables for only a subset of the possible combinations of method, predictor, data type, and dimensions. T hese are to highlight various properties of the methods and what issues one must be aware of. In general, we try to look at the extremes that most illustrate the tendencies of the various methods. T hus we restrict ourselves to 2 and 100 dimensional data but examine the diff erent combinations within these dimensions more thoroughly. A lso, we found that the sampling distribution of SS-Draws-V I was generally representative of the other data perturbation methods, so we restrict ourselves to examining it. 2 D A N O V A , 7 5 0 P o ints T he first case we examine is the 2d A NO V A data with 750 points per dataset. T his data is fairly easy to cluster, and many of the methods had no problem distinguishing the correct number of clusters. In T able (5.6), we compare A P W with two diff erent baselines. T he first is the reclustered uniform distribution on the P CA bounding box of the data. Recall that this is the same baseline that the gap statistic uses. In lower dimensions, and with small sample siz es, this seems to work better than the other two baselines. P ossibly, this is because the structure in the original data has less infl uence on this baseline than using a uniform distribution but recycling the cluster centers or simply permuting the distance matrix. H owever, note that in only one case the prediction is far wrong from the truth. In contrast, the prediction distribution of the data perturbation method using SS-Draws and a V ariation of Information index, which we show in T able (5.7), tended to be way off if it got the result wrong. T his indicates 104  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 3 .9 5  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .11 .7 8  .11  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .11 .7 5  .13  ·  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .25 .5 7  .17  ·  ·  ·  .0 1 .0 1  .13 .8 4  .0 3  (a) A P W-R C ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 3 .9 7  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  .0 5 .9 4  4  .0 2 .0 2 .8 9  .0 3 .0 2 .0 1  5  ·  ·  6  ·  ·  .0 3 .9 7 ·  7  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .9 2  12  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .0 4 .9 1 .0 2 .0 2  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .9 9  .0 3 .9 7  .0 1 .9 8  .0 1 .9 9  .0 9 .8 2  .0 9  .0 5 .7 6  .18 .0 1  .7 1 .18 .0 1 .17 .5 6  .25 .0 2  (b) APW-Perm  Table 5.6: Sampling distribution of APW-RC and APW-Perm on 2D ANOVA data with 750 points using StdBeforeBest. there are effects in the data that can significantly confuse this method. While it is likely that the data perturbation methods could be further tweaked to give substantially better results, such a step would require careful thought when the ground truth is not known. On this particular set of data, the gap statistic (Table (5.8)), where the prediction was done using StdBeforeBestrule does remarkably well, matched in this case only by APW-RC. However, the original rule of the gap statistic performs horribly. 105  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  .12 .8 8  4  .19  ·  .8 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .10  .0 3  ·  .8 7  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .15 .0 2  ·  ·  .8 3  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  .13 .0 3 .0 1  ·  ·  .8 3  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  .14 .0 5  ·  ·  ·  ·  .8 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  .15 .0 1  ·  ·  ·  ·  ·  .8 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  .10  .0 9 .0 6 .0 1  ·  ·  ·  ·  .7 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  .16 .0 9 .0 6 .0 4  ·  ·  ·  ·  ·  .6 5  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  .16 .0 8 .10  ·  .0 1  ·  ·  ·  .6 1 .0 1  ·  ·  ·  ·  ·  ·  ·  13  .17 .19 .0 5 .0 9 .0 3 .0 1 .0 1  ·  ·  ·  .0 3 .3 9  .0 3  ·  ·  ·  ·  ·  ·  14  .23 .0 9 .10  ·  ·  ·  ·  .0 1 .3 1 .13  ·  ·  ·  ·  ·  15  .34 .13 .0 7 .0 3 .0 5 .0 4 .0 1  ·  ·  ·  ·  .0 1  ·  .2 3  .0 9  ·  ·  ·  ·  16  .27 .21 .13 .0 6 .0 4  ·  ·  ·  ·  ·  ·  .0 2 .17  .10  ·  ·  ·  .0 2 .0 1  .0 5 .0 5 .0 3 ·  · ·  Table 5.7: Sampling distribution of SS-Draws-VI on 2D ANOVA data with 750 points. 2D ANOVA, 100 Points There are several features of the ANOVA data with a lower sample siz e (100 points) that are noteworthy. The first is that one fl avor of the gap statistic is the winner in 3 our of 5 of the cases. M ore specifically, the gap using StdBeforeBest seems to do the best out of all the methods in 2 and 10 dimensions but gets steadily worse; the regular G ap improves, beating all other methods in 100 dimensions but performing poorly on 2 dimensional data. The other interesting feature is that, contrary to the norm, APWRC-Best beats APW-RC. In the following tables, we discuss these results in more detail. In addition, we describe this effect and furthermore compare APW-RC and APW-RC-Best to APW-Perm. Table (5.9 ) compares two methods of prediction using the best performing of all our methods. As can be seen by comparing Table (5.9 a) and Table (5.9 b), APW-RC, with the StdBeforeBest rule, tends to choose one or two k’s too small. This could be because the sample siz e does not provide a reliable baseline, but the ex act cause is diffi cult to determine. Regardless, we note that the number of times APW-RC is significantly far off is still quite small. APW-Perm, however, tended to greatly overestimate the number of clusters when it got it wrong, as seen in Table (5.10). E ven though APW-Perm achieves a reasonable score – 6.24 compared to 6.9 2 for APW-RC – its predictions are significantly less reliable. This is likely due to the permutation being unable to mask the significant structure present in the original distance matrix . This would be more of a problem 106  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .0 1  ·  ·  .9 9  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .0 5 .32  ·  ·  .6 3  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  .0 2 .48  ·  ·  ·  .5 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  .10  ·  ·  ·  ·  .3 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  .12 .56 .0 2  ·  ·  ·  ·  .3 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .52  10  .0 7 .65 .0 5 .0 1  ·  ·  ·  ·  .2 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  .11 .63 .0 3 .0 2  ·  ·  ·  ·  ·  .2 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  .18 .59 .0 2 .0 4 .0 1  ·  ·  ·  ·  .0 1 .15  ·  ·  ·  ·  ·  ·  ·  ·  13  .12 .67 .0 4 .0 4 .0 1 .0 1  ·  ·  ·  ·  ·  .11  ·  ·  ·  ·  ·  ·  ·  14  .17 .68  ·  ·  ·  ·  ·  .0 3 .0 3  ·  ·  ·  ·  ·  ·  15  .24 .66 .0 1 .0 3  ·  .0 3  ·  ·  ·  ·  ·  ·  .0 1 .0 2  ·  ·  ·  ·  ·  16  .21 .66 .0 3 .0 4  ·  ·  ·  ·  .0 1  ·  ·  ·  .0 1 .0 4  ·  ·  ·  ·  ·  ·  .0 5 .0 1 .0 3  (a ) G a p sta tistic . ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .9 5  .0 4 .0 1  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .8 7  .13  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .7 9  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .7 6  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .19 .0 2  .22 .0 1  .0 1 .5 7  .36 .0 6  (b) G a p sta tistic , p red ic tio n w ith S td B efo reB est.  Table 5.8: Sampling distributions of the G ap statistic using two prediction rules on 2D ANOVA data with 750 points. on this type of data than any other. The results for the gap statistic are shown in Table (5.11). One noteworthy fact is that the prediction rule proposed as part of the gap statistic performs horribly in this case, failing to get any correct on some of the datasets with many clusters and consistently underestimating the number of clusters in the data. However, using the gap statistic with StdBeforeBest as the prediction rule works quite well. Why this is the case is more diffi cult to say, but it is something to keep in mind when using the gap statistic in 107  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  .0 4 .9 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .0 2  ·  .0 2 .0 6 .9 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  .0 2 .0 4 .18 .7 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .54 .4 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  .0 2  ·  ·  .14 .38 .4 4  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  .0 4 .24 .46 .2 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  .0 8  ·  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  ·  ·  ·  ·  ·  15  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  .0 8 .12 .40  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  .0 6 .28 .6 6  ·  .0 6 .16 .34 .3 6  .0 2 .0 4 .0 4 .34 .32 .10  ·  .12 .0 2  .38 .34 .14  .0 4  .22 .10  .0 2 .0 6 .40  .0 4 .0 2  .24 .12  .0 8 .0 6  (a ) APW-R C , p red ic tio n u sin g S td B efo reB est. ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  .0 4 .9 4  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  .0 2 .0 2 .9 2  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .0 6 .9 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  .0 2 .14 .7 6  .0 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  .16 .5 0  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .22 .3 4  .12 .0 4 .0 2  ·  ·  .0 2  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8 .28 .2 0  .30  .12 .0 2  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 2 .12 .20  .18  .22 .18 .0 4 .0 2  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .18 .5 8  · ·  .16 .0 2  ·  .24 .0 8 .0 2 .20  .0 4 .10  .14 .2 8  .14 .10  ·  .14 .0 6  (b) APW-R C , p red ic tio n u sin g B estK .  Table 5.9 : Sampling distribution of APW-RC on 2D ANOVA data with 100 points. practice. 2D S h a p e d On the 2 dimensional shaped data, the most interesting cases occur in the low sample siz e datasets. We here ex amine a sampling of the most interesting cases. With APW-RC (Table (5.12)), prediction using with StdBeforeBest performs better than prediction using BestK , which, in this case, tends to over108  ˆ → K 2 Ktrue ↓ 2 .8 8  3  4  5  6  7  8  9  10  11  12  13  14  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .4 6  4  ·  ·  .4 8  .0 4 .0 2  ·  ·  ·  .0 2  ·  .0 2  ·  5  ·  ·  .0 8 .5 6  6  ·  ·  ·  .0 6 .7 0  .0 4  7  ·  ·  ·  .0 2 .10  .7 0  8  ·  ·  ·  ·  ·  .12 .7 0  9  ·  ·  ·  ·  ·  .0 2 .32 .5 2  10  ·  ·  ·  ·  ·  ·  .0 2 .36 .4 6  11  ·  ·  ·  ·  ·  ·  .0 2 .0 4 .28 .4 4  12  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  .30  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8 .22 .2 6  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .12 .28 .18  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .0 4 .0 4 ·  .0 2 .0 2  15  16  .0 2 .0 4 .0 6 .0 4 .0 4 .0 8 .10  ·  .0 2 .0 2  ·  .0 4 ·  .0 4 .0 8  18  19  20  ·  ·  .0 2  .0 2 .0 4 .0 6 .0 2 .0 6 .12 .0 6 .0 4 .0 2  ·  ·  .14 .0 2  17  .0 4 .0 2 .0 2  ·  ·  .0 2  ·  ·  .0 2  ·  ·  ·  .0 2  ·  · ·  ·  ·  ·  ·  ·  .0 4  .0 2 .0 2 .0 2 .0 2 .0 8  ·  .0 2  ·  .0 6 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  .10  .0 2 .0 2  .3 4  .14 .0 8 .0 6  ·  .0 4 .0 4 .0 2 ·  .0 4  ·  ·  · ·  .0 2  ·  .0 2 .0 4  ·  ·  .0 4  ·  ·  .0 4 .0 2  ·  ·  .0 2  ·  .0 2 .0 2  ·  ·  .0 2  .0 2 .0 2  ·  .16 .12 .0 6 .0 8  ·  ·  .16 .12 .0 4 .0 4 .0 2 .0 2  .0 2 .14 .14 .3 0 ·  ·  .0 2 .0 2 .0 6 .0 6 .0 4  ·  .0 6 .20  .24 .14 .0 2 .3 0  .20  ·  ·  .14 .0 4 .0 6  Table 5.10: Sampling distribution of APW-Perm, StdBeforeBest, on 2D ANOVA data with 100 points. estimate the number of clusters. This is contrary to the analogous ANOVA results. In this case, using the StdBeforeBest prediction rule is quite justified; not only is the predictor more accurate, but the method does not tend to underestimate. Tables (5.13a) and (5.13b) illustrate the accuracy of using as the baseline distance matrix a permutation of the original. In the low sample siz e case, the effects of the original structure would be harder to mask with the permutation, and this is illustrated here. While both tend to overestimate the number of clusters, the accuracy when the sample siz e is higher is substantially better. Again, we give the sampling distribution of SS-Draws-VI, as this represents the other data perturbation methods. We present the results for the lower sample siz e in Table (5.14 ); they are indicative of the larger sample siz e as well. As in the ANOVA case, it tends to significantly underestimate the number of clusters. On this type of data, neither of the two prediction rules tested makes the gap statistic perform well (Table (5.15)). If the proposed rule is used, the method tends to underestimate the number of clusters. If, however, StdBeforeBest is used for predicting K, the method tends to get more correct but overestimates the cases it gets wrong. Perhaps more sophisticated predictors are possible, but selecting other rules would be a diffi cult task.  109  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  .0 8  ·  .9 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .18 .10  ·  .7 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .30  .0 2  ·  .4 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  .26 .36 .0 4 .0 2 .0 2 .3 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  .46 .42 .0 6 .0 2  ·  ·  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  .50  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .20  .38 .0 6 .0 4  10  .52 .34 .0 4 .0 2 .0 4  ·  ·  .0 4  11  .42 .46 .0 4 .0 2 .0 2  ·  ·  ·  12  .48 .38 .0 6 .0 6  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  13  .58 .38 .0 2  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  .62 .36  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  .68 .28 .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  16  .68 .32  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  · ·  .0 2 .0 2  (a ) G a p sta tistic . ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  .0 6 .9 2  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  .10  .9 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .0 8 .9 0  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  .0 4 .10  .7 6  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  .28 .6 2  .10  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .16 .4 8  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .28 .0 2 .0 2 .0 2  .0 4 .32 .3 2  .24 .0 6  .0 6 .24 .2 0 .0 4 .20 ·  ·  .36 .14 .2 2  .36 .12 .0 6  .0 8 .24 .2 6  .16 .14 .12  ·  (b) G a p sta tistic u sin g S td B efo reB est fo r p red ic tio n .  Table 5.11: Sampling distribution of the gap statistic on 2D ANOVA data with 100 points using the StdBeforeBest prediction rule. 100D ANOVA F or the high dimensional case, we present the prediction distribution tables for the ANOVA and shaped data with 100 points. In general, the results we note here apply also to the 750 sample case, and we didn’t notice any substantial difference between the results. F or conciseness, we omit them. In the higher dimensions, the permutation based methods performed significantly better than in lower dimensions (Table (5.16)). This is likely due to the fact that lower dimensional distributions tend to have a less homo110  ˆ → K 2 Ktrue ↓ 2 .8 6  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .10  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .8 8  .12  4  ·  ·  .9 6  5  ·  ·  ·  .8 6  .14  ·  6  .0 2  ·  ·  ·  .8 2  .10  .0 4 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  .0 2  ·  .7 6  .18 .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  .0 2 .8 6  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .26 .0 4 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  .0 4  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  13  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .3 6  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .0 2 .2 8  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  16  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  .0 4 .6 4  .0 2 .6 4  .20  .0 6 .4 6  .10  .28 .16 .0 4  .14 .3 0  .22 .30  .0 2 .0 2  .32 .10  .12 .0 4 .0 2  .26 .18 .12 .0 4 .0 2 .0 2 .16  .34 .22 .14 .0 2 .0 2  .0 8 .0 8  .18 .20  .28 .16  (a ) APW-R C u sin g B estK fo r p red ic tio n . ˆ → K 2 Ktrue ↓ 2 .9 4  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .9 4  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  .0 2 .9 6  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .0 2  ·  ·  .0 2 .9 4  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  .0 2 .10  .8 4  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  .0 8 .9 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .16 .7 8  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  .0 4 .24 .6 6  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  .0 6 .44 .4 4  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 6 .36 .4 0  ·  ·  ·  ·  ·  ·  13  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .26 .18  .34 .10  .0 4  ·  .0 2  16  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 8 .20  .2 2  .12 .0 4  .0 6 .18 .4 0  .24 .0 8 .0 2  .0 8 .32 .3 2  .14 .0 8 .0 6  .28 .0 8 .0 6 .0 4  (b) APW-R C u sin g S td B efo reB est fo r p red ic tio n .  Table 5.12: Sampling distributions of APW-RC with two different predictors on 2D ANOVA data with 100 points. geneous distribution of distance measures, which would make the original structure harder to disguise using permutations. In comparing APW-Perm with APW-RC, we note from Table (5.17) that the sampling distribution for APW-Perm-Best and APW-RC-Best both seem to do well. The main thing to note here, however, is that APW-Perm, while achieving less accuracy than APW-RC, tends to get fewer far wrong. On the contrary, APW-RC can significantly underestimate the number of clusters when Ktrue is higher, whereas APW-Perm does not have this prob111  ˆ → K 2 Ktrue ↓ 2 .3 0  3  4  5  6  .0 6 .0 2 .0 2 .0 2  7  8  9  10  ·  ·  ·  .0 4 .0 2  11  12 ·  13  14  15  16  17  .0 4 .0 8 .0 4 .0 8 .10  3  ·  .2 0  .10  .0 2 .0 2  ·  4  ·  ·  .3 8  .0 4 .0 6  ·  5  ·  ·  ·  .3 4  .20  .0 8 .0 2 .0 2 .0 6 .0 2 .0 4 .0 2  6  ·  ·  ·  ·  .3 2  .14 .0 8 .0 4  7  ·  ·  ·  ·  .0 2 .5 2  8  ·  ·  ·  ·  ·  18  19  20  .0 8 .0 4 .0 6  .0 2 .0 4 .0 4 .0 4 .0 6 .0 2 .12 .0 6 .0 6 .0 8 .0 6 .0 2 .0 4 ·  .0 4 .0 2 .0 2 .0 6 .0 8 .0 2 .12 ·  ·  .0 6 .0 8  .0 6 .0 8 .0 2 ·  .0 4 .0 2  .0 4 .0 4 .0 4 .12 .0 2 .0 6 .0 4 .0 4  .14 .0 8 .0 4 .0 2 .0 6 .0 2  .0 4 .5 2  · ·  ·  ·  ·  .0 6 .0 2 .0 2  · · .0 2 ·  .12 .0 2 .0 6 .0 2 .0 2 .0 6 .0 2 .0 2 .0 4 .0 2 .0 2 .0 2  9  ·  ·  ·  ·  ·  ·  .10  10  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  .0 2  ·  .10  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .18  .38 .22 .14 .0 2 .0 2  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .10  .2 0  .4 8  .10  .0 2 .4 8  .0 2 .0 2 .0 6 .0 6 .0 4 .0 2 .0 6 .0 4  ·  ·  .24 .0 6 .0 6 .0 4 .0 2  .0 6 .0 2  ·  ·  .0 4 .0 2 .0 6  ·  ·  .3 0  .26 .10  .0 4 .2 4  .10  .16 .20  .0 8 .2 8  ·  ·  .0 6 .12 .0 6 .0 6 .0 4 .0 2  .26 .12 .0 6 .0 8 .0 4 .0 2 .0 6  .0 4 .3 2  .26 .20  .10  .0 4  ·  .0 4  .30  .20  .16 .0 2  17  18  19  ·  ·  ·  ·  ·  .0 1  (a ) APW-Perm w ith 1 0 0 p o in ts p er d a ta set ˆ → K 2 Ktrue ↓ 2 .8 7  3  4  5  6  7  8  9  10  11  12  .0 6  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  .17 .0 6 .0 4 .0 2 .0 2 .0 1 .0 2 .0 1  13  3  ·  .5 9  4  ·  ·  .6 4  5  ·  ·  ·  .7 6  6  ·  ·  ·  ·  .7 4  7  ·  ·  ·  ·  ·  .8 3  8  ·  ·  ·  ·  ·  .0 2 .7 9  9  ·  ·  ·  ·  ·  ·  ·  .8 4  10  ·  ·  ·  ·  ·  ·  ·  ·  .7 9  11  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .7 9  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 9 .0 6  ·  14  15  16  .0 1 .0 2 .0 2 .0 1 ·  ·  .0 1 .0 1 .0 2 .0 1  20  .0 2 .0 1  ·  .0 1  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  .16 .0 4 .0 1 .0 3  ·  .0 1  ·  ·  ·  ·  ·  ·  .0 1  ·  .11 .0 3 .0 1 .0 1 .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .18 .0 3 .0 1  ·  .0 3 .0 3 .0 1 .0 2 .0 2 .0 2 .0 2 .0 1 .0 2 ·  .13 .0 3 .0 1 .0 1  ·  ·  ·  .0 1  ·  ·  ·  ·  .0 9 .0 4 .0 3  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  .16 .0 1 .0 2 .0 1 .14 .0 3 .0 2  .0 1 .7 2  .22 .0 3 .0 1  .0 1 .5 5  .34 .0 5 .0 3 .0 1 .0 1  .0 1 .5 3  .30  .0 1 .3 7 ·  .0 7 .0 6 .0 2 .0 1 .47 .10  .0 1 .3 4  .0 3  ·  · .0 2  .37 .18 .0 7 .0 3  (b) APW-Perm w ith 7 5 0 p o in ts p er d a ta set  Table 5.13: Sampling distribution of APW-Perm, StdBeforeBest, on 2D Shaped data with various sample siz es. lem. Note, furthermore, that comparison of the same method on two sample siz es can be misleading, as the results are not standardiz ed between the two. Thus the comparisons between corresponding values in the histograms are not justified, but qualitative patterns in the histograms are. We are focusing here on the latter. Also, although APW-RC performs better on both dataset siz es, the differences between APW-RC and APW-Perm are more pronounced for lower 112  ˆ → K 2 Ktrue ↓ 2 .9 8  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  .0 6 .9 2  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  .0 8  .8 6  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .0 8 .0 4 .0 8 .0 4 .6 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  .0 6 .0 4 .0 2 .0 4 .16 .6 4  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  .12 .10  .0 2 .0 2  ·  ·  .0 8 .5 2  .10  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  .10  .0 8 .0 2  ·  ·  .0 8 .10  .3 2  .16 .0 4  ·  ·  ·  ·  ·  ·  ·  ·  11  .0 8 .18 .0 2 .0 2 .0 6  ·  .0 2 .0 6 .18 .3 0  ·  ·  ·  ·  ·  ·  ·  ·  12  .14 .12 .0 2 .0 6  ·  ·  ·  ·  ·  ·  ·  13  .10  ·  .0 2  ·  ·  ·  ·  14  .14 .10  ·  ·  ·  ·  ·  15  .12 .0 8 .0 6  ·  ·  ·  16  .0 8 .0 6 .0 2 .0 6  ·  ·  ·  .0 4 .0 8 .7 2  .0 2 .0 4 .0 8  .12 .0 2 .0 4 .0 2 .0 4 .4 8 .10  .0 6 .0 4 .0 4  .0 4 .0 8 .0 2 ·  .0 2 .0 2 ·  ·  ·  ·  .14 .0 4  ·  .0 8  .12 .18 .18  .12 .0 2  .0 2 .0 2 .0 2 .18 .2 4 ·  .16 .10  .0 2 .0 4 .14 .12 .14  ·  .0 2 .0 2  ·  ·  ·  .0 2 .0 2  ·  .0 2  .16  .0 2 .0 4 .12 .12 .2 2 ·  .12 .0 6  .0 4 .0 2 .24 .22 .14  .0 2 .0 4  Table 5.14 : Sampling distribution of SS-Draws-VI on 2D Shaped data with 100 points. sample siz es, again consistent with what we’d ex pect. Note, however, that normally APW-Perm tends to perform best in higher dimensions. On the 100 dimensional ANOVA data with 100 points, SS-Draws-VI performs fairly well, but the gap statistic beats it. As in other cases, it can also significantly underestimate the number of clusters. The gap statistic using StdBeforeBest, while usually fairly close, tends to overestimate the number of clusters by one or two. 100D S h a p e d The prediction tables for the 100 dimensional shaped data have many of the same characteristics as the 2 dimensional shaped data. We present here a few highlights. The data perturbation methods had similar distributions to the lower dimensional cases, so we omit them here. Also, most of the noteworthy characteristics we present also apply to the 100 dimensional shaped data with 750 points. In comparing APW-Perm with APW-RC, we note from Table (5.21) that the sampling distributions for APW-Perm and APW-RC have similar characteristics. Also, although APW-RC performs better on both dataset siz es, the difference is more pronounced for lower sample siz es, again consistent with what we’d ex pect (Note, however, that normally in shaped distributions in higher dimensions, APW-Perm tends to perform best). It’s worthwhile to note that APW-RC tends to underestimate, while APW-Perm tends to overestimate. 113  ˆ → K 2 Ktrue ↓ 2 .5 0  3  4  .10  .10  5  6  7  8  9  10  15  16  17  18  19  20  ·  .0 4  ·  .0 2  ·  ·  .0 2 .0 2 .0 4 .0 4 .0 2 .0 2 .0 2 .0 4 .0 6 .0 2 .0 4  ·  ·  ·  ·  ·  .0 2  .0 2 .0 4 .0 4 .0 2 .0 4 .0 2  11 ·  12  13  14  .0 2 .0 2 .0 2  3  ·  .6 4  .0 2  ·  4  ·  ·  .5 6  .0 6  5  ·  ·  ·  .6 0  6  ·  ·  ·  ·  .4 0  .14  7  ·  ·  ·  ·  ·  .6 2  .10  8  ·  ·  ·  ·  ·  ·  .5 4  9  ·  ·  ·  ·  ·  ·  ·  .5 2  10  ·  ·  ·  ·  ·  ·  ·  ·  .4 6  11  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .3 8  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .2 6  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .2 6  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 4 .0 2 .0 6 .0 2 .0 4  .0 6 .0 4 .0 2 .0 6 .0 2 .0 2 ·  .10 ·  ·  ·  .0 4 .0 6  ·  .0 4 .0 2  .0 4 .0 2 .0 2 .0 2 .0 2 .0 2 .0 4  .0 6 .0 6 .0 4 .0 2 .0 4 .0 6 .0 2 .0 2 .0 2 .0 2 .0 4 .0 2 .0 2 .0 2 .0 4  .22 .0 8 .0 2 .0 2  ·  .0 4  ·  ·  .0 8 .0 2 .0 2  ·  .0 2 .0 4 .0 2  .14 .0 8 .0 4 .0 2 .0 2 .0 2 .0 2 .0 2 .10 .20  .18 .0 8 .0 2 .0 2  ·  .26 .10  ·  .0 4 .2 6  .12  ·  ·  · ·  .0 2  ·  .0 4  ·  .0 4 .0 4 .0 4  ·  ·  ·  .28 .12 .0 8 .0 8 .0 6 .0 6 .0 2 .30  · .0 2  .22 .0 6 .0 6 .10  ·  · ·  .32 .0 8 .12 .0 8 .0 8 .0 4  .0 2 .12  .48 .22 .14 .0 2  ·  .0 4 .10  .26 .26 .26 .0 8  (a ) G a p sta tistic . ˆ → K 2 Ktrue ↓ 2 .8 0  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .16 .0 2  3  ·  .9 0  4  .0 2  ·  .9 2  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .0 2  ·  ·  .9 0  .0 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .16 .10  .0 2  ·  .6 2  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  .14 .28 .0 4  ·  ·  .5 0  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  .32 .26 .0 2  ·  ·  .0 2 .3 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  .34 .32 .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .2 0  10  .30  .38 .0 4 .0 2 .0 2 .0 2 .0 2 .0 8 .12  11  .40  .44 .0 4  ·  ·  .0 2 .0 2 .0 4 .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  .58 .24 .12 .0 4  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  13  .56 .26 .10  ·  ·  ·  .0 2  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  14  .52 .40  .0 4 .0 2 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  .68 .30  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  16  .68 .22 .0 8  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  · .0 4  (b) G a p sta tistic , p red ic tio n u sin g S td B efo reB est.  Table 5.15: Sampling distribution on 2D shaped data with 100 points using the gap statistic. The performance of APW-Perm and APW-RC are similar to the 2 dimensional case, with some minor variations. The first variation is the number of cases in which APW-RC was far wrong and significantly underestimated the number of clusters; this was not a problem in 2 dimensions. The fl ip side is the APW-Perm tended, when there were more clusters, to overestimate the number of clusters slightly but was more reliable overall. The data perturbation method illustrates some of the same patterns as before; it seems to be fairly robust to changes in shape and dimension. The 114  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  .9 6  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  ·  .9 2  .0 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  .0 4 .8 0  .14  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 2 .5 6  ·  ·  ·  ·  .0 2  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .12  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .2 2  .26 .0 8 .0 4  .0 6 .3 4 ·  .48 .0 8 .0 2 .4 4  .32 .12 .0 4 .0 4 .0 4  .0 2 .0 4 .14  .36 .26 .12 .0 6  .32 .28 .14 .12 .0 2 .18 .24 .24 .0 8  (a ) APW-Perm w ith 1 0 0 p o in ts p er d a ta set ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  .9 9  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .0 1 .9 4  .0 4  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  .14  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  .0 3 .7 6  .21  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .0 2 .6 2  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 7 .6 0  .30  .0 2 .0 1  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .12 .4 2  .42 .0 2  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 3 .8 3  .34 .0 1  .0 2 .12 .4 0 ·  .34 .10  .0 7 .22 .3 7  .23 .0 9 .0 1 .0 1  (b) APW-Perm w ith 7 5 0 p o in ts p er d a ta set  Table 5.16: Sampling distribution of APW-Perm, StdBeforeBest, on 100D ANOVA data with various sample siz es. gap statistic, however, performs less well, as as the modeling assumptions it is based on do not hold. While it is still somewhat competitive with our method, its accuracy is still less than that to APW-Perm which has far fewer computational issues.  115  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .9 6 ·  9  ·  ·  ·  ·  .0 2  ·  10  ·  ·  ·  ·  ·  ·  .0 6 .9 2 ·  .20  11  .0 2  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  13  .0 2  ·  ·  ·  ·  ·  ·  ·  14  .12 .0 2  ·  ·  ·  ·  ·  ·  ·  15  .22  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .10  16  .16  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 2 .24 .26 .2 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .8 0  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .38 .5 6  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 8 .50  .3 4  .0 6  ·  ·  ·  ·  ·  ·  ·  .0 6  ·  ·  ·  ·  ·  ·  .0 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .12 .34 .4 4  .0 4 .16 .42 .16  .42 .10  .0 8 .0 4  .0 8 .0 2  (a ) APW-R C w ith 1 0 0 p o in ts p er d a ta set ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  .17 .8 3  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  .0 1 .23 .7 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 5  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .13 .43 .4 0  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1 .18 .36 .3 9  ·  ·  .0 1 .9 9  .0 6 .9 4  .0 2 .36 .6 2  .0 3 .36 .6 0  .0 5 .48 .4 2  .0 3 .0 1  .0 5 .0 1  (b) APW-R C w ith 7 5 0 p o in ts p er d a ta set  Table 5.17: Sampling distribution of APW-RC, StdBeforeBest, on 100D ANOVA data with various sample siz es.  5.4  Conclusions  In this chapter, we have demonstrated, in a large simulation, that our method can perform on par with or better than previous methods. We did this by testing our methods using a large simulation on two types of data, ANOVA and shaped, at several dimensions and sample siz es. We feel this test demonstrated well how and when the methods tested break down and when the appear to work well. We summariz e the results here across several categories. 116  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  .0 2  ·  ·  ·  ·  ·  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  .18 .7 2  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  .0 2  ·  ·  ·  ·  ·  ·  .0 2 .20  .5 2  .20  .0 2 .0 2  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .32 .3 6  .26 .0 2  ·  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 6 .24 .4 0  ·  ·  ·  ·  ·  14  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .10  16  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .9 6  .0 2 .9 8  .12 .8 2  ·  .26 .0 2  .26 .2 0  .30  .0 8 .0 4  ·  ·  ·  .38 .2 4  .12 .14  ·  ·  ·  .0 2 .12 .28 .3 0  .12 .0 8 .0 4 .0 2  Table 5.18: Sampling distribution of SS-Draws-VI on 100D ANOVA data with 100 points.  5.4.1  Incre a sing D im e nsion  The methods that performed will in lower dimensions did not always perform well as the dimension increased. Here are some sample observations: Ga p S ta tistic The predictor proposed with the gap statistic performs better and better as dimension increases, but doesn’t work as well in lower dimensions. However, in 2 dimensions, StdBeforeBest works much better but fails as the dimension increases. APW  APW-RC tended to perform quite well regardless of the dimension. APW-Perm performed poorly in lower dimensions but better with increasing dimension.  Da ta Pe rturb a tion Me th ods Data perturbation methods tended to be fairly consistent in their accuracy, ex hibiting the same problems and strengths regardless of dimension. In only one case out of 20, though, did one of the data perturbation methods win.  5.4.2  Cluste r S h a p e  Cluster shape presented different challenges, most notably to the methods making some modeling assumptions. Ga p S ta tistic The gap statistic undoubtedly is the most sensitive to cluster shape. It performs best in lower dimensional, low sample siz e 117  ˆ → K 2 Ktrue ↓ 2 1  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8  ·  ·  ·  .0 4 .9 6  9  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  .0 4 .9 2 ·  .0 4  11  ·  ·  ·  ·  ·  ·  ·  ·  .28 .5 6  12  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .28 .4 0  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .22 .44 .14  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8 .22 .54 .10  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 6 .28 .24 .3 2  .16 .8 2  .14 .0 2 .30  .26 .0 2 .5 2  .0 4 .0 4  .16 .0 2 .0 6  (a ) G a p sta tistic . ˆ → K 2 Ktrue ↓ 2 .9 6  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 2  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .0 2 .0 2  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .9 0  4  ·  ·  .9 4  .0 6  5  ·  ·  ·  .9 6  .0 2 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  .9 0  .0 8 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  .9 2  .0 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  ·  .9 2  .0 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  ·  .8 0  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  ·  .5 2  .0 2  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .3 4  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .4 0  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .12  .0 6  ·  .0 2  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .18 .0 2  .36 .10  .52 .12 .0 2  .32 .16 .12  .44 .26 .10  .38 .32 .12 .10 .14  ·  .16 .36 .28 .0 6  (b) G a p sta tistic , p red ic tio n u sin g S td B efo reB est.  Table 5.19 : Sampling distribution on 100D ANOVA data with 100 points using the gap statistic. ANOVA data, where its G aussian modeling assumptions hold. In all cases, it performs noticeably worse on the shaped data than on the comparable ANOVA data relative to the other methods. APW  APW-RC seems to ex hibit some dependence on the shape of the mix ture components in the generated data, something possibly implicit in the reclustering stage of forming the baseline. APW-Perm, however, seems to handle shaped data without any problems. In many such cases, it is the winner. 118  ˆ → K 2 Ktrue ↓ 2 .9 5  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 5  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  4  ·  ·  .9 0  5  ·  ·  ·  .9 4  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  .0 1  ·  ·  .9 4  .0 4  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  .0 2 .9 3  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  8  ·  ·  ·  ·  ·  .0 7 .8 5  .0 6 .0 1  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  .0 1 .20  .6 9  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  11  .9 1 .0 8 .0 1  .0 1 .0 1  .0 9 .0 1  .0 1 .39 .4 6  .13 .0 1  .0 4 .29 .4 6  .17 .0 2  12  ·  ·  .0 1  ·  ·  ·  ·  ·  .0 5 .34 .2 9  13  ·  .0 1  ·  ·  ·  ·  ·  ·  .0 2 .12 .33 .2 5  14  .0 1 .0 1  ·  ·  ·  ·  ·  ·  ·  .0 4 .15 .27 .2 4  15  .0 2 .0 2  ·  ·  ·  ·  ·  ·  ·  .0 1 .11 .24 .14 .16  16  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  .26 .0 3 .0 1 .0 1 .15 .0 9 .0 2  .17 .0 8 .0 3  .15 .11 .0 3 .0 1  ·  .0 2 .0 8 .29 .24 .11 .12 .0 8 .0 5  ·  (a ) APW-R C . ˆ → K 2 Ktrue ↓ 2 .9 6  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  .9 1 .0 6 .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 1  ·  ·  ·  ·  .39 .0 6 .0 4 .0 3  ·  .0 1  ·  ·  ·  ·  3  ·  .9 2  .0 6 .0 1  4  ·  ·  5  ·  ·  ·  .9 6  .0 4  6  ·  ·  ·  ·  .9 4  .0 4  7  ·  ·  ·  ·  ·  .9 4  8  ·  ·  ·  ·  ·  .0 1 .8 5  9  ·  ·  ·  ·  ·  ·  .0 2 .6 9  10  ·  ·  ·  ·  ·  ·  .0 1 .0 6 .4 5  11  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  .0 3 .15 .2 9  13  ·  ·  ·  ·  ·  ·  .0 1  ·  .0 2 .0 7 .15 .2 3  14  ·  ·  ·  ·  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .0 1  .11 .0 1 .0 1  .24 .0 4 .0 1  .39 .0 8 .0 1  .0 2 .0 8 .4 4  .34 .0 8 .0 3  .21 .23 .0 4  .0 1 .0 1 .0 5 .0 9 .19 .19 .0 1 .0 1 .0 6 .14 .10 ·  .0 2 .0 2  .19 .16 .0 7 .0 2 .0 1 .0 1 .20  .18 .0 7 .0 4 .0 2  .0 1 .0 3 .0 8 .17 .17 .0 9  .14 .17 .0 9 .0 5  .17  (b) APW-Perm.  Table 5.20: Sampling distribution of APW-RC and APW-Perm, StdBeforeBest, on 100D Shaped data with 750 points. Da ta Pe rturb a tion Me th ods Again, these methods seem fairly robust against changes in the data shape. This is what would be ex pected, as they make no modeling assumptions.  5.4.3  S a m p le S iz e  Ga p S ta tistic The gap statistic performed better relative to the other methods as sample siz e decreased if its modeling assumptions held 119  ˆ → K 2 Ktrue ↓ 2 .9 8  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .9 8  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .0 2  ·  ·  .9 4  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  .0 2  ·  ·  ·  .9 6  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  .0 2  ·  ·  ·  .0 4 .9 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .0 8 .8 2  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  .0 4 .26 .4 0  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  .0 2  ·  .0 4 .0 4  ·  ·  ·  ·  ·  .28 .2 4  ·  ·  ·  ·  ·  13  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  .0 4 .36 .10  ·  ·  ·  14  ·  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  15  .0 2 .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  16  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 4  ·  ·  .0 8 .8 6  .22 .0 2  .0 2 .28 .4 4  .24  .26 .0 8 .0 4  .18 .22 .0 6 .0 2  .0 8 .32 .2 8 .20  .18 .0 8 .0 2 .2 0  .24 .12 .0 8 .0 2  ·  .0 8 .12 .2 8  .26 .12 .0 2 .0 8  (a ) APW-R C . ˆ → K 2 Ktrue ↓ 2 .9 8  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .9 8  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  .8 8  .10  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  .9 4  .0 4  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  ·  .9 2  .0 4  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  8  ·  ·  ·  ·  ·  .0 2 .8 6  ·  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .14  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  12  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4  13  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .16 .0 6  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  .0 2 .0 8 .14  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 8 .10  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .0 6 .16  .0 8  .0 4 .8 2  .12 .4 0  .28 .10  .12 .4 4 ·  ·  .0 6 .0 4  .24 .0 8 .0 8  .14 .2 0  ·  .0 2 .0 2  .26 .22 .0 6 .0 4 .0 2 .0 2 .26 .22 .16 .12 .0 2  .24 .18 .16 .12 .0 2 .0 2 .24 .20  .24 .12 .0 2  .26 .14 .10  .22  (b) APW-Perm.  Table 5.21: Sampling distribution of APW-RC and APW-Perm, StdBeforeBest, on 100D Shaped data with 100 points. (i.e. on ANOVA data), but performed worse if they didn’t (i.e. on shaped data). APW  APW-RC tended to perform the same relative to everyone else in decreasing sample siz e, but the performance of APW-Perm decreased. This is to be ex pected, as the permutation tests in APW-Perm work better with more data.  Da ta Pe rturb a tion Me th ods These methods, while performing slightly 120  ˆ → K 2 Ktrue ↓ 2 1 3  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .9 6  4  ·  ·  .9 8  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  .0 2  ·  ·  .9 8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  .0 2 .9 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  8  ·  ·  ·  ·  9  ·  ·  ·  .0 2 .0 2  .0 4 .9 4  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  .8 4  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .16 .7 8  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .36 .4 0  .12  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  .0 4 .0 2  ·  ·  ·  ·  11  .0 2  ·  ·  ·  ·  ·  ·  .0 6 .34 .4 4  .14  ·  ·  ·  ·  ·  ·  ·  ·  12  .10  ·  ·  ·  ·  ·  ·  .0 2 .18 .30  .3 0  .10  ·  ·  ·  ·  ·  ·  ·  13  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  .26 .38 .2 2  ·  ·  ·  ·  ·  14  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .20  ·  ·  ·  ·  15  .0 4  ·  ·  ·  ·  ·  ·  ·  ·  ·  .10  ·  ·  ·  ·  16  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 4 .12 .28 .32 .2 0  .0 2  ·  ·  ·  .0 8 .0 4  .38 .2 8  .0 6 .0 2  .26 .36 .2 0  .0 4  Table 5.22: Sampling distribution of SS-Draws-VI on 100D Shaped data with 100 points. ˆ → K 2 Ktrue ↓ 2 .8 4  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  .10  .0 6  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  3  ·  .8 2  .16  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  4  ·  ·  .7 0  .30  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  5  ·  ·  ·  .7 2  .24 .0 2  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  6  ·  ·  ·  ·  .8 2  .16  ·  .0 2  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  7  ·  ·  ·  ·  .0 4 .8 0  .16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  8  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  9  ·  ·  ·  ·  ·  ·  .10  ·  ·  ·  ·  ·  ·  ·  ·  ·  10  ·  ·  ·  ·  ·  ·  .0 2 .30  .0 2 .0 2  ·  ·  ·  ·  ·  11  ·  ·  ·  ·  ·  ·  ·  .0 2  ·  ·  ·  ·  ·  12  ·  ·  ·  .0 2  ·  ·  .0 2  ·  .0 6 .14 .4 0  .26 .0 8 .0 2  ·  ·  ·  ·  ·  13  ·  ·  ·  ·  ·  ·  .0 2  ·  .0 6 .12 .32 .2 2  ·  ·  ·  ·  14  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 6 .0 8 .32 .2 0  ·  ·  ·  15  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 6 .24 .24 .14  .24 .0 6 .0 2  ·  ·  16  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  .0 2 .0 6 .28 .20  .2 4  .16 .0 4  ·  ·  .0 6 .7 6  .16 .0 2 .7 6  .12 .0 2 .3 8  .18 .0 8  .0 4 .18 .5 2  ·  .22 .0 2  · .10  .12 .0 4  .16 .14 .0 2  Table 5.23: Sampling distribution of the gap statistic on 100D Shaped data with 100 points. worse, did not ex hibit a substantial drop in performance.  5.4.4  O th e r Consid e ra tions  It is worth noting that we compared the analytically weakest tool from our method – a scalar stability index – against the other methods. While our method provides many more analysis tools such as the heatmap plot and per-cluster stability indices, any additional tools from the other methods 121  provide limited additional useful information. F urthermore, while we did not ex plicitly test the other stability statistics provided by our method, the fact that the scalar statistic summariz ing all of them performs so well indicates that the more informative parts will as well. F uture work in this matter includes testing our method ex tensively on real data. This includes analysis of several classification datasets where the classification labels are hidden. A good clustering method combined with a good cluster validation technique should be able to reproduce the labels with a reasonable amount of accurately provided the original classes were well separated.  122  Ch a p te r 6  L im its in H ig h D im e nsions While developing a full theory surrounding our proposed approach to cluster stability is a diffi cult task, several relevant results can be obtained. In this chapter, we describe a number of results concerning the theoretical properties of our method in high dimensions. In particular, we look at the assymptotics of our method as p → ∞. While not perhaps applicable to the more common low dimensional uses, this gives us an idea about how our method is likely to perform on high dimensional data. The chapter is divided logically into two parts. The first concerns the limits and properties clusterings based on the squared error cost function as p → ∞. In particular, we show that in the presence of noisy data, the empirical variance and covariance of the nonrandom components of the √ points in the partition must grow at a minimum rate of Ω p in order for the cost function to be meaningful. The most telling corollary to this result is that if there are only a finite number of meaningful, non-random components, and the rest are noisy, feature selection must be employed. Having established the limits of clustering as the dimension increases, we then look at the properties of our method in the same situation. Our general strategy is to show that when the clustering breaks down, so does our method, and that our method may give reasonable results when the clustering is meaningful.  6 .1  L im its of Cluste ring in H ig h D im e nsions  To investigate the limits of clustering in high dimensions, we use the widely accepted model of real process where the true values of the data are obscured by a noisy random variable, i.e. the observed variable Y is related to the  123  true value x by Y =x+ε  (6.1)  where ε is a random variable having a distribution with a mean of z ero and non-z ero variance. U sing this model, we intend to show that clustering based on the vanilla version of the cost function breaks down in high dimensions √ p . unless the distance between the true data points grows at a rate of Ω In many contex ts, this implies that feature selection is absolutely necessary if a clustering of the data is going to be meaningful. To prove this result, we first need formal definitions of partitions on the data, the cost function, and the noise distribution ε.  6 .1 .1  P a rtitions  A partitioning, or clustering, on a set of n data points is just a grouping of those points into K non-overlapping sets, or partitions. F or our purposes, we restrict ourselves to non-empty partitions, so each partition has at least one point assigned to it. Also, as a matter of notation, our partitions are defined in terms of the indices 1, 2, ..., n. We present this formally in the nex t definition: De fi nition 6 .1.1. Pa rtitioning G iven n points and a number of clusters K, a partitioning P = {P1 , P2 , ..., PK }  (6.2)  is a set of K subsets of {1, 2, ..., n} such that A. ∀ k ∈ {1, 2, ..., K}, Pk = ∅. B. ∀ i, j ∈ {1, 2, ..., K}, i = j, Pi ∩ Pj = ∅. C . ∪K k=1 Pk = {1, 2, ..., n}. In other words, the sets are non-empty, non-overlapping, and all indices are in at least one set. De fi nition 6 .1.2. C la ss of Pa rtitionings L et PnK be the set of all partitionings P of n points into K partitions.  124  6 .1 .2  P rop e rtie s of T h e S q ua re d E rror Cost F unction  We use the same cost function that is used in many clustering algorithms, namely the squared error cost function. Recall the definition of squared error cost from section 1.1.3: n  cost X (p) = i=1  xi − µ  2 2  (6.3)  where 1 µ= n  n  xi .  (6.4 )  i=1  The corresponding version for partitions is just cost X (p) , P =  cost k  xi ∈ X (p) : i ∈ Pk  .  (6.5)  This cost function is the basis for our investigation of clustering in high dimensions. To proceed, we need to first need to formally note a few properties of the cost function. These provide the basis for our result of clustering in high dimensions. T h e ore m 6 .1.3 . E xp e c te d C ost Let E (p) = {ε1 , ε2 , ..., εn } be a set of n points such that for i = 1, 2, ..., n, A. E εi = 0. B. E ε2i = σ 2 , 0 < σ ≤ C < ∞. Suppose P ∈ PnK . Then E cost E (p) , P = (n − K)σ 2  (6.6)  1 |Pk |  (6.7)  Proof. L et Mk =  εi i∈Pk  be the mean of each partition. We proceed by ex panding out the cost function: 125    E cost E (p) , P = E     K k=1 i∈Pk  K  = k=1 i∈Pk  (εi − Mk )2   E ε2i − 2 E [εi Mk ] + E Mk2 .  (6.8)  (6.9 )  Since the partitions of P are disjoint, there are ex actly n terms of the form E ε2i in total, with total ex pected value nσ 2 : K  E cost E (p) , P = nσ 2  k=1 i∈Pk  −2 E [εi Mk ] + E Mk2  (6.10)  Now for each partition Pk ,    2 1 E [εi Mk ] = E  εi + |Pk | = E ε2i +  1 |Pk |    j∈Pk j=i   εi εj    ( E εi )( E εj )  (6.11)  (6.12)  j∈Pk j=i  = σ2  (6.13)    (6.14 )  F urthermore,  E Mk2    1 1 = E  εi   εj   |Pk | |Pk | i∈Pk j∈Pk      1 = E  |P |2 k  i∈Pk   2 εi +   j∈Pk j=i    1  = E ε2i + 0 |Pk |2 i∈P    εi εj    (6.15)  (6.16)  k  =  1 2 σ |Pk |  (6.17) 126  Thus we have that K  E cost E (p) , P = nσ 2 +  k=1 i∈Pk  −  2σ 2 σ2 + |Pk | |Pk |  (6.18)  K  = nσ 2 + k=1 2  −σ 2  (6.19 )  = (n − K)σ  (6.20)  which completes the proof. Often, in our analysis, we are looking at the difference of two cost functions. F or that, we have the following handy corollary. C orolla ry 6 .1.4 . E xp e c te d Diff e re nc e of C ost F unc tions Let E (p) = {ε1 , ε2 , ..., εn } be a set of n points such that for i = 1, 2, ..., n, A. E εi = 0. B. E ε2i = σ 2 , 0 < σ ≤ C < ∞. Suppose P, Q ∈ PnK . Then E cost E (p) , P − cost E (p) , Q  =0  (6.21)  Proof. The proof follows from the previous theorem. E cost E (p) , P − cost E (p) , Q  = (n − K)σ 2 − (n − K)σ 2  (6.22)  =0  (6.23)  We have one final set of results concerning properties of squared error cost of random variables. The following theorems, while perhaps intuitively obvious, allow us to formally prove the impossibility theorem in the nex t section. The main idea is to prove that the probability that the cost of two nonrandom partitions over independent random variables is always less than one. This is a necessary step in applying the central limit theorem as part of the impossibility theorem.  127  L e m m a 6 .1.5 . Let X (p) = x1 , x2 , ..., xn be a set of p points, and suppose P, Q ∈ PnK , P = Q. Then there ex ists a sy m m etric n × n m atrix B = [bij ] such that cost X (p) , P − cost X (p) , Q = xT Bx  (6.24 )  w here x = (x1 , x2 , ..., xn ). Proof. Recall that the squared error cost function can be ex pressed in terms of the difference between points (theorem 1.1.5): cost E (p) , P =  1 2  k  1 |Pk |  x2i −  = i  i,j∈Pk  k  (xi − xj )2  1 |Pk |  xi xj  (6.26)  i,j∈Pk  aPij xi xj  =  (6.25)  (6.27)  i,j  where   (|Pk | − 1)/ |Pk | − 1/ |Pk | aPij =  0  i = j; i, j ∈ Pk ∃ k s.t. i, j ∈ Pk , i = j otherwise  (6.28)  The same is true for partition Q. We can re-ex press this above in matrix form, where AP = aPij and likewise for AQ . cost E (p) , P = xT AP x  (6.29 )  cost E (p) , Q = xT AQ x  (6.30)  Note that aPij = aPji , so AP and AQ are symmetric. The difference between the two cost functions can then be ex pressed as: cost E (p) , P − cost E (p) , Q = xT AP x − xT AQ x T  = x (AP − AQ )x  Thus the theorem is proved. 128  (6.31) (6.32) (6.33)  Now that we have shown these basic properties of the squared error cost function, we just need to formaliz e our notion of noisy data. We will then be ready to present our main clustering impossibility theorem.  6 .1 .3  N oisy D a ta  In this section, we present a formal definition of noisy data, or, more correctly, the class of noise generating distributions. De fi nition 6 .1.6 . Noise Ge ne ra ting Distrib utions L et F be the set of all sequences of distributions F = (F1 , F2 , ...) that satisfy the following properties: A. ∀ q ∈ {1, 2, 3, ...}, if εq ∼ Fq , then E εq = 0. B. ∀ q ∈ {1, 2, 3, ...}, if εq ∼ Fq , then E ε4q = ρq , 0 < Lq ≤ ρq ≤ Uq < ∞. The fourth moment is necessary to govern the convergence of the second moment of the squared error cost function. Note that such a constraint provides similar bounds on all the lower order moments. One of the key steps in the theorem is showing that a sequence of similarly defined random variables functions converges to a normal distribution centered at 0. Such a result is significant in its own right, so we present it, and another useful lemma, in the following section.  6 .1 .4  L e m m a s Conce rning th e Ce ntra l L im it T h e ore m  L e m m a 6 .1.7 . Let Zq , q = 1, 2, ... be a seq uence of rand om v ariables w ith E Zq = rq , w here rq is a non-rand om seq uence such that 1 √ p Zq2  p  q=1  rq → 0 as p → ∞.  (6.34 )  σq2 ,  Suppose E = 0 < L ≤ σq ≤ U < ∞. Let Sq = Z1 + Z2 + · · · + Zq and c2q = σ12 + σ22 + · · · + σq2 Then Sn D −→ ϕ as n → ∞ cn w here ϕ is a rand om v ariable hav ing a stand ard norm al d istribution.  129  (6.35)  Proof. L et Fq be the distribution function of Zq − rq . The proof follows from the L indeberg-F eller conditions on the central limit theorem [ADD9 9 ]. L et Fq be the distribution function of Zq , and let n  rq  µn =  (6.36)  q=1  If, ∀ ε > 0, n  1 LFn = 2 cn  {x : |x−rq |≥ ε cn }  q=1  (x − rq )2 dFq (x) → 0 as n → ∞,  (6.37)  then L indeberg’s theorem [ADD9 9 ] gives us Sn − µn D −→ ϕ as n → ∞ cn  (6.38)  We later show that the µn term drops out in the limit. The strategy now in showing that this condition is satisfied is to upper bound the ex pression and show that this bound goes to z ero. L et s2n = mean σq2 . Then q∈{1,2,...,n}  LFn = ≤  n  1 2 sn n 1 s2n n  √ {x : |x−rq |≥ ε s n n}  q=1  n  max  q∈{1,2,...,n}  (x − rq )2 dFq (x)  {x : |x−rq |≥  √  ε s n n}  (x − rq )2 dFq (x).  (6.39 ) (6.4 0)  Now we know that σq is bounded below by L, and the integral is always positive, so 1 L2 1 ≤ 2 L  LFn ≤  max  q∈{1,2,...,n} {x : |x−rq |≥ ε cn }  max  q∈{1,2,...,n}  {x : |x−rq |≥ √ Now for all rq , {x : |x − rq | ≥ εL n} {x : |x−rq |≥  εL  √  n}  εL  (x − rq )2 dFq (x)  √  n}  (x − rq )2 dFq (x)  (6.4 2)  ∅ as n → ∞, so ∀ q,  (x − rq )2 dFq (x) → 0 as n → ∞. 130  (6.4 1)  (6.4 3)  Thus 1 L2  max  q∈{1,2,...,n}  {x : |x−rq |≥  εL  √  n}  (x − rq )2 dFq (x) → 0 as n → ∞.  (6.4 4 )  so Sn − µn D √ −→ ϕ as n → ∞. nsn  (6.4 5)  Note that this is equivalent to equation (6.38). However, by assumption, 1 µn √ → 0 as n → ∞, sn n  (6.4 6)  so equation (6.4 5) reduces to Sn D −→ ϕ as n → ∞. cn  (6.4 7)  proving the lemma. L e m m a 6 .1.8 . Let Zq be a seq uence of rand om v ariables such that p  q=1  Zq D √ −→ ϕ σ p  (6.4 8)  w here ϕ ∼ N (0, 1) and σ is som e constant. Then as p → ∞, p  P  q=1  Zq ≥ 0  1 . 2  →  (6.4 9 )  Proof. We can rewrite the ex pression in equation (6.4 9 ) as P  p q=1  1 Zq ≥ 0 = P √ p  p q=1  Zq ≥ 0  (6.50)  However, by assumption, p  q=1  Zq D √ −→ ϕ. σ p  (6.51)  Thus by definition of convergence in distribution, P  p q=1  Zq ≥ 0 131  → P(ϕ ≥ 0).  (6.52)  But ϕ comes from a distribution that is symmetric about 0, so P(ϕ ≥ 0) =  1 2  (6.53)  and the statement is proved. We now have everything in place to present our clustering impossibility theorem.  6 .1 .5  Im p ossib ility T h e ore m for Cluste ring in H ig h D im e nsions  We now present our result limiting the meaningfulness of clustering in high dimensions in the following theorem. T h e ore m 6 .1.9 . L im its of C luste ring in H igh Dim e nsions. (p) (p) (p) Suppose X (p) = x1 , x2 , ..., xn , p = 1, 2, .... is a g iv en seq uence of (p) (r) sets of n points w ith increasing d im ension p and such that xq = xq for all p = r and q ≤ min {p, r}. Suppose there ex ists a seq uence of points u(p) so that ∀ i, j ∈ {1, 2, ..., n}, √ (p) (p) (xi − u(p) )T (xj − u(p) ) ∈ o( p)  (6.54 )  N ex t, suppose Fq ∈ F, q = 1, 2, ... is a seq uence of noise g enerating d istri(p) (p) butions. Let εiq ∼ Fq for i = 1, 2, ..., n, w ith εi = (εi1 , εi2 , ..., εip ), and d efi ne (p)  Yi Y  (p)  (p)  (p)  = x i + εi =  (6.55)  (p) (p) (Y1 , Y2 , ..., Y(p) n ).  (6.56)  F inally , suppose P, Q ∈ PnK , P = Q, are tw o d iff erent partitions of n points. Then P cost Y (p) , P ≤ cost Y (p) , Q  →  1 2  (6.57)  as p → ∞. Proof. The structure of the proof is to show that the difference the cost √ functions two cost functions divided by p converges to a standard normal distribution as p → ∞. A random variable from this distribution has 132  equal probability of being negative or positive, so one cost function has a probability of 21 of being greater than the other. We begin by setting up the notation and re-ex pressing the difference in cost functions, then show that the terms either drop out or together converge in distribution to a normal. Continuing, recall that P = (P1 , P2 , ..., PK )  Q = (Q1 , Q2 , ..., QK )  (6.58) (6.59 )  L et u(p) be a sequence in p of p-dimensional points1 such that ∀ i, j ∈ {1, 2, ..., n}, √ (p) (p) (xi − u(p) )T (xj − u(p) ) ∈ o( p),  (6.60)  as guaranteed by the assumptions. F or convenience, define (p)  X Y  (p)  (6.61)  (p)  = xi − u(p) =  (6.62)  (p) i q  =  xi  Y  =  (p) (p) (p) x 1 , x 2 , ..., x n (p) (p) x i + εi (p) (p) (p) (Y 1 , Y 2 , ..., Y n )  (6.63) (6.64 )  By corrolary 1.1.7, the cost function is invariant to such shifts, so q  (6.65)  q  (6.66)  cost Y , P = cost Y (p) , P cost Y , Q = cost Y (p) , Q  Thus proving the theorem for the shifted version is suffi cient. Now, for convenience, define the following terms: 1  T h e u(p) terms h ere a llo w u s to a c c o mid a te seq u en c es o f p o in ts w ith a n o n -z ero mea n .  133  nP k = |Pk |  nQ k  (6.67)  = |Qk |  (6.68) (p)  MP k = mean εi i∈Pk  MQ k µP k  =  (p) mean εi i∈Qk  = mean x i∈Pk  =  1 nP k  (p)  µQ k = mean x i i∈Qk  (p)  (6.70)  εi i∈Qk  1 = P nk =  (6.69 )  i∈Pk  1 = Q nk  (p) i  (p)  εi  1 nQ k  (p)  (6.71)  (p)  (6.72)  xi i∈Pk  xi i∈Qk  By theorem 1.1.3, the cost function is separable into components. F urthermore, by assumption, these components are identical for different p. Thus we can define a sequence of random variables Zq as the cost difference of the qth component. L et q  q  Zq = cost Y q , P − cost Y q , Q .  (6.73) p q=1 Zq  Note that given this definition, if we can show that P (p) , P  1 2  (p) , Q  ≥0 → 0  as p → ∞, then P cost Y ≥ cost Y → and the theorem is proved. Our strategy is to do ex actly this, relying heavily on lemma 6.1.7 and lemma 6.1.8. Now lemma 6.1.5 allows us to re-ex press the differences in cost functions as a matrix equation. L et B be a symmetric matrix such that Zq = y:q  T  q  q  By:q = cost Y q , P − cost Y q , Q  (6.74 )  Note that B depends only on the partitioning, so it will be the same for all q. We can ex pand this ex pression out: Zq = y:q  T  (6.75)  By:q  = (x:q + ε:q )T B(x:q + ε:q ) T  = x:q Bx:q +  εT:q Bε:q 134  (6.76) T  − 2x:q Bε:q  (6.77)  We now show that Zq satisfies the assumptions of lemma 6.1.7 and the theorem follows directly. The ex pectation of Zq is just T  T  E Zq = x:q Bx:q + E εT:q Bε:q − 2x:q B( E ε:q )  (6.78) (6.79 )  Now E εT:q Bε:q = E cost E (p) , P − cost E (p) , Q  =0  (6.80)  by corrolary 6.1.4 . This leaves us with T  T  E Zq = x:q Bx:q − 2x:q B( E ε:q ) T  = x:q Bx:q  (6.81) (6.82)  as E εiq = 0 ∀ i, q. Now, to satisfy the conditions of lemma 6.1.7, we need to show that 1 √ p  p T  q=1  x:q Bx:q → 0 as p → ∞.  (6.83)  Now we have by assumption that x Tx √ → 0 as p → ∞ p  (6.84 )  so also max bij i,j  x Tx √ → 0 as p → ∞. p  (6.85)  Thus equation (6.83) is proved. We now show that 0 < L ≤ E Zq2 ≤ U < ∞, which is the other condition in lemma 6.1.7. Since Var [X] = E X 2 − ( E X)2 , showing that the variance is lower and upper bounded is suffi cient to say that the second moment is also upper and lower bounded. This allows us to ignore the nonrandom x:q T Bx:q term as it does not affect the variance. Now 135  T  εT:q Bε:q − 2x:q Bε:q = (ε:q − 2x:q )T Bε:q = i,j  bij (εiq − 2xiq )εjq  (6.86) (6.87)  F or i = j, εiq and εjq are independent, and Var [εiq ] ≤ U < ∞ for all i and q, so ∃ U such that Var [(εiq − 2xiq )εjq ] ≤ U < ∞  (6.88)  If i = j and bij = 0, we have by assumption that E ε2iq ≤ U < ∞, so that term is bounded. F urthermore, B = 0, so bij = 0 for at least one pair of indices i and j, and Var [εiq ] ≥ L > 0 for all i and q. Because upper bounds ex ist for each term in equation (6.87), and a lower bound ex ists for at least one term, there ex ist bounds L and U such that T  (6.89 )  ⇔ 0 < L ≤ Var [Zq ] ≤ U < ∞  (6.9 0)  0 < L ≤ Var εT:q Bε:q − 2x:q Bε:q ≤ U < ∞ Thus the conditions of lemma 6.1.5 are satisfied. F rom lemma 6.1.5, we thus have that 1 √ p where s2p =  p q q=1 σq .  p  q=1  Zq D −→ ϕ as p → ∞ sp  (6.9 1)  This also satisfies the conditions of lemma 6.1.8; thus P  p q=1  Zq ≥ 0  →  1 . 2  (6.9 2)  But p  q=1  Zq = cost Yq(p) , P − cost Yq(p) , Q ,  (6.9 3)  so the theorem is proved. To relate the above theorem to several practical situations, we also present the following corollary: 136  C orolla ry 6 .1.10. Let X (p) be a p d im ensional d ataset w ith n com ponents d raw n from the sam e rand om process in w hich only a fi nite set of d im ension com ponents C ⊂ {1, 2, ...} are non-rand om . Let the com ponents C hav e no noise, and let the com ponents C C be d raw n from a noise g enerating d istribution. Let P be the optim al partitioning of the d ata consid ering only the non-rand om com ponents, and let Q be any other partitioning . Then P cost X (p) , P − cost X (p) , Q ≤ 0  →  1 2  (6.9 4 )  as p → ∞. Proof. The proof follows directly from theorem 6.1.9 . Because there are only finitely many non-noisy components of X (p) , we have trivially that (p) T  xi (p)  (p)  xj  √ ∈ o( p).  (6.9 5)  (p)  for all xi , xj ∈ X (p) . F urthermore, by the K olmogorov 0-1 law, setting finitely many components of the sequence cost Y (p) , P − cost Y (p) , Q , p = 1, 2, ...  (6.9 6)  to 0 cannot affect the convergence of the series. The theorem is proved.  6 .2  B e h a v ior of th e A v e ra g e d A ssig nm e nt M a trix  This section concerns general results on the partial membership matrix , i.e. what happens under various assumptions on the inputs. F or convenience, we sometimes use the functional notation for Φ (equation (3.6)). Restricting ourselves to hard clusterings, we have  φij = ψj (di , θ) =  I{dij λj ≤ di λ ∀ = j}π(λ; θ)dλ  (6.9 7)  Alternately, we ex tend some of our proofs to a more general form of Φ, where  φij = ψj (di , gi , θ) =  I{dij λj + gij ≤ di λ + gi ∀ = j}π(λ; θ)dλ (6.9 8)  137  The advantage of using this notation is that it can also apply when the priors between different λ ’s differ by a scaling parameter and a constant. In other words, if D  π (λ ) = π  λ −c β  (6.9 9 )  for some constants β and c , then  1dj λj ≤d  π (λ )dλ  λ ∀ =j  =  1dj βj λj +cj dj ≤d  β λ +c d ∀ =j  π(λ )dλ  (6.100)  which differs only notationally from equation (6.9 7). Thus the proofs that use this form will also apply to the location ex ponential prior and the shifted G amma prior from chapter 3 where the slope or scale parameters differ between clusters. Because our proofs often treat di , the input vector of point-to-centroid distances, as a random vector Di , we denote the random form of φij as Hij . F ormally, Hij = ψ(Di , Gi , θ)  (6.101)  Alternately, we generaliz e some of the results to the case when ψj takes both a distance measure di and additive parameters gi .  6 .2 .1  Conv e rg e nce of Φ a s a function of R a nd om V a ria b le s  We present here a lemma proving that the entries of our partial membership matrix Φ, as given by equation (3.1), converge asymptotically to nonrandom values under mild assumptions on the inputs. We use more general version of Φ: Φ = [φij ]i=1,...,n;j=1,...,K φij = Λ  I{dij λj + gij ≤ di λ + gi ∀ = j}π(λ)dλ  (6.102)  F or this lemma, we replace the d’s and g’s with converging sequences of random variables and show that φij will also converge. F or notational 138  conciseness, we omit the i index in the following proof; the result applies for each row of Φ. L e m m a 6 .2.1. Suppose Dj 1 , Dj 2 , ... and Gj 1 , Gj 2 , ..., j = 1, 2, ..., K, are seq uences of rand om v ariables such that Djp P −→ δj pα Gjp − G p P −→ γj pα  (6.103) (6.104 )  as p → ∞ for som e δj = 0, γj ∈ R , and α ∈ R . Let Hjp =  I[λj Djp + Gjp ≤ λ D p + G  ∀ = j]π(λ)dλ.  (6.105)  I[λj δj ≤ δ λ + γj ∀ = j]π(λ)dλ  (6.106)  p  Then if 0 ≤ π(λ) ≤ C < ∞ ∀ λ ∈ R K , P Hjp −→  as p → ∞. Proof. L et S j be the function taking two K × K matrices as input and returning a set on R K that is defined by S j [yj ]j,  , [zj ]i,j=1,2,...,K = {λ : λj ≤ yj λ + zj ∀ = j} . (6.107) We can rewrite the indicator function in the definition of Hjp to use this function: =1,2,...,K  I{λj Djp + Gjp ≤ λ D p + G p }  Dp G p − Gjp + Djp Djp G p − Gjp Dp , = I λ ∈ Sj Djp Djp = I λj ≤ λ  Let 139  (6.108) (6.109)  Yp = Zp =  Dp Djp  (6.110) i,j=1,2,...,K  G p − Gjp Djp  .  (6.111)  i,j=1,2,...,K  T h en π(λ)dλ.  Hjp =  (6.112)  S j (Y p ,Zp ) λ∈S  N ow let Y∞ = [ δ /δj ]  (6.113)  Z∞ = [ γj /δj ]  (6.114 )  S in c e Djp > 0 for j = 1, ..., K, Yp = as p →  −→P  [ δ /δj ] = Y∞  (6.115 )  ∞ . S im ila rly ,  Zp = as p →  D p /pα Dp = Djp Djp /pα  G p − Gjp (G p − Gjp )/pα = Djp Djp /pα  ∞ . N ow π(λ) ≤ C < ∞  −→P  [ γj /δj ] = Z∞  (6.116)  ∀ λ ∈ R K , so th e fu n c tion π(λ)dλ  f (Yp , Zp ) =  (6.117 )  S j (Yp ,Zp ) λ∈S  is a c on tin u ou s fu n c tion of th e en tries in Yp a n d Zp . F u rth erm ore, sin c e π(λ)dλ = 1, Hjp ≤ 1. T h u s w e h a v e Hjp = S j (Y p ,Zp ) λ∈S  as p →  π(λ)dλ −→P  ∞ , w h ich c om p letes th e p roof.  14 0  π(λ)dλ S j (Y ∞ ,Z∞ ) λ∈S  (6.118 )  6.3  Asymptotic Behavior of Φ in High Dimensions  In th is sec tion , w e b oth sh ow th e th eoretic a l lim its of ou r m ea su re in h ig h dim en sion s u n der c erta in c on dition s. E ff ec tiv ely , w e a rg u e th a t if th e c lu sterin g is n ot sen sib le, th en ou r m eth od w ill p rodu c e a u n iform resu lt, in dic a tin g n o sta b ility . U ltim a tely , w e ex a m in e th e c a se w h ere w e h old th e n u m b er of p oin ts n fi x ed a s w e let th e dim en sion p in w h ich th e p oin ts a re loc a ted in c rea se to in fi n ity . T h is ty p e of a ssy m p totic s h a s rec eiv ed som e a tten tion in th e litera tu re [H M N 05 , H AK 00]; h ow ev er, it is fa irly lim ited. P erh a p s th e la ck of a tten tion to th is ty p e of a ssy m p totic s is du e in p a rt to th e fa c t th a t n p oin ts lie in a su b sp a c e of dim en sion a t m ost n − 1, w h ich , in m a n y c a ses, m ea n s th e resu lts of su ch a ssy m p totic s is n ot a p p lic a b le. H ow ev er, in ou r c a se, w e a re b u ildin g ou r p roofs on dista n c es b etw een p oin ts, so a ll of th em h old u n der a n a rb itra ry rota tion of th e c oordin a te sy stem . T h is in c lu des th e P C A rota tion , w h ich p rojec ts th e orig in a l n p-dim en sion a l p oin ts, w h ere p ≥ n, in to a n − 1 dim en sion a l su b sp a c e. A N O V A T y p e D a ta F or th ese resu lts, w e m a k e th e AN O V A a ssu m p tion . S p ec ifi c a lly , w e a ssu m e th a t th e da ta is distrib u ted a c c ordin g to a m ix tu re m odel of K G a u ssia n s defi n ed b y a set of c en ters µ1 , µ2 , ..., µK a n d a c orresp on din g set of dia g on a l c ov a ria n c e Σ1 , Σ2 , ..., ΣK m a tric es, w h ere Σj = dia g σj2 1 , σj2 2 , ..., σj2 p a n d p is th e dim en sion of th e sp a c e. In ex a m in in g th e b eh a v ior of th e p a rtia l m em b ersh ip m a trix Φ a s th e dim en sion p in c rea ses, w e ex a m in e th e b est c a se sc en a rio reg a rdin g th e c lu sterin g . W e a ssu m e th a t th e c lu sterin g a lg orith m p erfec tly p a rtition s th e da ta set – th a t is, w e a ssu m e th a t ev ery p oin t in a p a rtic u la r p a rtition elem en t h a s b een dra w n from th e sa m e m ix tu re m odel c om p on en t, a n d ev ery p oin t dra w n from th e sa m e c om p on en t en ds u p in th e sa m e p a rtition . B ec a u se of th is on e-to-on e a ssu m p tion , w e refer to a ll th e p oin ts dra w n from a c om p on en t a s a c lu ster. W h ile b oth th e AN O V A a ssu m p tion a n d th is a re fa irly stron g a ssu m p tion s, a n d u n rea listic in m a n y c on tex ts, th ey do a llow u s to dem on stra te som e in terestin g p rop erties of ou r sta b ility m ea su re. T h rou g h ou t th is sec tion , w e u se th e n ota tion defi n ed in ta b les (6.1) a n d (6.2). T h e fi rst ta b le defi n es g en era l c lu sterin g n ota tion , w h erea s th e sec on d refers sp ec ifi c a lly to th e AN O V A sty le m ix tu re m odels w e b a se ou r resu lts on . 14 1  p (p) Xi Xiq N K Cj nj C i j,  T h e dim en sion . A p-dim en sion a l da ta p oin t. (p) T h e qth c om p on en t of Xi . T h e n u m b er of da ta p oin ts in tota l. T h e n u m b er of m ix tu re m odel c om p on en ts. T h e set of th e in dic es for a ll p oin ts dra w n from c om p on en t j. T h e n u m b er of p oin ts in c lu ster j. A m a p p in g fu n c tion th a t g iv es th e in dex of th e p oin t a c lu ster is a ssig n ed to, i.e. C (i) = j s.t. i ∈ Cj . In dex u sed to in dex da ta p oin ts. In dic es u sed to in dex c lu sters or m ix tu re m odel c om p on en ts.  T a b le 6.1: N ota tion a l c on v en tion s u sed for a sy m p totic p roofs of m ix tu re m odels in in c rea sin g dim en sion . µ(p ) j (p)  µjq 2 σjq 2 τjq  τj2 ¯ (p) X j  T h e m ea n of th e jth m ix tu re c om p on en t in th e u n derly in g distrib u tion . T h e m ea n of th e qth dim en sion of jth m ix tu re c om p on en t in th e u n derly in g distrib u tion . T h e v a ria n c e of th e qth dim en sion of th e jth m ix tu re c om p on en t. T h e a v era g e v a ria n c e of th e fi rst q dim en sion s of th e 2 jth m ix tu re c om p on en t, eq u a l to 1q qq =1 σjq 2 T h e lim it of τjq a s q → ∞ . T h e em p iric a l m ea n of th e jth c lu ster in p dim en sion s, (p) eq u a l to n1j i∈Cj Xj .  T a b le 6.2: N ota tion a l c on v en tion s u sed for a sy m p totic p roofs of AN O V A m ix tu re m odels.  6.3.1  Asymptotic Behavior of Φ w ith AN O V A-type M ix tu re M od els  T h e follow in g th eorem s desc rib e th e a sy m p totic b eh a v ior of Φ u n der th e AN O V A a ssu m p tion , desc rib ed a b ov e, a s p → ∞ . W e a ssu m e th e n ota tion a l defi n ition s a s g iv en in ta b les (6.1) a n d (6.2). Also, for th ese resu lts, w e trea t (p) th e elem en ts in Φ a s ra n dom v a ria b les, so w e defi n e Hij to b e th e resu lt of 14 2  (p)  (p)  Hij =  RK λ∈R  (p)  π(λ) I λj Dij ≤ λ Di  ∀ = j dλ.  (6.119)  2  (6.120)  w h ere (p)  (p)  (p)  ¯ Dij = Xi − X j  is th e dista n c e b etw een th e ith da ta p oin t a n d th e jth c lu ster c en ter, defi n ed a s a ra n dom v a ria b le. W e u se th e su p ersc rip t (p) to in dic a te th e dim en sion of th e v ec tor. F u rth erm ore, let νj b e th e dista n c e b etw een th e jth a n d th m ix tu re c om p on en t c en ters: (p)  νj = lim p→  µj − µ √ p  ∞  (p)  2  ,  (6.121)  a n d let δij b e defi n ed in term s of th e c om p on en t p rop erties a s follow s:  δij =      τC2(i) − τj2 /nj  j = C (i)  2 τC2(i) + τj2 /nj + νj,C (i) j = C (i)  (6.122)  T h is fa c tor p la y s a k ey role in ou r a n a ly sis. L e m m a 6 .3 .1 . As p →  ∞ , (p)  Dij P √ −→ p  δij  (6.123)  (p) √ P ro o f. B y defi n ition of th e L2-n orm , w e c a n rew rite Dij / p:  (p)  Dij √ = p  p q=1  ¯ jq Xiq − X p  2  (6.124 )  N ow , b y defi n ition , w e k n ow th a t Xiq ∼ N µC (i),q , σC2 (i),q  (6.125 )  ¯ C (i),q ∼ N µC (i),q , σ 2 ⇒X C (i),q /nj .  (6.126)  14 3  N ote th a t b etw een v a lu es of q w e h a v e in dep en den c e. T h u s th e stron g la w of la rg e n u m b ers g iv es u s p q=1  ¯ jq Xiq − X p  2  p  −→P  lim  p→  ∞  ¯ jq E Xiq − X p  q=1  2  (6.127 )  W e w ill u ltim a tely ev a lu a te th is a s tw o c a ses, w h en xi is a m em b er of c lu ster j a n d w h en it is n ot; th e resu lt w ill b e diff eren t in th e tw o c a ses. N ow th e sq u a re root is a c on tin u ou s fu n c tion , so th e a b ov e im p lies th a t p q=1  ¯ jq Xiq − X p  2  p  −→P  lim  p→  ∞  q=1  ¯ jq E Xiq − X p  2  (6.128 )  N ow let (p)  (p)  Yi = (Yi1 , Yi2 , ..., Yip ) = Xi − µC (i)  (6.129)  ¯ −µ . Zj = (Zj 1 , Zj 2 , ..., Zj p ) = X j j  (6.130)  (p)  (p)  W e n ow n eed to w orry a b ou t tw o c a ses – c a se 1, w h en j = C (i), a n d c a se 2, w h en j = C (i). C a se 1 : j = C (i) (p) (p) ¯ (p) . T h u s S in c e j = C (i), Xi a n d µC (i) a re in dep en den t from X j ¯ jq E Xiq − X  2  = E µC (i),q + Yiq − µjq − Zjq  2  = µC (i),q − µjq  2  = µC (i),q − µjq  2  +  2  2 + σC2 (i),q + σjq /nj  = µC (i),q − µjq  (6.131)  + 2 µC (i),q − µjq ( E Yiq − E Zjq ) + E (Yiq − Zjq )2  E Yiq2  − 2 E Yiq E Zjq +  (6.132)  2 E Zjq  (6.133) (6.134 )  W e c a n u se th is to fi n d th e lim it of eq u a tion (6.124 ). (p)  Dij ⇒ √ p  P  = −→P =  1 p  p q=1  ¯ jq Xiq − X  σC2 (i) + σj2 /nj +  2  1 p  2 τC2(i) + τj2 /nj + νj,C (i)  14 4  (6.135 ) p q=1  µC (i),q − µjq  2  (6.136) (6.137 )  C a se 2 : j = C (i) (p) (p) N ow if j = C (i), th en Xi a n d µj a re n ot in dep en den t: ¯ jq E Xiq − X  2  1 nj  = E Xiq −  i ∈Cj  = E µjq + Yjq − = E Yjq − = E Yjq2 +  1 nj  1 nj  1 nj  Yi q  i ∈Cj  (6.138 )  µjq + Yi q  i ∈Cj  i ∈Cj  E  2  Xi q  2  2  (6.14 0) 2  Yi q  (6.139)  − 2 n1j E Yiq  i ∈Cj  Yi q  (6.14 1)  N ow Yiq a n d Yi q a re in dep en den t for i = i , a n d E Yiq = 0, so m a n y of th ese term s drop ou t: ¯ jq ⇒ E Xiq − X  2  = E Yjq2 + 2 σjq  2 = σjq +  =  nj  σC2 (i),q  −  1 nj  E  i ∈Cj 2 σjq  Yi2q − 2 E n1j Yiq2  − 2 nj  2 σjq /nj  (6.14 2) (6.14 3)  .  (6.14 4 )  T hus (p)  Di,C (i) P = √ p −→P  p q=1  1 p  ¯ C (i),q Xiq − X  2  τC2(i) − τj2 /nj .  (6.14 5 ) (6.14 6)  W e n ow h a v e th a t (p)  Dij P √ −→ p  δij  (6.14 7 )  w h ich c om p letes th e lem m a . T h e o re m 6 .3 .2 . S u p p o se (p)  Hj  (p)  = RK λ∈R  T h en , a s p → (p)  Hj  (p)  π(λ) I λj Dj + Gj  ≤λD  (p)  +G  (p)  ∀ = j dλ. (6.14 8 )  ∞ , −→P  RK λ∈R  π(λ) I{λj δj ≤ λ δ + νj ∀ = j}dλ 14 5  (6.14 9)  √ P ro o f. F rom lem m a 6.3.1, w e h a v e th a t Dij / p −→P follow s direc tly from lem m a 6.2.1, a n d α = 0.5 .  6.3.2  δij . T h e th eorem th en  An Impossib ility T heorem for Φ in High Dimensions  U sin g th ese lem m a s, w e c a n g et a n on -triv ia l im p ossib ility th eorem for ou r sta b ility m ea su re, essen tia lly sa y in g th a t u n less th e dista n c e b etw een c lu ster √ c en ters g row s a t a ra te Ω p a s th e dim en sion p in c rea ses, th e p a rtia l p oin t a ssig n m en t m a trix dec a y s to a n u n in form a tiv e u n iform . T h e o re m 6 .3 .3 . If (p)  µj − µ  (p)  2  √ ∈ o( p)  (6.15 0)  th e n (p)  −→P  Hij as p →  ∞  a n d nj →  P ro o f. Let p →  ∞  1 K  (6.15 1)  ∞ . fi rst. W e h a v e, b y defi n ition , th a t (p)  νj = lim p→  ∞  µj − µ √ p  (p)  2  = 0.  (6.15 2)  B y lem m a 6.3.1, (p)  Dij P √ −→ p  δij  (6.15 3)  w h ere δij =      τC2(i) − τj2 /nj τC2(i) + τj2 /nj  In th e la rg e sa m p le lim it, w h ere nj →  j = C (i) j = C (i)  .  (6.15 4 )  ∞ , th is b ec om es  (p)  Dij P √ −→ p  σ ¯C (i) = σ ¯  (6.15 5 )  w h ere for n ota tion a l c on v en ien c e w e drop th e i’s a s th ey a re c on sta n t th rou g h th e rest of th e p roof. N ow b y th eorem 6.3.2, w e h a v e th a t 14 6  (p)  Hij  −→P  RK λ∈R  π(λ) I{λj δj ≤ λ δ ∀ = j}dλ  = P(λj δj ≤ λ δ ∀ = j) = P(λj σ ¯≤λ σ ¯ ∀ = j)  = P(λj ≤ λ ∀ = j)  w h ich is ju st th e p rob a b ility th a t λj is th e low est in a set of K i.i.d. ra n dom v a ria b les. T h is p rob a b ility is 1/K.  6.4  P rior Behavior  N ow th a t w e h a v e th is resu lt, w e c a n m a k e som e u sefu l ob serv a tion s rela tin g to th e ty p e of p rior w e sh ou ld u se. T h e n ex t th eorem sh ow s th a t w e c a n desig n a p rior c a p a b le of m a k in g th e p a rtia l m em b ersh ip of th e in dic a tor fu n c tion sen sib le p rov ided th a t th e dista n c es th em selv es a re distin g u ish a b le. T h e o re m 6 .4 .1 . S u p p o se th a t dj < d ∀ = j. T h e n e x ists su c h th a t φi,C (i) − m a x φi, ≥ 1 − ε =C (i)  ∀ η > 0, a p rio r π (6.15 6)  fo r ε su ffi c ie n tly sm a ll. P ro o f. T h e p roof follow s direc tly from lem m a 3.6.1. Let π(λ) b e th e loc a tion ex p on en tia l distrib u tion . R ec a ll th a t ∀ ε ∈ (0, 1) a n d S ⊂ {1, ..., K}, if j ∈ / S a n d dk ≥ dj 1 − lo θg ε for a ll k ∈ S, th en k∈S φk (d, θ) ≤ ε. T h u s, if w e u sin g th e loc a tion ex p on en tia l p rior w ith th e slop e θ set so θ ≥ (− log ε)  1 dk dj  −1  ,  (6.15 7 )  w e h a v e th a t  =j  φ (d, θ) ≤ ε  If th is is th e c a se, w e k n ow th a t 14 7  (6.15 8 )  φi,C (i) − m a x φi, ≥ 1 − ε, =C (i)  w h ich c om p letes th e p roof.  14 8  (6.15 9)  C hapter 7  Ad d itional R elevant R esearch 7 .1  V alid ation Algorithms involving R eclu stering  In th is sec tion w e c on sider op tion s for effi c ien tly c a lc u la tin g eq u a tion (3.1) w h en th e p ertu rb a tion in trodu c ed req u ires th e da ta to b e rec lu stered. In th is c a se, w e p resen t a n a lg orith m th a t m a y en a b le su ch a c a lc u la tion to b e c om p u ta tion a lly fea sib le. O u r sta rtin g a ssu m p tion is th a t th e c om p u ta tion a l ex p en se of in itia lly c rea tin g a c lu sterin g solu tion g rea tly ou tw eig h s th e c om p u ta tion a l ex p en se of m odify in g a c lu sterin g w ith a c lose v a lu e of th e p ertu rb a tion h y p erp a ra m eter. W h ile th e m otiv a tin g c a se for th e a lg orith m w e p rop ose is from c lu sterin g , oth er fu n c tion s in th is c la ss a re ea sy to im a g in e. F or ex a m p le, m a n y c a n b e ev a lu a ted on ly th rou g h a n itera tiv e p roc edu re w h ere a n a p p rox im a te solu tion c a n q u ick ly b e refi n ed to on e w ith a desired lev el of a c c u ra c y . T h erefore, it m a k es sen se to look a t th e g en era l form u la tion of th e p rob lem in stea d of th e sp ec ifi c m otiv a tin g ex a m p le of c lu sterin g . T h e p rob lem of c a lc u la tin g eq u a tion (3.1) is essen tia lly th e p rob lem of c a lc u la tin g th e n orm a liz in g c on sta n t of a p osterior distrib u tion , w h ich w e form u la te a s Z=  s(θ)p(θ)dθ.  (7 .1)  T h e solu tion to th is p rob lem is a diffi c u lt p rob lem , esp ec ia lly in h ig h dim en sion s, a n d it h a s rec eiv ed som e a tten tion in th e M on te C a rlo litera tu re.  14 9  7 .1 .1  M onte C arlo Ab straction  As M on te C a rlo m eth ods a re a p p lied to in c rea sin g ly c om p lex a n d n on sta n da rd fu n c tion s, desig n in g M C a lg orith m s th a t ta k e th e c om p lex ities in to a c c ou n t a re in c rea sin g ly im p orta n t. Addition a lly , on e of th e m ore ch a llen g in g p rob lem s for w h ich M C is u sed is ev a lu a tin g n orm a liz in g c on sta n ts in h ig h -dim en sion a l sp a c e. N u m erou s tech n iq u es for doin g so h a v e b een p rop osed, in c lu din g k ern el den sity b a sed m eth ods, im p orta n c e sa m p lin g m eth ods, a n d u sin g th e h a rm on ic m ea n [D ou 07 ]. B y a n d la rg e, th ese m eth ods b rea k dow n in h ig h er dim en sion s, b u t b ridg e sa m p lin g , p a th sa m p lin g [G M 98 ], a n d lin k ed im p orta n c e sa m p lin g [N ea 05 ] h a v e sh ow n sig n ifi c a n t im p rov em en ts. E v en so, it is still a n op en p rob lem a n d w ill lik ely rem a in so for som e tim e. T h e a lg orith m w e p rop ose is a m odifi c a tion of p a th sa m p lin g . In sec tion sec tion 7 .1.2, w e rev iew p a th sa m p lin g a n d h ow th e m eth od c a n b e u sed to estim a te th e v a lu e of a n orm a liz in g c on sta n t. In sec tion 7 .1.2 w e desc rib e sw eep sa m p lin g a s a n ex ten sion to p a th sa m p lin g , p resen t c om p u ta tion a l resu lts in 7 .1.3, a n d disc u ss th e stren g th s a n d lim ita tion s of p a th sa m p lin g in 7 .1.4 .  7 .1 .2  E valu ating the N ormaliz ing C onstant  O f th e m a n y p rop osed M on te C a rlo tech n iq u es for ev a lu a tin g Z = γ(θ)dθ, p a th sa m p lin g [G M 98 ] rem a in s on e of th e m ost rob u st. M ost oth er m eth ods, in c lu din g k ern el-b a sed m eth ods, sim p le im p orta n c e sa m p lin g , or h a rm on ic m ea n tech n iq u es, b rea k dow n in h ig h dim en sion s; often th e v a ria n c e of th e resu lt in c rea ses ex p on en tia lly w ith th e n u m b er of dim en sion s. P a th S a m p lin g T h e idea b eh in d p a th sa m p lin g is to sta rt from a distrib u tion p0 (θ), from w h ich w e c a n rea dily sa m p le a n d for w h ich w e k n ow th e n orm a liz in g c on sta n t Z0 . W e th en c on stru c t a sm ooth p a th from p0 (θ) to th e ta rg et distrib u tion p1 (θ), req u irin g th e p rob a b ility to b e n on z ero for a ll θ ∈ Θ a n d α ∈ [αs tart , αe nd ]. F or ex a m p le, th e g eom etric p a th – w h ich w e em p loy la ter in th is p a p er – is γ(θ|α) = p0 (θ)1−α p1 (θ)α  (7 .2)  w h ere w e in dex th e p a th b y α ∈ [0, 1] a n d den ote th e (u n n orm a liz ed) distrib u tion a lon g th e p a th b y γ(θ|α). G iv en su ch a p a th , th e log ra tio of th e tw o n orm a liz in g c on sta n ts is g iv en b y th e p a th sa m p lin g iden tity : 15 0  λ = log  Z(α = αe nd ) = Z(α = αs tart )  αe n  d  αs t a rt αe n d  Θ  d log γ(θ|α) π(θ|α)dθ dα dα  G(α)dα  =  (7 .3)  αs t a rt  w h ere π(θ|α) = γ(θ|α)/Z(α) is th e n orm a liz ed v ersion of γ(θ|α), i.e. π(θ|α)dθ = 1.  (7 .4 )  N ote th a t th e g eom etric p a th a s desc rib ed in eq u a tion (7 .2) u ses αs tart = 0 a n d αe nd = 1. T o estim a te (7 .3) u sin g M on te C a rlo, w e fi rst b rea k th e in teg ra l ov er α in to a set L disc rete v a lu es: A = {αs tart = α(1) < α(2) < ... < α(L) = αe nd .}  (7 .5 )  F or m ost a p p lic a tion s, a u n iform ly sp a c ed g rid is su ffi c ien t, th ou g h [G M 98 ] in v estig a tes sa m p lin g th e α(j) from a n op tim a l p rior distrib u tion p(α). S a m p lin g G iv en a n α(j) ∈ A, w e c a n estim a te G(α(j) ) in eq u a tion (7 .3) b y ob ta in in g N i.i.d. sa m p les from π(θ|α = α(j) ) a n d ev a lu a tin g G(α(j) ) =  1 N  N i=1  d log γ(θ(i) |α) dα  (7 .6) α=α(j)  In g en era l, w e c a n on ly sa m p le from p(θ|α = αs tart ), b u t w e c a n ob ta in i.i.d. sa m p les for p(θ|α(j) ) b y c on stru c tin g a n M C M C ch a in from γ(θ|α = αs tart ) to γ(θ|α = αe nd ) th a t redistrib u tes th e sa m p les from p(θ|α(j−1) ) a c c ordin g to p(θ|α(j) ) u sin g a M etrop olis-H a stin g s ra n dom w a lk [AdF D J 03]. ˆ b y p erform in g n u m eric a l in teg ra tion W e c a n th en ob ta in a n estim a te of λ (1) (2) on th e seq u en c e G(α ), G(α ), ..., G(α(L) ). N u m erou s p ossib le m eth ods ex ist for doin g th is, e.g . tra p ez oida l or c u b ic sp lin e in teg ra tion . S w e e p S a m p lin g H ere w e p rop ose h ere a n ex ten sion to p a th sa m p lin g th a t a llow s for c om p u ta tion a lly ex p en siv e distrib u tion s. W e c a n div ide th e reg ion Θ in to R distin c t su b reg ion s Θ1 , Θ2 , ..., ΘR , a llow in g u s w rite eq u a tion (7 .3) a s: 15 1  αe n  d  λ= r  αs t a rt  Θr  d log γ(θ|α) π(θ|α)dθdα. dα  (7 .7 )  w h ere n ota tion a lly , r in dex es th e reg ion s. F u rth erm ore, let ρ(θ) den ote th e in dex of th e reg ion θ is in (i.e. ρ(θ) = r ⇔ θ ∈ Θr ). In form a lly , th e idea b eh in d sw eep sa m p lin g is to h old th e reg ion s Θr a t th e p rop osa l distrib u tion u n til th e ta rg et distrib u tion b ec om es c om p u ta tion a lly fea sib le. In itia lly , a t α = αs tart , w e defi n e Θ1 a s a g iv en c ollec tion of on e or m ore seed p oin ts w h ere s(θ) h a s b een solv ed. As soon a s w e defi n e Θr , w e sta rt th e reg ion on a p a th to th e ta rg et distrib u tion . W e store a solu tion to s(θ) for θ ∈ Θr , w h ich a llow s s(θ) to b e c om p u ta tion a lly fea sib le for θ ∈ Θr+ 1 . T h is “ dom in o eff ec t” c a u ses th e tra n sition reg ion to sw eep th rou g h Θ, h en c e th e n a m e. F orm a lly w e c a n th u s defi n e ou r term s (ta b le 7 .1): P(α): Q(α): T (α): tr : αr : C(α):  cr :  T h e set of a ll reg ion s still a t th e p rop osa l distrib u tion (i.e. θ ∈ P(α) ⇒ γ(θ|α) ∝ p(θ)). T h e set of a ll reg ion s c u rren tly in tra n sition from th e p rior to th e ta rg et. T h e set of a ll reg ion s th a t h a v e a rriv ed a t th e ta rg et distrib u tion (i.e. θ ∈ T (α) ⇒ γ(θ|α) ∝ s(θ)p(θ)). T h e v a lu e of α a t w h ich reg ion r b eg in s tra n sition in g from th e p rop osa l to th e ta rg et. T h e loc a l v a lu e of a lp h a , α − tr . A g lob a l c on sta n t fa c tor a tta ch ed to th e p rop osa l, a dju sted to reg u la te th e fl ow of p a rtic les a c ross th e tra n sition reg ion (disc u ssed in sec tion 7 .1.2). T h e v a lu e of C(α) w h en reg ion r b eg in s tra n sition in g (i.e. cr = C(tr )). T a b le 7 .1: T h e term s u sed in th is ch a p ter.  W ith ou t loss of g en era lity , w e a ssu m e a p a th len g th of 1 for ea ch in div idu a l reg ion ; in oth er w ords, γ(θ ∈ Θr |α = tr ) = cr p(θ) a n d γ(θ ∈ Θr |α = tr + 1) = s(θ)p(θ). Allow in g ea ch reg ion to follow a g eom etric p a th from p rop osa l to ta rg et, w e c a n defi n e γ(θ|α) ∝ π(θ|α) a s  θ ∈ P(α)  C(α)p(θ) 1−α α r r p(θ) θ ∈ Θr ⊆ Q(α) s(θ) cr (7 .8 ) γ(θ|α) =  s(θ)p(θ) θ ∈ T (α) 15 2  B y b rea k in g Θ in to th e p rop osa l, tra n sition a n d p ost-tra n sition reg ion s a s a n d slidin g th e in terv a l of th e tra n sition reg ion to [0, 1], w e c a n ex p ress (7 .3) a s 1  λ= r  w h ere  d da  tr  Hr (α)dα  Gr (αr )dαr +  (7 .9)  0  0  log γ(θ|α) = 0 in th e p ost-tra n sition reg ion a n d  log  Gr (αr ) = Θr  Hr (α) = Θr  s(θ) π(θ|α = αr + tr )dθ cr  d log C(α) π(θ|α)dθ dα  (7 .10) (7 .11)  W e c a n ob ta in a M on te C a rlo estim a te of (7 .9) a s desc rib ed in sec tion (7 .1.2) u sin g th e sa m e tech n iq u e a s for reg u la r p a th sa m p lin g . W e c on stru c t th e seq u en c e A = {α(1) , α(2) , ..., α(L) } a s in eq u a tion (7 .5 ) on th e in terv a l [αs tart , αe nd ] = [0, 1 + m a x r tr ]. At ea ch v a lu e of α(j) , w e c a n estim a te (7 .10) a n d (7 .11) u sin g M C :  Gr (αr ) =  1 N  log i:θ (i) ∈Θ  s(θ(i) ) cr  (7 .12)  r  Nr d log C(a) Hr (α) = N da  (7 .13) a=α  w h ere Nr den otes th e n u m b er of sa m p les from Θr . W e c a n th en ev a lu a te λ b y su m m in g ov er th e reg ion s for ea ch α(j) a n d n u m eric a lly in teg ra tin g th e resu ltin g seq u en c e. D e fi n in g th e S u b re g io n s B ec a u se w e do n ot k n ow b eforeh a n d th e p rec ise b eh a v ior of th e c ost fu n c tion a n d/ or th e g eom etry of th e ta rg et distrib u tion , w e c a n n ot, in g en era l, determ in e th e reg ion s Θ1 , ..., ΘR b eforeh a n d. If w e a ssu m e th a t th e c ost of m ov in g to c loser p oin ts is u su a lly less th a n or eq u a l to th e c ost of m ov in g to fu rth er p oin ts, th en w e c a n defi n e ea ch reg ion in term s of a set ϑr of fi x ed p oin ts. In form a lly , a p oin t θ is in reg ion r if it is c loser to a t lea st on e p oin t in ϑr th a n th ose of a n y oth er reg ion . F orm a lly , 15 3  θ ∈ Θr ⇔  a rg m in φ∈ϑ1 ∪ϑ2 ∪...  d(θ, φ) ∈ ϑr  (7 .14 )  w h ere d(·, ·) is th e dista n c e m etric . T h is w ill a llow u s to u se M on te C a rlo sa m p les to defi n e n ew reg ion s. S p ec ifi c a lly , ϑr c a n b e a n on -em p ty set su ch th a t ϑr ⊆ {θ(i) : θ(i) ∈ P(α), c ost(θ(i) , φ ∈ ϑ1:r−1 ) ≤ C M a x }  (7 .15 )  w h ere th e u ser-sp ec ifi ed p a ra m eter C M a x sets th e m a x im u m c ost th e a lg orith m w ill c on sider b etw een a c u rren t a n d a n ew ev a lu a tion of s(θ), θ(i) is a n ex istin g M C p a rtic le, a n d ϑ1:r is sh orth a n d for ϑ1 ∪ ϑ2 ∪ ... ∪ ϑr . In ou r a lg orith m , w e defi n ed a n ew reg ion a t ea ch α(j) u sin g a n y p a rtic les n ot p rev iou sly rea ch a b le. S p ec ifi c a lly ,  ϑr+  1  =  θ(i) : θ(i) ∈ P(α(j) ),  ∀φ ∈ ϑ1:r−1 ∃φ ∈ ϑ1:r  c ost(θ(i) , φ) > C M a x c ost(θ(i) , φ) ≤ C M a x  .  (7 .16) If n o su ch p a rtic les ex ist, w e w a it u n til som e p a rtic les sa tisfy (7 .16) to defi n e a n ew reg ion . F u rth erm ore, b ec a u se w e u se a n ea rest n eig h b or sea rch to determ in e w h ich reg ion a p a rtic le is in , storin g a solu tion to s(θ) for ea ch p oin t in ϑr m a k es it effi c ien t to c a lc u la te s(θ) for a n y M C p a rtic les in s(θ). C a lc u la tin g c (α) W h ile th e v a n illa v ersion a b ov e m eth od w ill w ork fi n e if s(θ)π(θ) is rou g h ly th e sa m e m a g n itu de a s π(θ), p rob lem s m a y a rise if th e ta rg et distrib u tion is sev era l orders of m a g n itu de diff eren t th en th e p rop osa l. W h en th is h a p p en s, a disp rop ortion a l n u m b er of sa m p les w ill eith er lea v e or rem a in in P(α). W e th u s p rop ose th e follow in g a da p tiv e m eth od a s a n a ttem p t to solv e th is p rob lem . W e c a n ch ose C(α), th e sc a lin g fa c tor of th e p re-tra n sition reg ion to ev ery th in g else, a rb itra rily . W e th u s c on stru c t a fu n c tion th a t a ttem p ts to h old th e a v era g e p rob a b ility in th e p re-tra n sition reg ion a t rou g h ly th e sa m e v a lu e a s th a t in th e p ost-tra n sition reg ion s. S p ec ifi c a lly , th is m ea n s: 1 C(α) |P(α)|  p(θ)dθ P(α)  15 4  1 |T (α)|  s(θ)p(θ)dθ T (α)  (7 .17 )  w h ere |A| = A dθ den otes siz e. Assu m in g P(α) = ∅ a n d T (α) = ∅, th en ˜ = a) from (7 .17 ) a s follow s: w e c a n c on stru c t a n on lin e estim a te C(α  ˜ = a) = log log C(α = log  |T (a)|  P(a) γ(θ|α  |P(a)|  = a)dθ  (7 .18 )  T (a) p(θ)dθ  P(a) γ(θ|α  = a)dθ  P(a) γ(θ|α  = 0)dθ  + log  |T (a)| |P(a)|  P(a) p(θ)dθ T (a) p(θ)dθ  = λ(a) + ω(a)  (7 .19) (7 .20)  w h ere |T (a)| |P(a)|  ω(a) = log  P(a) p(θ)dθ T (a) p(θ)dθ  (7 .21)  a n d λ(a) is ju st th e log ra tio of th e n orm a liz in g c on sta n ts of th e p osttra n sition reg ion . R estric tin g eq u a tion (7 .9) to T (a), w e h a v e 1  tr  Hr (α)dα.  Gr (αr )dα +  λ(a) =  (7 .22)  0  r:Θr ⊆T (a) 0  N ote th a t in (7 .22) tr ≤ a−1 a s a ll reg ion s in T h a v e rea ch ed th e ta rg et. W e c a n th u s estim a te u sin g v a lu es of Gr (αr ) a n d Hr (α) a lrea dy a v a ila b le. F u rth erm ore, w e c a n estim a te ω(α) b y sa m p lin g N {U } p a rtic les from U n(()Θ) a n d N {P r} p a rtic les from P r(Θ). An M C estim a te for ω(a) is th en :  ω(a) = log  {U }  {P r}  {U }  {P r}  NP(a) + 1 NT (a) + 1 NT (a) + 1 NP(a) + 1  (7 .23)  w h ere w e a dd 1 to ea ch term to m a k e th e ra tio w ell defi n ed if on e q u a n tity g oes to z ero. ˆ or λa, w e c a n u se sim p le n u m eric a l diff erIn estim a tin g b oth th e fi n a l λ d C(α). en tia tion on th e seq u en c e A to ev a lu a te dα  15 5  Dim 2 3 5 8 10 15 20 30  N 400 600 10 0 0 16 0 0 20 0 0 30 0 0 6000 9000  C M ax 0 .1 0 .15 0 .25 0 .4 0 .5 0 .75 2 3  T ru e λ -1.4 7 9 4 -2.219 1 -3.6 9 86 -5.9 177 -7.39 72 -11.0 9 57 -14 .79 4 3 -22.19 15  λ fr o m P S -1.5711 -2.330 1 -4 .0 14 9 -6 .6 576 -8.50 57 -13.0 9 9 3 -17.6 714 -26 .714 3  S S λ (b e st) -1.4 9 27 -2.26 9 9 -3.886 -6 .3333 -7.9 26 -12.24 28 -17.30 6 9 -25.59 13  S S λ (a v g ) -1.5182 -2.29 53 -3.9 50 2 -7.14 53 -10 .3783 -13.514 -17.329 6 -25.6 584  S S λ (w o r st) -1.54 0 9 -2.310 9 -3.9 86 5 -9 .4 6 6 3 -12.9 788 -17.224 8 -17.3556 -25.7884  Table 7.2: The results for the target distribution with C(α) = 1. Dim 2 3 5 8 10 15 20 30  N 400 600 10 0 0 16 0 0 20 0 0 30 0 0 6000 9000  C M ax 0 .1 0 .15 0 .25 0 .4 0 .5 0 .75 2 3  T ru e λ -1.4 79 4 -2.219 1 -3.6 9 86 -5.9 177 -7.39 72 -11.0 9 57 -14 .79 4 3 -22.19 15  λ fr o m P S -1.5711 -2.330 1 -4 .0 14 9 -6 .6 576 -8.4 789 -13.0 6 13 -17.6 515 -26 .7133  S S λ (b e st) -1.4 9 27 -2.26 9 9 -3.6 0 2 -5.7327 -7.4 20 3 -10 .0 86 5 -16 .80 2 -25.4 50 6  S S λ (a v g ) -1.50 28 -2.29 53 -3.9 118 -6 .4 6 6 -7.74 9 -10 .134 8 -17.0 0 37 -25.6 6 9 2  S S λ (w o r st) -1.36 55 -2.310 9 -3.9 74 7 -7.722 -8.0 9 6 1 -23.854 6 -17.29 6 8 -25.89 25  Table 7.3 : The results for the target distribution with C(α) dynam ic ally up dated.  7.1.3  Computational Results  To test the ac c urac y of sweep sam p ling, we started with a uniform distribution as the p rop osal and used a (transition) p ath length of 5 0 (we found that results c om p ared sim ilarly at other p ath lengths). W e generated a target distribution by starting with a sp heric al m ultiv ariate norm al distribution c entered at the origin and with σ 2 = 0 .5 . W e then added 10 0 n-dim ensional hyp erc ubes with side-length 0 .25 random ly to the distribution. This allowed us to choose the dim ension and struc ture of the target distribution while still allowing us to c om p ute s(θ)p(θ)dθ ex ac tly. W e c om p ared p ath sam p ling with the geom etric p ath (eq uation 7.2) with sweep sam p ling at a num ber of dim ensions and found that in m any c ases sweep sam p ling gav e c om p arably ac c urate results. B ec ause M A TL A B did not hav e the p rop er data struc tures (such as dynam ic k d-trees) for p arts of the sweep sam p ling algorithm , we c ould not ac c urately c om p are ex ec ution tim es. W e defi ned the c ost sim p ly as the sq uared L 2 distanc e. W here we found sweep sam p ling tended to break down was in how we defi ned new regions; as the dim ension of the sp ac e inc reased, the num ber of total p artic les needs to inc rease ex p onentially to achiev e a c om p arable rate of growth. W e off set this by inc reasing the m ax im um c ost, but it rem ains an issue and area for further study. V arious results are shown in 7.3 . The results for p ath sam p ling are the av erage of two c ases, as rep eated results were v ery sim ilar. W e ran the sweep sam p ling tests 10 tim es for 2-15 dim ensions and 4 tim es for 20 and 3 0 15 6  dim ensions. W e threw out obv ious outliers (λ sev eral orders of m agnitude or m ore away from the m edian) and tabulated the results. In general, up dating the c onstant dynam ic ally im p rov es the results and yielded the m ost ac c urate results. The bias toward sm aller v alues of the norm aliz ing c onstant is lik ely due to the sharp “ edges” in the target distribution, which p ose a challenge for any m ethod and req uire signifi c antly m ore sam p les to estim ate. The p urp ose of this sim ulation was to p ush the m ethods to their lim its for the p urp ose of c om p arison; we susp ec t that a real distribution will be better behav ed.  7.1.4  D isc ussion  This m ethod dem onstrates that sweep sam p ling c an achiev e fav orable results while tak ing into ac c ount one typ e of c om p utationally diffi c ult distribution. H owev er, there are still sev eral areas for p ossible im p rov em ent. F or one, the estim ate of C is bound to be highly noisy. O ne p ossible way of im p rov ing the algorithm , then, is to treat the estim ation of C(α(j) ) as a ˜ (j) ) is a noisy inp ut. This p roblem tim e-series c ontrol p roblem in which C(α is well studied, and c onstruc ting a K alm an fi lter [W B 0 1], unsc ented K alm an fi lter [W V D M 0 0 ], or p artic le fi lter c ould be a fruitful area for further study.  15 7  B ib liog raph y [A D D 9 9 ] R .B . A sh and C .A . D oleans-D ade. Probability and Measure Theory. A c adem ic P ress, 19 9 9 . [A dF D J 0 3 ] C . A ndrieu, N . de F reitas, A . D ouc et, and M .I. J ordan. A n Introduc tion to M C M C for M achine L earning. Machine Learning, 5 0 (1):5 – 4 3 , 20 0 3 . [A L A  +0 3 ]  O . A bul, A . L o, R . A lhajj, F . P olat, and K . B ark er. C luster v alidity analysis using subsam p ling. Systems, Man and Cybernetics, 2003. IEEE International Conference on, 2, 20 0 3 .  [A V 0 7] D . A rthur and S . V assilv itsk ii. k -m eans+ + : the adv antages of c areful seeding. Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms, p ages 10 27– 10 3 5 , 20 0 7. [A W R 9 6 ] G .B . A rfk en, H .J . W eber, and L . R uby. M athem atic al M ethods for P hysic ists. American Journal of Physics, 6 4 :9 5 9 , 19 9 6 . [B B 9 9 ] A . B araldi and P . B londa. A surv ey of fuz z y c lustering algorithm s for p attern rec ognition. II. IEEE Transactions on Systems, Man and Cybernetics, Part B, 29 (6 ):78 6 – 8 0 1, 19 9 9 . [B D v L P 0 6 ] S . B en-D av id, U . v on L ux burg, and D . P al. A sober look at c lustering stability. Proceedings of the 19th Annual Conference on Learning Theory (COLT), p ages 5 – 19 , 20 0 6 . [B er0 2] P . B erk hin. S urv ey of c lustering data m ining techniq ues. 20 0 2. [B H E G 0 2] A . B en-H ur, A . E lisseeff , and I. G uyon. A stability based m ethod for disc ov ering struc ture in c lustered data. Pacific Symposium on Biocomputing, 7:6 – 17, 20 0 2.  15 8  [B ol9 8 ] D . B oley. P rinc ip al D irec tion D iv isiv e P artitioning. Data Mining and K now ledge Discov ery, 2(4 ):3 25 – 3 4 4 , 19 9 8 . [B re8 9 ] J am es N . B reck enridge. R ep lic ating c luster analysis: m ethod, c onsistenc y, and v alidity. Multiv ariate Behav ioral R esearch, 24 (2):14 7– 16 1, 19 8 9 . [B re0 0 ] J .N . B reck enridge. V alidating c luster analysis: c onsistent rep lic ation and sym m etry. Multiv ariate Behav ioral R esearch, 3 5 (2):26 1– 28 5 , 20 0 0 . [C hv 8 3 ] V aek C hv tal. Linear Programming. W . H . F reem an and C om p any, N ew Y ork , 19 8 3 . [D F 0 2] S . D udoit and J . F ridlyand. A p redic tion-based resam p ling m ethod for estim ating the num ber of c lusters in a dataset. G enome Biology, 3 (7):1– 21, 20 0 2. [D H Z + 0 1] C hris H . Q . D ing, X iaofeng H e, H ongyuan Z ha, M ing G u, and H orst D . S im on. A m in-m ax c ut algorithm for grap h p artitioning and data c lustering. In Proceedings of ICDM 2001, p ages 10 7– 114 , 20 0 1. [D ou0 7] A rnaud D ouc et. C om p uting norm aliz ing c onstants. A p ril 20 0 7. [F an9 7] J . F an. C om m ents on W av elets in statistic s: A rev iew, by A . A ntoniadis. J. Italian Statist. Soc, 6 :13 1– 13 8 , 19 9 7. [F D 0 1] J . F ridlyand and S . D udoit. A p p lic ations of resam p ling m ethods to estim ate the num ber of c lusters and to im p rov e the ac c urac y of a c lustering m ethod. Technic al rep ort, D iv ision of B iostattistic s, U niv ersity of C alifornia, B erk ley., 20 0 1. [F L 0 1] J . F an and R . L i. V ariable S elec tion V ia N onc onc av e P enaliz ed L ik elihood and Its O rac le P rop erties. Journal of the American Statistical Association, 9 6 (4 5 6 ):13 4 8 – 13 6 1, 20 0 1. [F TT0 4 ] G .W . F lak e, R .E . Tarjan, and K . Tsioutsioulik lis. G rap h c lustering and m inim um c ut trees. Internet Mathematics, 1(4 ):3 8 5 – 4 0 8 , 20 0 4 . [G G G 0 2] J . G oldberger, H . G reensp an, and S . G ordon. U nsup erv ised im age c lustering using the inform ation bottleneck m ethod. Proc. DAG M, 20 0 2. 15 9  [G M 9 8 ] A . G elm an and X .L . M eng. S im ulating norm aliz ing c onstants: from im p ortanc e sam p ling to bridge sam p ling to p ath sam p ling. Statist. Sci, 13 (2):16 3 – 18 5 , 19 9 8 . [G T0 4 ] C .D . G iurc aneanu and I. Tabus. C luster S truc ture Inferenc e B ased on C lustering S tability with A p p lic ations to M ic roarray D ata A nalysis. EU R ASIP Journal on Applied Signal Processing, 20 0 4 (1):6 4 – 8 0 , 20 0 4 . [H A 8 5 ] L . H ubert and P . A rabie. C om p aring p artitions. Journal of Classification, 2(1):19 3 – 218 , 19 8 5 . [H A K 0 0 ] A lex ander H inneburg, C haru C . A ggarwal, and D aniel A . K eim . W hat is the nearest neighbor in high dim ensional sp ac es? In The V LDB Journal, p ages 5 0 6 – 5 15 , 20 0 0 . [H en0 4 ] C . H ennig. A general robustness and stability theory for c luster analysis. 0 7 20 0 4 . [H G 0 5 ] K .A . H eller and Z . G hahram ani. B ayesian hierarchic al c lustering. In Proceedings of the 22nd international conference on Machine learning, p ages 29 7– 3 0 4 . A C M P ress N ew Y ork , N Y , U S A , 20 0 5 . [H M N 0 5 ] P . H all, J S M arron, and A . N eem an. G eom etric rep resentation of high dim ension, low sam p le siz e data. Journal of the R oyal Statistical Society series B, 6 7:4 27– 4 4 4 , 20 0 5 . [H M S 0 2] A . H otho, A . M aedche, and S . S taab. Tex t c lustering based on good aggregations. K u ¨ nstliche Intelligenz (K I), 16 (4 ), 20 0 2. [IT9 5 ] M ak oto Iwayam a and Tak enobu Tok unaga. H ierarchic al B ayesian c lustering for autom atic tex t c lassifi c ation. In C hris E . M ellish, editor, Proceedings of IJCAI-95 , 14 th International Joint Conference on Artificial Intelligence, p ages 13 22– 13 27, M ontreal, C A , 19 9 5 . M organ K aufm ann P ublishers, S an F ranc isc o, U S . [J M F 9 9 ] A K J ain, M N M urty, and P J F lynn. D ata c lustering: a rev iew. ACM Computing Surv eys (CSU R ), 3 1(3 ):26 4 – 3 23 , 19 9 9 . [J TZ 0 4 ] D . J iang, C . Tang, and A . Z hang. C luster analysis for gene ex p ression data: a surv ey. K now ledge and Data Engineering, IEEE Transactions on, 16 (11):13 70 – 13 8 6 , 20 0 4 . 16 0  [K uh5 5 ] H W K uhn. The H ungarian M ethod for the A ssignm ent A lgorithm . N av al R esearch Logistics Q uarterly, 1(1/ 2):8 3 – 9 7, 19 5 5 . [K V V 0 4 ] R . K A N N A N , S . V E M P A L A , and A . V E TTA . O n C lusterings: G ood, B ad and S p ec tral. Journal of the ACM, 5 1(3 ):4 9 7– 5 15 , 20 0 4 . [L B R B 0 2] T. L ange, M . B raun, V . R oth, and J .M . B uhm ann. S tabilitybased m odel selec tion. Adv ances in N eural Information Processing Systems, 15 , 20 0 2. [L R B B 0 4 ] T. L ange, V . R oth, M .L . B raun, and J .M . B uhm ann. S tabilitybased v alidation of c lustering solutions. N eural Computation, 16 (6 ):129 9 – 13 23 , 20 0 4 . [M B 0 2] U . M aulik and S . B andyop adhyay. P erform anc e ev aluation of som e c lustering algorithm s and v alidity indic es. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (12):16 5 0 – 16 5 4 , 20 0 2. [M C 8 5 ] G W M IL L IG A N and M C C O O P E R . A n ex am ination of p roc edures for determ ining the num ber of c lusters in a data set. Psychometrik a, 5 0 (2):15 9 – 179 , 19 8 5 . [M ei0 3 ] M . M eila. C om p aring c lusterings. Proceedings of the Conference on Computational Learning Theory (COLT), 20 0 3 . [M ei0 7] M . M eila. C om p aring c lusterings: an inform ation based distanc e. Journal of Multiv ariate Analysis, 9 8 :8 73 – 8 9 5 , 20 0 7. [M P 0 7] M arina M eila and W illiam P entney. C lustering by weighted c uts in direc ted grap hs. In Proceedings of the 2007 SIAM International Conference on Data Mining, 20 0 7. [M R 0 6 ] U . M oller and D . R adk e. A C luster V alidity A p p roach based on N earest-N eighbor R esam p ling. Proceedings of the 18 th International Conference on Pattern R ecognition (ICPR ’06 )-V olume 01, p ages 8 9 2– 8 9 5 , 20 0 6 . [N ea0 5 ] R . M . N eal. E stim ating R atios of N orm aliz ing C onstants U sing L ink ed Im p ortanc e S am p ling. ArX iv Mathematics e-prints, N ov em ber 20 0 5 .  16 1  [N J W 0 2] A . N g, M . J ordan, and Y . W eiss. O n sp ec tral c lustering: A nalysis and an algorithm . Adv ances in N eural Information Processing Systems 14 : Proceedings of the 2002 [sic] Conference, 20 0 2. [P Z ] Y . P ei and O . Z aıane. A S ynthetic D ata G enerator for C lustering and O utlier A nalysis. Technic al rep ort, Technic al rep ort, TR 0 6 -15 , D ep artm ent of c om p uting S c ienc e, U niv ersity of A lberta, 20 0 6 . [R L B B 0 2] V . R oth, T. L ange, M . B raun, and J . B uhm ann. A resam p ling ap p roach to c luster v alidation. Statistics– COMPSTAT, p ages 123 – 128 , 20 0 2. [R W 79 ] R .H . R andles and D .A . W olfe. Introduction to the Theory of N onparametric Statistics. K rieger P ublishing C om p any, 19 79 . [S B 9 9 ] K . S toff el and A . B elk oniene. P arallel k / h-M eans C lustering for L arge D ata S ets. Proceedings of the 5 th International Euro-Par Conference on Parallel Processing, p ages 14 5 1– 14 5 4 , 19 9 9 . [S G 0 3 ] M . S m olk in and D . G hosh. C luster stability sc ores for m ic roarray data in c anc er studies. BMC Bioinformatics, 4 :3 6 , 20 0 3 . [S K K 0 0 ] M . S teinbach, G . K aryp is, and V . K um ar. A c om p arison of doc um ent c lustering techniq ues. K DD W ork shop on Tex t Mining, 3 4 :3 5 , 20 0 0 . [S lo0 2] N . S lonim . The Inform ation B ottleneck : Theory and A p p lic ations. U npublished doctoral dissertation, H ebrew U niv ersity, Jerusalem, Israel, 20 0 2. [S T0 0 a] N . S lonim and N . Tishby. A gglom erativ e inform ation bottleneck . Adv ances in N eural Information Processing Systems, 12:6 17– 23 , 20 0 0 . [S T0 0 b] N . S lonim and N . Tishby. D oc um ent c lustering using word c lusters v ia the inform ation bottleneck m ethod. Proceedings of the 23rd annual international ACM SIG IR conference on R esearch and dev elopment in information retriev al, p ages 20 8 – 215 , 20 0 0 . [S T0 7] O . S ham ir and N . Tishby. C luster S tability for F inite S am p les. Adv ances in N eural Information Processing Systems, 20 0 7. 16 2  [S te0 4 ] D . S teinley. P rop erties of the hubert-arabie adjusted rand index . Psychol Methods, 9 (3 ):3 8 6 – 9 6 , 20 0 4 . [S te0 6 ] D . S teinley. K -m eans c lustering: A half-c entury synthesis. British Journal of Mathematical and Statistical Psychology, 5 9 (1):1– 3 4 , 20 0 6 . [TP B 0 0 ] N . Tishby, F .C . P ereira, and W . B ialek . The inform ation bottleneck m ethod. Arx iv preprint physics/ 0004 05 7 , 20 0 0 . [TS 0 0 ] N . Tishby and N . S lonim . D ata c lustering by M ark ov ian relax ation and the inform ation bottleneck m ethod. Adv ances in N eural Information Processing Systems, 13 :6 4 0 – 6 4 6 , 20 0 0 . [TW 0 5 ] R . Tibshirani and G . W alther. C luster V alidation by P redic tion S trength. Journal of Computational & G raphical Statistics, 14 (3 ):5 11– 5 28 , 20 0 5 . [TW H 0 1] R . Tibshirani, G . W alther, and T. H astie. E stim ating the num ber of c lusters in a data set v ia the gap statistic . Journal of the R oyal Statistical Society. Series B (Statistical Methodology), 6 3 (2):4 11– 4 23 , 20 0 1. [V M 0 1] D . V erm a and M . M eila. A c om p arison of sp ec tral c lustering algorithm s. U niv ersity of W ashington Computer Science & Engineering, Technical R eport, p ages 1– 18 , 20 0 1. [W B 0 1] G . W elch and G . B ishop . A n Introduc tion to the K alm an F ilter. ACM SIG G R APH 2001 Course N otes, 20 0 1. [W V D M 0 0 ] E A W an and R . V an D er M erwe. The unsc ented K alm an fi lter for nonlinear estim ation. Adaptiv e Systems for Signal Processing, Communications, and Control Symposium 2000. ASSPCC. The IEEE 2000, p ages 15 3 – 15 8 , 20 0 0 . [X W 0 5 ] R . X u and D . W unsch. S urv ey of c lustering algorithm s. IEEE Transactions on N eural N etw ork s, 16 (3 ):6 4 5 – 6 78 , 20 0 5 . [Y R 0 1] K .Y . Y eung and W .L . R uz z o. D etails of the A djusted R and index and C lustering algorithm s S up p lem ent to the p ap er A n em p iric al study on P rinc ip al C om p onent A nalysis for c lustering gene ex p ression data(to ap p ear in B ioinform atic s). Bioinformatics, 17(9 ):76 3 – 774 , 20 0 1. 16 3  [Z K 0 2] Y . Z hao and G . K aryp is. C om p arison of agglom erativ e and p artitional doc um ent c lustering algorithm s. SIAM (2002) w ork shop on Clustering H igh-dimentional Data and Its Applications, p ages 0 2– 0 14 , 20 0 2.  16 4  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0051357/manifest

Comment

Related Items